GPT-5.2

(openai.com)

1053 points atgctg | 1 comments | 11 Dec 25 18:04 UTC

https://platform.openai.com/docs/guides/latest-model

System card: https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944...

svara ◴[12 Dec 25 08:08 UTC] No.46241936[source]▶

In my experience, the best models are already nearly as good as you can be for a large fraction of what I personally use them for, which is basically as a more efficient search engine.

The thing that would now make the biggest difference isn't "more intelligence", whatever that might mean, but better grounding.

It's still a big issue that the models will make up plausible sounding but wrong or misleading explanations for things, and verifying their claims ends up taking time. And if it's a topic you don't care about enough, you might just end up misinformed.

I think Google/Gemini realize this, since their "verify" feature is designed to address exactly this. Unfortunately it hasn't worked very well for me so far.

But to me it's very clear that the product that gets this right will be the one I use.

replies(12): >>46241987 #>>46242107 #>>46242173 #>>46242280 #>>46242317 #>>46242483 #>>46242537 #>>46242589 #>>46243494 #>>46243567 #>>46243680 #>>46244002 #

stacktrace ◴[12 Dec 25 08:51 UTC] No.46242173[source]▶

>>46241936 #

> It's still a big issue that the models will make up plausible sounding but wrong or misleading explanations for things, and verifying their claims ends up taking time. And if it's a topic you don't care about enough, you might just end up misinformed.

Exactly! One important thing LLMs have made me realise deeply is that "no information" is better than false information. The way LLMs pull out completely incorrect explanations baffles me - I suppose that's expected, since in the end it's generating tokens based on its training and it's reasonable that it might hallucinate some stuff, but knowing this doesn't ease any of my frustration.

IMO if LLMs need to focus on anything right now, they should focus on better grounding. Maybe even something like a probability/confidence score might end up making the experience so much better for so many users like me.
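For illustration, here is a minimal sketch of one way such a confidence score could be computed, assuming the model API exposes per-token log-probabilities for its answer. The function names, the 0.8 threshold, and the sample values are all hypothetical, and a high token probability only says the model found the wording likely, not that the claim is true.

```python
import math


def sequence_confidence(token_logprobs):
    """Return the geometric-mean token probability, a value in [0, 1].

    token_logprobs is assumed to be a list of natural-log probabilities,
    one per generated token. An empty answer scores 0.0.
    """
    if not token_logprobs:
        return 0.0
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_logprob)


def needs_verification(token_logprobs, threshold=0.8):
    """Flag answers whose score falls below a (hypothetical) threshold."""
    return sequence_confidence(token_logprobs) < threshold


# Made-up log-probabilities, for illustration only.
confident_answer = [-0.01, -0.02, -0.05]
shaky_answer = [-0.9, -1.6, -0.4]

print(needs_verification(confident_answer))  # low-uncertainty wording
print(needs_verification(shaky_answer))      # worth double-checking
```

A score like this is cheap to show next to an answer, but it would still need calibrating against real correctness data before anyone could rely on it.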

replies(4): >>46242430 #>>46242681 #>>46242794 #>>46242816 #

XCSme ◴[12 Dec 25 10:37 UTC] No.46242794[source]▶

>>46242173 #

But most benchmarks are not about that...

Are there even any "hallucination" public benchmarks?

replies(1): >>46243002 #

1. andrepd ◴[12 Dec 25 11:13 UTC] No.46243002[source]▶

>>46242794 #

"Benchmarks" for LLMs are a total hoax, since you can train them on the benchmarks themselves.
