GPT-5.2

(openai.com)

https://platform.openai.com/docs/guides/latest-model

System card: https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944...


svara ◴[12 Dec 25 08:08 UTC] No.46241936[source]▶

In my experience, the best models are already nearly as good as they can be for a large fraction of what I personally use them for, which is basically as a more efficient search engine.

The thing that would now make the biggest difference isn't "more intelligence", whatever that might mean, but better grounding.

It's still a big issue that the models will make up plausible sounding but wrong or misleading explanations for things, and verifying their claims ends up taking time. And if it's a topic you don't care about enough, you might just end up misinformed.

I think Google/Gemini realize this, since their "verify" feature is designed to address exactly this. Unfortunately it hasn't worked very well for me so far.

But to me it's very clear that the product that gets this right will be the one I use.

replies(12): >>46241987 #>>46242107 #>>46242173 #>>46242280 #>>46242317 #>>46242483 #>>46242537 #>>46242589 #>>46243494 #>>46243567 #>>46243680 #>>46244002 #

stacktrace ◴[12 Dec 25 08:51 UTC] No.46242173[source]▶

>>46241936 #

> It's still a big issue that the models will make up plausible sounding but wrong or misleading explanations for things, and verifying their claims ends up taking time. And if it's a topic you don't care about enough, you might just end up misinformed.

Exactly! One important thing LLMs have made me realise deeply is that "no information" is better than false information. The way LLMs pull out completely incorrect explanations baffles me. I suppose that's expected, since in the end they're generating tokens based on their training and it's reasonable that they might hallucinate some stuff, but knowing this doesn't ease any of my frustration.

IMO if LLMs need to focus on anything right now, they should focus on better grounding. Maybe even something like a probability/confidence score might end up making the experience so much better for so many users like me.

replies(4): >>46242430 #>>46242681 #>>46242794 #>>46242816 #

1. biofox ◴[12 Dec 25 10:41 UTC] No.46242816[source]▶

>>46242173 #

I ask for confidence scores in my custom instructions / prompts, and LLMs do surprisingly well at estimating their own knowledge most of the time.

replies(3): >>46243213 #>>46243490 #>>46243812 #

2. drclau ◴[12 Dec 25 11:44 UTC] No.46243213[source]▶

>>46242816 (TP) #

How do you know the confidence scores are not hallucinated as well?
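One way to test this empirically is to log the stated confidence next to whether each answer turned out to be correct, then measure calibration. The sketch below computes expected calibration error (ECE); the function name, the bin count, and the assumption that you have hand-graded answers are all illustrative and not something from this thread.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: stated scores in [0, 1], one per answer (assumed to be logged by you)
    correct:     booleans from your own grading of those answers
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need two non-empty sequences of equal length")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    # Group answers into equal-width confidence bins; 1.0 goes into the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes its confidence/accuracy gap, weighted by its size.
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

A value near 0 means the stated scores track reality on your sample; a large value means the numbers are decoration. It says nothing about questions unlike the ones you graded.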

replies(2): >>46243327 #>>46243333 #

3. dfsegoat ◴[12 Dec 25 12:01 UTC] No.46243327[source]▶

>>46243213 #

They 100% are, unless you provide a rubric, i.e. basically make the score ordinal.

"Return a score of 0.0 if ...., Return a score of 0.5 if .... , Return a score of 1.0 if ..."

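A minimal sketch of the rubric idea in Python: build the prompt from a fixed set of score levels and reject any reply that is not exactly one of them. The rubric conditions here are hypothetical placeholders (the comment above leaves them elided), and `build_rubric_prompt` / `parse_score` are made-up helper names, not part of any library.

```python
from typing import Optional

# Hypothetical rubric: the conditions are placeholders for illustration only.
RUBRIC = {
    0.0: "the answer is not supported by any provided source",
    0.5: "the answer is partly supported by a provided source",
    1.0: "the answer is directly supported by a provided source",
}

def build_rubric_prompt(question: str, rubric: dict = RUBRIC) -> str:
    """Append one 'Return a score of X if ...' line per rubric level."""
    lines = [
        f"Return a score of {score} if {condition}."
        for score, condition in sorted(rubric.items())
    ]
    return (
        f"{question}\n\n"
        "Score your answer using only these levels:\n" + "\n".join(lines)
    )

def parse_score(reply: str, rubric: dict = RUBRIC) -> Optional[float]:
    """Return the score only if the reply is exactly one of the rubric levels."""
    try:
        value = float(reply.strip())
    except ValueError:
        return None
    return value if value in rubric else None
```

Restricting the output to a few named levels does not make the score true, but it does make it checkable: a reply of 0.7 is rejected outright instead of being read as a probability.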
4. kiliankoe ◴[12 Dec 25 12:02 UTC] No.46243333[source]▶

>>46243213 #

They are. The model has no inherent knowledge about its confidence levels; it just adds plausible-sounding numbers. Obviously they _can_ be plausible, but trusting these is just another level up from trusting the original output.

I read a comment here a few weeks back that LLMs always hallucinate, but we sometimes get lucky when the hallucinations match up with reality. I've been thinking about that a lot lately.

replies(2): >>46243440 #>>46244120 #

5. TeMPOraL ◴[12 Dec 25 12:19 UTC] No.46243440{3}[source]▶

>>46243333 #

> the model has no inherent knowledge about its confidence levels

Kind of. See e.g. https://openreview.net/forum?id=mbu8EEnp3a, but I think it was already established a year ago that LLMs tend to have an identifiable internal confidence signal; the challenge around the time of the DeepSeek-R1 release was to connect that signal, through training, to tool-use activation, so that the model does a search if it "feels unsure".
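As a rough illustration of "search if unsure", here is a sketch that gates a search call on mean token entropy. This is an assumption on my part: entropy over next-token probabilities is only a crude external stand-in for the internal signal, it is not the method from the linked paper, and the threshold and function names are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_search(token_distributions, threshold=1.0):
    """Decide whether to fall back to a search tool.

    token_distributions: one probability list per generated token
                         (assumed to come from whatever logprobs your API exposes)
    threshold:           hypothetical cutoff in nats; would need tuning per model
    """
    if not token_distributions:
        return True  # nothing generated, so there is nothing to trust
    mean_entropy = sum(
        token_entropy(dist) for dist in token_distributions
    ) / len(token_distributions)
    return mean_entropy > threshold
```

A peaked distribution such as `[1.0, 0.0]` has zero entropy and does not trigger a search, while a flat one over four tokens (about 1.39 nats) does.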

6. EastLondonCoder ◴[12 Dec 25 12:27 UTC] No.46243490[source]▶

>>46242816 (TP) #

I’m with the people pushing back on the “confidence scores” framing, but I think the deeper issue is that we’re still stuck in the wrong mental model.

It’s tempting to think of a language model as a shallow search engine that happens to output text, but that metaphor doesn’t actually match what’s happening under the hood. A model doesn’t “know” facts or measure uncertainty in a Bayesian sense. All it really does is traverse a high‑dimensional statistical manifold of language usage, trying to produce the most plausible continuation.

That’s why a confidence number that looks sensible can still be as made up as the underlying output, because both are just sequences of tokens tied to trained patterns, not anchored truth values. If you want truth, you want something that couples probability distributions to real world evidence sources and flags when it doesn’t have enough grounding to answer, ideally with explicit uncertainty, not hand‑waviness.

People talk about hallucination like it’s a bug that can be patched at the surface level. I think it’s actually a feature of the architecture we’re using: generating plausible continuations by design. You have to change the shape of the model or augment it with tooling that directly references verified knowledge sources before you get reliability that matters.

replies(1): >>46243833 #

7. ryoshu ◴[12 Dec 25 13:13 UTC] No.46243812[source]▶

>>46242816 (TP) #

LLMs fail at causal accuracy. It's a fundamental problem with how they work.

8. kznewman ◴[12 Dec 25 13:16 UTC] No.46243833[source]▶

>>46243490 #

Solid agree. Hallucination for me IS the LLM use case. What I am looking for are ideas that may or may not be true that I have not considered and then I go try to find out which I can use and why.

replies(1): >>46244020 #

9. sheeshe ◴[12 Dec 25 13:40 UTC] No.46244020{3}[source]▶

>>46243833 #

In essence it is a thing that is actually prompting your own brain… seems counterintuitive, but that’s how I believe this technology should be used.

10. fragmede ◴[12 Dec 25 13:50 UTC] No.46244120{3}[source]▶

>>46243333 #

In science, long before LLMs, there was this saying: all models are wrong, some are useful. We model, say, gravity as 9.8 m/s² on Earth, knowing full well that it doesn't hold true across the universe, and we're able to build things on top of that foundation. Whether that foundation is made of bricks, or is made of sand, for LLMs, is for us to decide.
