
GPT-5.2

(openai.com)
1019 points by atgctg | 53 comments
1. breakingcups ◴[] No.46235173[source]
Is it just me, or did it still get at least three component placements in the motherboard image[0] completely wrong (the RAM and PCIe slots, plus that's DisplayPort, not HDMI)? Why would they use that as a promotional image?

0: https://images.ctfassets.net/kftzwdyauwt9/6lyujQxhZDnOMruN3f...

replies(10): >>46235244 #>>46235267 #>>46236405 #>>46236591 #>>46237241 #>>46239493 #>>46240735 #>>46241534 #>>46241550 #>>46241781 #
2. timerol ◴[] No.46235244[source]
Also a "stacked pair" of USB type-A ports, when there are clearly 4
3. tedsanders ◴[] No.46235267[source]
Yep, the point we wanted to make here is that GPT-5.2's vision is better, not perfect. Cherrypicking a perfect output would actually mislead readers, and that wasn't our intent.
replies(9): >>46235823 #>>46236007 #>>46236072 #>>46236155 #>>46236158 #>>46236250 #>>46236355 #>>46238538 #>>46241716 #
4. wilg ◴[] No.46235860{3}[source]
What did Sam Altman say? Or is this more of a vague impression thing?
replies(1): >>46235976 #
5. honeycrispy ◴[] No.46235882{3}[source]
Not sure what you mean, Altman does that fake-humility thing all the time.

It's a marketing trick; show honesty in areas that don't have much business impact so the public will trust you when you stretch the truth in areas that do (AGI cough).

replies(1): >>46235940 #
6. d--b ◴[] No.46235940{4}[source]
I'm confident that GP is acting in good faith, though. Maybe I'm falling for it. Who knows? It doesn't really matter; I just wanted to be nice to the guy. It takes some balls to post as an OpenAI employee here, and I wish we heard from them more often, as I'm pretty sure all of them lurk around.
replies(1): >>46236335 #
7. BoppreH ◴[] No.46236007[source]
That would be a laudable goal, but I feel like it's contradicted by the text:

> Even on a low-quality image, GPT‑5.2 identifies the main regions and places boxes that roughly match the true locations of each component

I would not consider it to have "identified the main regions" or to have "roughly matched the true locations" when ~1/3 of the boxes have incorrect labels. The remark "even on a low-quality image" is not helping either.

Edit: credit where credit is due, the recently-added disclaimer is nice:

> Both models make clear mistakes, but GPT‑5.2 shows better comprehension of the image.

replies(4): >>46236196 #>>46236246 #>>46236990 #>>46242585 #
8. arscan ◴[] No.46236072[source]
I think you may have inadvertently misled readers in a different way: I didn't catch the errors myself, assumed the labels were broadly correct, and only felt misled after coming across this observation here. Might be worth mentioning that this is better but still inaccurate. Just a bit of feedback; I appreciate that you're willing to show non-cherry-picked examples and are engaging with this question here.

Edit: As mentioned by @tedsanders below, the post was edited to include clarifying language such as: “Both models make clear mistakes, but GPT‑5.2 shows better comprehension of the image.”

replies(1): >>46236436 #
9. minimaxir ◴[] No.46236074{5}[source]
Using ChatGPT to ironically post AI-generated comments is still posting AI-generated comments.
10. g947o ◴[] No.46236155[source]
When I saw that it labeled DP ports as HDMI, I immediately decided that I'm not going to touch this until it's at least 5x better, with 95% accuracy on basic things.

I don't see any advantage in using the tool.

replies(1): >>46236486 #
11. iamdanieljohns ◴[] No.46236158[source]
Is Adaptive Reasoning gone from GPT-5.2? It was a big part of the release of 5.1 and Codex-Max. Really felt like the future.
replies(1): >>46236393 #
12. hnuser123456 ◴[] No.46236196{3}[source]
Yeah, what it's calling RAM slots is the CMOS battery, and what it's calling the PCIe slot is the interior side of the DB-9 connector. RAM and PCIe slots aren't even visible in the image.
replies(1): >>46238203 #
13. ◴[] No.46236246{3}[source]
14. layer8 ◴[] No.46236250[source]
You know what would be great? If it had added some boxes with “might be X or Y, but not sure”.
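A minimal sketch of what that could look like, assuming the model exposed per-box label candidates with confidences (the schema, labels, and numbers below are invented for illustration, not anything OpenAI's API actually returns):

    # Hypothetical uncertainty-aware labels: each box carries candidate labels with confidences.
    boxes = [
        {"box": (412, 580, 470, 640), "candidates": [("CMOS battery", 0.55), ("RAM slot", 0.30)]},
        {"box": (120, 700, 260, 745), "candidates": [("DisplayPort", 0.60), ("HDMI", 0.35)]},
    ]

    for b in boxes:
        label, p = max(b["candidates"], key=lambda c: c[1])
        if p < 0.70:  # below a chosen threshold, surface the doubt instead of asserting
            alternatives = " or ".join(name for name, _ in b["candidates"])
            print(f"{b['box']}: might be {alternatives}, but not sure")
        else:
            print(f"{b['box']}: {label}")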
15. rvnx ◴[] No.46236335{5}[source]
It's the only reasonable choice you can make. As an employee with stock options, you do not want to get trashed on Hacker News, because that directly affects your income if you try to conduct a secondary share sale or plan to hold until the IPO.

Once the IPO is done and the lockup period has expired, a lot of employees plan to sell their shares. Until then, even if the product is behind competitors, there is no way you can admit it without putting your money at risk.

replies(1): >>46237400 #
16. iwontberude ◴[] No.46236355[source]
But it’s completely wrong.
17. tedsanders ◴[] No.46236393{3}[source]
Yes, GPT-5.2 still has adaptive reasoning - we just didn't call it out by name this time. Like 5.1 and codex-max, it should do a better job at answering quickly on easy queries and taking its time on harder queries.
18. whalesalad ◴[] No.46236405[source]
To be fair, that image has the resolution of a flip phone from 2003.
replies(2): >>46237625 #>>46239141 #
19. tedsanders ◴[] No.46236436{3}[source]
Thanks for the feedback - I agree our text doesn't make the models' mistakes clear enough. I'll make some small edits now, though it might take a few minutes to appear.
20. jacquesm ◴[] No.46236486{3}[source]
That's far more dangerous territory. A machine that is obviously broken will not get used. A machine that is subtly broken will propagate errors, because it will have achieved a high enough trust level that it actually gets used.

Think Therac-25: it worked 99.5% of the time. In fact, it worked so well that reports of malfunctions were routinely discarded.

replies(1): >>46242353 #
21. jasonlotito ◴[] No.46236591[source]
FTA: "Both models make clear mistakes, but GPT‑5.2 shows better comprehension of the image."

You can find it right next to the image you are talking about.

replies(2): >>46236847 #>>46237091 #
22. tedsanders ◴[] No.46236847[source]
To be fair to OP, I just added this to our blog after their comment, in response to the correct criticisms that our text didn't make it clear how bad GPT-5.2's labels are.

LLMs have always been very subhuman at vision, and GPT-5.2 continues in this tradition, but it's still a big step up over GPT-5.1.

One way to get a sense of how bad LLMs are at vision is to watch them play Pokemon. E.g.,: https://www.lesswrong.com/posts/u6Lacc7wx4yYkBQ3r/insights-i...

They still very much struggle with basic vision tasks that adults, kids, and even animals can ace with little trouble.

23. furyofantares ◴[] No.46236990{3}[source]
They also changed "roughly match" to "sometimes match".
replies(1): >>46237477 #
24. da_grift_shift ◴[] No.46237091[source]
'Commented after article was already edited in response to HN feedback' award
25. an0malous ◴[] No.46237241[source]
Because the whole culture of AI enthusiasts is to just generate slop and never check the results
26. Esophagus4 ◴[] No.46237400{6}[source]
I know HN commenters like to see themselves as contrarians, as do I sometimes, but man… it seems like a serious stretch to assume malicious intent here, i.e. that an employee of the world's top AI name would astroturf a random HN thread about a picture on a blog.

I'm fairly comfortable taking this OpenAI employee's comment at face value.

Frankly, I don't think an HN thread will make a difference to his financial situation anyway…

replies(1): >>46238628 #
27. MichaelZuo ◴[] No.46237477{4}[source]
Did they really change a meaningful word like that after publication without an edit note…?
replies(2): >>46237734 #>>46237877 #
28. malfist ◴[] No.46237625[source]
If I ask you a question and you don't have enough information to answer, you don't confidently give me an answer; you say you don't know.

I might not know exactly how many USB ports this motherboard has, but I wouldn't select a set of 4 and declare it to be a stacked pair.

replies(1): >>46237813 #
29. piker ◴[] No.46237734{5}[source]
Eh, I'm no shill, but their marketing copy isn't exactly the New York Times. They're given some license to respond to critical feedback in a way that makes the statements more accurate, without the expectations placed on objective journalism of record.
replies(1): >>46241558 #
30. AstroBen ◴[] No.46237813{3}[source]
No one should expect LLMs to give correct answers 100% of the time. It's inherent to the tech for them to be confidently wrong.

Code needs to be checked

References need to be checked

Any facts or claims need to be checked

replies(2): >>46238498 #>>46241514 #
31. dwohnitmok ◴[] No.46237877{5}[source]
This has definitely happened before with e.g. the o1 release. I will sometimes use the Wayback Machine to verify changes that have been made.
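For example, the Wayback Machine's availability API makes that check easy to script. A minimal sketch (the page URL below is a placeholder, not necessarily the blog's real path):

    # Ask the Wayback Machine for the snapshot closest to a given timestamp.
    import json, urllib.request

    api = "https://archive.org/wayback/available?url=openai.com/index/gpt-5-2/&timestamp=20251211"
    with urllib.request.urlopen(api) as resp:
        data = json.load(resp)
    snap = data.get("archived_snapshots", {}).get("closest")
    print(snap["url"] if snap else "no snapshot archived")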
32. hexaga ◴[] No.46238203{4}[source]
It just overlaid a typical ATX pattern on the motherboard-like parts of the image, even though that's not really what the image shows. I don't think it's worthwhile to consider this a 'local recognition failure', as if it just happened to mistake the CMOS battery for RAM slots.

Imagine it as a markdown response:

# Why this is an ATX layout motherboard (Honest assessment, straight to the point, *NO* hallucinations)

1. *RAM* as you can clearly see, the RAM slots are to the right of the CPU, so it's obviously ATX

2. *PCIE* the clearly visible PCIE slots are right there at the bottom of the image, so this definitely cannot be anything except an ATX motherboard

3. ... etc more stuff that is supported only by force of preconception

--

It's just meta-signaling gone off the rails. Something in their post-training pipeline is obviously vulnerable, given how absolutely saturated their model outputs are with it.

Troubling that the behavior generalizes to image labeling, but not particularly surprising. This has been a visible problem at least since o1, and the lack of change tells me they do not have a real solution.

33. malfist ◴[] No.46238498{4}[source]
According to the benchmarks here, they're claiming up to 97% accuracy. That ought to be good enough to trust them, right?

Or maybe these benchmarks are all wrong

replies(3): >>46238863 #>>46242378 #>>46242867 #
34. johnwheeler ◴[] No.46238538[source]
Oh, and you guys never mislead people. Your management is just completely trustworthy, and I'm sure all of you are too. Give me a break, man. If I were you, I'd jump ship, or you're going to end up like a Theranos employee on LinkedIn.
replies(1): >>46241994 #
35. rvnx ◴[] No.46238628{7}[source]
Malicious? No, and this is far from astroturfing; he even speaks as "we". It's just a logical move to defend your company when people claim your product is buggy.

There is no other logical move; that's what I'm saying. Contrary to what people above say, it doesn't require a lot of courage. It's not about courage; it's just normal and logical (and yes, Hacker News matters a lot; this place is a very strong source of signal for investors).

Not a bad thing at all, just observing it.

36. AstroBen ◴[] No.46238863{5}[source]
Does code work if it's 97% correct?

It's not okay if claims are totally made up 1 in 30 times.

Of course people aren't always correct either, but we're able to operate on levels of confidence. We're also able to weight others' statements as more or less likely to be correct based on what we know about them.

replies(1): >>46242386 #
37. redox99 ◴[] No.46239141[source]
It's trivial for a human who knows what a PC looks like, except maybe mistaking DisplayPort for HDMI.
38. 8organicbits ◴[] No.46239493[source]
Promotional content for LLMs is really poor. I was looking at Claude Code, and the example on their homepage implements a feature while ignoring a warning about a security issue, commits locally, doesn't open a PR, and then tries to close the GitHub issue. Whatever code it wrote, they clearly didn't use it, as the issue from the prompt is still open. Bizarre examples.
39. fumeux_fume ◴[] No.46240735[source]
General-purpose LLMs aren't very good at generating bounding boxes, so with that context, this is actually decent performance for certain use cases.
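If you do use them for this, one cheap sanity check is to render the proposed boxes back onto the image and eyeball them before trusting anything. A minimal sketch with Pillow, assuming the model returned pixel coordinates (the file name and box data below are made up):

    # Draw model-proposed boxes on the source image so mislabels are obvious at a glance.
    from PIL import Image, ImageDraw

    img = Image.open("motherboard.jpg").convert("RGB")  # hypothetical input image
    draw = ImageDraw.Draw(img)
    proposed = {  # made-up model output: label -> (x0, y0, x1, y1)
        "RAM slots": (410, 575, 475, 645),
        "PCIe slot": (90, 700, 300, 750),
    }
    for label, (x0, y0, x1, y1) in proposed.items():
        draw.rectangle((x0, y0, x1, y1), outline="red", width=3)
        draw.text((x0, max(0, y0 - 14)), label, fill="red")
    img.save("motherboard_boxes.jpg")  # inspect by hand; here the "RAM slots" box would cover the CMOS battery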
40. dolmen ◴[] No.46241514{4}[source]
"confidently" is a feature selected in the system prompt.

As a user you can influence that behavior.

41. dolmen ◴[] No.46241534[source]
Not that bad compared to product images seen on AliExpress.
42. tennisflyi ◴[] No.46241550[source]
Have you seen the charts from their last release? They obviously don't check. Too rich.
43. mkesper ◴[] No.46241558{6}[source]
Yes, but they should clearly mark updates. That would be professional.
44. ◴[] No.46241716[source]
45. az226 ◴[] No.46241781[source]
And here is Gemini 3: https://media.licdn.com/dms/image/v2/D5610AQH7v9MtrZxxug/ima...
replies(2): >>46241801 #>>46242682 #
46. saejox ◴[] No.46241801[source]
This is very impressive. Google really is ahead.
47. yard2010 ◴[] No.46241994{3}[source]
Hey, no need to personally attack anyone. A bad organization can still consist of good people.
48. AdamN ◴[] No.46242353{4}[source]
There was a low-level Google internal service that worked so well that other teams took a hard dependency on it (against advice). So the internal team added a cron job to drop it every once in a while to get people to trust it less :-)
49. refactor_master ◴[] No.46242378{5}[source]
Gemini routinely makes up stuff about BigQuery's workings. "It's poorly documented." Well, read the open-source code and reason it out.

Makes you wonder what that 97% is worth. Would we accept a different service with only 97% availability, with all of the downtime falling during lunch break?

50. fooker ◴[] No.46242386{6}[source]
> Does code work if it's 97% correct?

Of course it does. The vast majority of software has bugs. Yes, even critical ones like compilers and operating systems.

51. guerrilla ◴[] No.46242585{3}[source]
Leave it to OpenAI to be dishonest about being dishonest. It seems they're editing this post without notice as well.
52. FinnKuhn ◴[] No.46242682[source]
This is genuinely impressive. The OpenAI equivalent is less detailed AND less correct.
53. JimDabell ◴[] No.46242867{5}[source]
Something that is 97% accurate is wrong 3% of the time, so pointing out that it has gotten something wrong does not contradict 97% accuracy in the slightest.
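For scale, a quick compounding calculation (assuming claims are independent, which is a simplification):

    # Chance that an answer made of n independent claims contains zero errors,
    # when each individual claim is right 97% of the time.
    for n in (1, 10, 30, 100):
        print(f"{n:>3} claims: {0.97 ** n:.1%} chance of zero errors")
    # 1 claim: 97.0%; 10: ~73.7%; 30: ~40.1%; 100: ~4.8%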