
GPT-5.2 (openai.com)
1019 points by atgctg
josalhor:
Compared with GPT-5.1 Thinking:

ARC-AGI-2: 17.6% -> 52.9%

SWE-bench Verified: 76.3% -> 80%

That's pretty good!

verdverm:
We're also in benchmark-saturation territory. I've heard it speculated that Anthropic emphasizes benchmarks less in its publications because internally it cares much less about them than about making a model that works well day-to-day.
brokensegue:
How do you quantitatively measure day-to-day quality? The only thing I can think of is A/B tests, which take a while to evaluate.
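
To make the "takes a while" part concrete: a minimal sketch, assuming each data point is a blind head-to-head preference vote between two model variants. The ab_winrate helper and the vote counts are made up for illustration, not any lab's actual eval harness:

    import math

    def ab_winrate(wins_a, wins_b, z=1.96):
        """Win rate of variant A over B from pairwise votes,
        with an approximate 95% Wilson score interval."""
        n = wins_a + wins_b
        p = wins_a / n
        denom = 1 + z * z / n
        center = (p + z * z / (2 * n)) / denom
        half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
        return center - half, center + half

    # e.g. 515 preferences for the new model out of 1000 head-to-head chats:
    lo, hi = ab_winrate(515, 485)
    print(f"win-rate 95% CI: [{lo:.3f}, {hi:.3f}]")
    # -> roughly [0.484, 0.546]; the interval still straddles 0.5, so after
    #    1000 votes a ~3-point lift is not yet distinguishable from noise

Small day-to-day quality gaps need a lot of votes before the interval clears 0.5, which is exactly the slow-to-evaluate problem.
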
verdverm:
More or less this, but also with synthetic sessions.

If you think about GANs, it's all the same concept (toy sketch after the list):

1. train model (agent)

2. train another model (agent) to do something interesting with/to the main model

3. gain new capabilities

4. iterate
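
A toy version of that loop. Everything here (Model, Adversary, the number-guessing "task") is a made-up stand-in to show the shape of the iteration, not anyone's actual training setup:

    import random

    class Model:
        """Stand-in for the main model: 'knows' a growing set of tasks."""
        def __init__(self):
            self.known = set()

        def solve(self, task):
            return task in self.known

        def train(self, failures):
            # 3. gain new capabilities by learning from surfaced failures
            self.known.update(failures)

    class Adversary:
        """Stand-in for the second model: probes the main model for weaknesses."""
        def propose(self, n=20):
            return [random.randrange(100) for _ in range(n)]

        def find_failures(self, model, tasks):
            return [t for t in tasks if not model.solve(t)]

    model, adversary = Model(), Adversary()    # 1./2. train both agents
    for step in range(10):                     # 4. iterate
        tasks = adversary.propose()            # adversary probes the model
        failures = adversary.find_failures(model, tasks)
        model.train(failures)                  # model improves on its failures
        print(f"step {step}: {len(failures)} failures found")

The failure count drops as the model's coverage grows; the interesting engineering is in making the adversary propose useful tasks rather than random ones.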

You can use a mix of real and synthetic chat sessions, or whatever else you want your model to be good at. Mid/late training seems to be where you start crafting personality and areas of expertise.
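
The real/synthetic mixing knob might look something like this. Purely illustrative: mix_sessions and the 30% synthetic fraction are made up, and the ratio you'd actually want depends on data quality and the target behavior:

    import random

    def mix_sessions(real, synthetic, synthetic_fraction=0.3, seed=0):
        """Blend real and synthetic chat sessions at a target ratio."""
        rng = random.Random(seed)
        # number of synthetic sessions needed so they make up
        # `synthetic_fraction` of the blended set
        n_synth = round(len(real) * synthetic_fraction / (1 - synthetic_fraction))
        blend = list(real) + rng.sample(synthetic, min(n_synth, len(synthetic)))
        rng.shuffle(blend)
        return blend

    # e.g. 700 real sessions topped up to ~30% synthetic:
    blend = mix_sessions([f"real-{i}" for i in range(700)],
                         [f"synth-{i}" for i in range(1000)])
    print(len(blend))  # 1000: 700 real + 300 synthetic
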

Getting into the guts of agentic systems has me believing we have quite a bit of runway for iteration here, especially as we move beyond single-model / LLM training. I still need to dig into what's du jour in RL / late training; from my understanding so far, that's where a lot of the opportunity lies.

Nathan Lambert (https://bsky.app/profile/natolambert.bsky.social) from Ai2 (https://allenai.org/), author of the RLHF Book (https://rlhfbook.com/), put out a really great video yesterday about the experience of training Olmo 3 Think:

https://www.youtube.com/watch?v=uaZ3yRdYg8A