GPT-5.2

(openai.com)
1053 points atgctg | 3 comments
zone411 ◴[] No.46236209[source]
I've benchmarked it on the Extended NYT Connections benchmark (https://github.com/lechmazur/nyt-connections/):

The high-reasoning version of GPT-5.2 improves on GPT-5.1: 69.9 → 77.9.

The medium-reasoning version also improves: 62.7 → 72.1.

The no-reasoning version also improves: 22.1 → 27.5.

Gemini 3 Pro and Grok 4.1 Fast Reasoning still score higher.
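
To put the three improvements above on a common footing, here is a minimal sketch that computes the absolute and relative gain for each reasoning setting. The scores are the ones quoted in this comment; the function name `gains` and the dictionary labels are illustrative assumptions, not part of the benchmark repo.

```python
# Scores quoted above: (GPT-5.1, GPT-5.2) per reasoning setting.
# The keys are illustrative labels, not identifiers from the benchmark.
scores = {
    "high": (69.9, 77.9),
    "medium": (62.7, 72.1),
    "none": (22.1, 27.5),
}


def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent), rounded to 1 decimal."""
    absolute = after - before
    relative = absolute / before * 100
    return round(absolute, 1), round(relative, 1)


for setting, (before, after) in scores.items():
    absolute, relative = gains(before, after)
    print(f"{setting}: +{absolute} points ({relative}% relative)")
```

The no-reasoning setting has the smallest absolute gain but the largest relative one, since it starts from a much lower base.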

replies(4): >>46236325 #>>46236642 #>>46237650 #>>46241682 #
Donald ◴[] No.46236325[source]
Gemini 3 Pro Preview gets 96.8% on the same benchmark? That's impressive
replies(2): >>46236367 #>>46236593 #
1. capitainenemo ◴[] No.46236367[source]
And it performs very well on the latest 100 puzzles too, so it isn't just learning the data set (unless I guess they routinely index this repo).

I wonder how well AIs would do at Bracket City. I tried Gemini on it and was underwhelmed. It made a lot of terrible connections and often bled data from one level into the next.

replies(1): >>46243248 #
2. wooger ◴[] No.46243248[source]
> unless I guess they routinely index this repo

This sounds like exactly the kind of thing any tech company would do when confronted with a competitive benchmark.

replies(1): >>46244019 #
3. rsanek ◴[] No.46244019[source]
I mean, the repo has <200 stars, it's not like it's so mainstream that you'd expect LLM makers to be watching it actively. If they wanted to game it, they could more easily do that in RL with synthetic data anyway.