GPT-5.2

(openai.com)

1053 points atgctg | 3 comments | 11 Dec 25 18:04 UTC

https://platform.openai.com/docs/guides/latest-model

System card: https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944...

zone411 ◴[11 Dec 25 19:46 UTC] No.46236209[source]▶

>>46234788 (OP) #

I've benchmarked it on the Extended NYT Connections benchmark (https://github.com/lechmazur/nyt-connections/):

The high-reasoning version of GPT-5.2 improves on GPT-5.1: 69.9 → 77.9.

The medium-reasoning version also improves: 62.7 → 72.1.

The no-reasoning version also improves: 22.1 → 27.5.

Gemini 3 Pro and Grok 4.1 Fast Reasoning still score higher.
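To put the three jumps on the same footing, here is a minimal sketch that works out the absolute and relative gains from the scores quoted above. The dictionary layout and the `improvement` helper are illustrative names only, not anything from the benchmark repo.

```python
# Scores quoted in the comment above (GPT-5.1, GPT-5.2), keyed by reasoning level.
# The structure and names here are assumptions made for illustration.
scores = {
    "high": (69.9, 77.9),
    "medium": (62.7, 72.1),
    "none": (22.1, 27.5),
}


def improvement(old: float, new: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    gain = new - old
    return round(gain, 1), round(100 * gain / old, 1)


gains = {level: improvement(old, new) for level, (old, new) in scores.items()}

for level, (points, percent) in gains.items():
    print(f"{level}: +{points} points ({percent}% relative)")
```

On these numbers the medium-reasoning version gains the most in absolute terms (9.4 points), while the no-reasoning version gains the most relative to its starting score (roughly 24%).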

replies(4): >>46236325 #>>46236642 #>>46237650 #>>46241682 #

Donald ◴[11 Dec 25 19:57 UTC] No.46236325[source]▶

Gemini 3 Pro Preview gets 96.8% on the same benchmark? That's impressive.

replies(2): >>46236367 #>>46236593 #

1. capitainenemo ◴[11 Dec 25 20:01 UTC] No.46236367[source]▶

And it performs very well on the latest 100 puzzles too, so it isn't just learning the data set (unless I guess they routinely index this repo).

I wonder how well AIs would do at Bracket City. I tried Gemini on it and was underwhelmed. It made a lot of terrible connections and often bled data from one level into the next.

replies(1): >>46243248 #

2. wooger ◴[12 Dec 25 11:50 UTC] No.46243248[source]▶

>>46236367 (TP) #

> unless I guess they routinely index this repo

This sounds like exactly the kind of thing any tech company would do when confronted with a competitive benchmark.

replies(1): >>46244019 #

3. rsanek ◴[12 Dec 25 13:40 UTC] No.46244019[source]▶

I mean, the repo has <200 stars; it's not so mainstream that you'd expect LLM makers to be actively watching it. If they wanted to game it, they could more easily do that in RL with synthetic data anyway.