
GPT-5.2

(openai.com)
1019 points atgctg | 4 comments
1. Stevvo ◴[] No.46236849[source]
The variance is way too high for this test to have any value at all. I ran it 10 times, and every pelican on a bicycle was a better rendition than that; you could say about half of them were perfect.
replies(3): >>46237560 #>>46240319 #>>46241401 #
2. golly_ned ◴[] No.46237560[source]
Compared to the other benchmarks which are much more gameable, I trust PelicanBikeEval way more.
replies(2): >>46239011 #>>46239406 #
3. ◴[] No.46239011[source]
4. getnormality ◴[] No.46241401[source]
Well, the variance is itself interesting.