
GPT-5.2

(openai.com)
1053 points | atgctg | 2 comments
simonw No.46235580
Wow, there's a lot going on with this pelican riding a bicycle: https://gist.github.com/simonw/c31d7afc95fe6b40506a9562b5e83...
Stevvo No.46236849
The variance is way too high for this test to have any value at all. I ran it 10 times, and each pelican on a bicycle was a better rendition than that; you could say about half of them were perfect.
1. golly_ned No.46237560
Compared to the other benchmarks, which are much more gameable, I trust PelicanBikeEval way more.