(openai.com)

1053 points atgctg | 2 comments | 11 Dec 25 18:04 UTC

https://platform.openai.com/docs/guides/latest-model

System card: https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944...

simonw ◴[11 Dec 25 19:01 UTC] No.46235580[source]▶

Wow, there's a lot going on with this pelican riding a bicycle: https://gist.github.com/simonw/c31d7afc95fe6b40506a9562b5e83...

replies(12): >>46235608 #>>46236119 #>>46236455 #>>46236615 #>>46236751 #>>46236849 #>>46237862 #>>46237969 #>>46238631 #>>46239729 #>>46240577 #>>46240638 #

Stevvo ◴[11 Dec 25 20:40 UTC] No.46236849[source]▶

>>46235580 #

The variance is way too high for this test to have any value at all. I ran it 10 times, and each pelican on a bicycle was a better rendition than that; you could say about half of them were perfect.

replies(3): >>46237560 #>>46240319 #>>46241401 #

1. golly_ned ◴[11 Dec 25 21:40 UTC] No.46237560[source]▶

>>46236849 #

Compared to the other benchmarks, which are much more gameable, I trust PelicanBikeEval way more.

replies(2): >>46239011 #>>46239406 #

2. ◴[11 Dec 25 23:48 UTC] No.46239011[source]▶

>>46237560 (TP) #

GPT-5.2