
340 points by agomez314 | 1 comment
thwayunion ◴[] No.35245821[source]
Absolutely correct.

We already know this from self-driving cars. Passing a driver's test was possible by 2015 or so, but SDCs clearly aren't ready for L5 deployment even today.

There are also a lot of excellent examples of failure modes in object detection benchmarks.

Tests, such as driver's tests or standardized exams, are designed for humans. They make a lot of entirely implicit assumptions about failure modes and gaps in knowledge that are uniquely human. Automated systems work differently. They don't fail in the same way that humans fail, and therefore need different benchmarks.

Designing good benchmarks that probe GPT systems for common failure modes and weaknesses is actually quite difficult. Much more difficult than designing or training these systems, IME.
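To make that concrete, here is a minimal sketch of one such probe. It is only an illustration, and ask_model is a hypothetical wrapper around whatever GPT-style system is being benchmarked: check whether the model answers paraphrases of the same question consistently, a failure mode human exam-takers rarely show but LLMs often do.

    # Minimal sketch of a consistency probe. ask_model() is a hypothetical
    # wrapper around whatever GPT-style system is being benchmarked.
    from collections import Counter

    def consistency_probe(ask_model, paraphrases):
        # Ask the same question phrased several ways and measure how often
        # the most common answer is returned (1.0 = fully consistent).
        answers = [ask_model(p).strip().lower() for p in paraphrases]
        top_count = Counter(answers).most_common(1)[0][1]
        return top_count / len(answers)

    paraphrases = [
        "Is 7 a prime number?",
        "Would you say the number seven is prime?",
        "Seven: prime or composite?",
    ]
    # score = consistency_probe(ask_model, paraphrases)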

replies(12): >>35245981 #>>35246141 #>>35246208 #>>35246246 #>>35246355 #>>35246446 #>>35247376 #>>35249238 #>>35249439 #>>35250684 #>>35251205 #>>35252879 #
1. rileymat2 ◴[] No.35249439[source]
> There are also a lot of excellent examples of failure modes in object detection benchmarks.

I am curious whether there are counterexamples where machine object detection actually does better than humans. As a kid I used to see faces in the dark, and to some extent I still do; this is a really common thing the human brain does. https://www.wired.com/story/why-humans-see-faces-everyday-ob... https://en.wikipedia.org/wiki/Pareidolia

Part of me wonders whether, when facing novel environments, a sufficiently intelligent system needs to make these kinds of errors. But AI errors will always be different from human errors, like you say.
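For what it's worth, this is fairly easy to poke at directly. A rough sketch, assuming opencv-python and numpy are installed, that counts how often a stock Haar-cascade face detector reports faces in pure noise, a crude machine analogue of pareidolia:

    # Rough sketch: how often does a stock Haar-cascade face detector report
    # faces in random noise? Assumes opencv-python and numpy are installed.
    import cv2
    import numpy as np

    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    trials, hits = 200, 0
    for _ in range(trials):
        noise = np.random.randint(0, 256, (256, 256), dtype=np.uint8)
        faces = detector.detectMultiScale(noise, scaleFactor=1.1, minNeighbors=3)
        hits += len(faces) > 0

    print(f"'faces' reported in {hits}/{trials} noise images")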