
340 points agomez314 | 1 comments | | HN request time: 0.354s | source
thwayunion ◴[] No.35245821[source]
Absolutely correct.

We already know this from self-driving cars. Passing a driver's test was already possible in 2015 or so, but SDCs clearly aren't ready for L5 deployment even today.

There are also a lot of excellent examples of failure modes in object detection benchmarks.

Tests, such as driver's tests or standardized exams, are designed for humans. They make a lot of entirely implicit assumptions about failure modes and gaps in knowledge that are uniquely human. Automated systems work differently. They don't fail in the same way that humans fail, and therefore need different benchmarks.

Designing good benchmarks that probe GPT systems for common failure modes and weaknesses is actually quite difficult. Much more difficult than designing or training these systems, IME.

replies(12): >>35245981 #>>35246141 #>>35246208 #>>35246246 #>>35246355 #>>35246446 #>>35247376 #>>35249238 #>>35249439 #>>35250684 #>>35251205 #>>35252879 #
zer00eyz ◴[] No.35245981[source]
> good benchmarks ... failure modes and weaknesses is actually quite difficult. Much more difficult than designing or training these systems

Is it? Based on the restrictions placed on the systems we see today, and the ways people are breaking them, I would say that some failure modes are known.

replies(2): >>35246061 #>>35246078 #
thwayunion ◴[] No.35246061[source]
A good benchmark is not simply a set of unit tests.

What you want in a benchmark is a set of things you can use to measure general improvement; doing better should decrease the propensity of a particular failure mode. Doing this in a way that generalizes beyond specific sub-problems, or even specific inputs in the benchmark suite, is difficult. Building a benchmark suite that's large and comprehensive enough that generalization isn't necessary is also a challenge.
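The distinction can be sketched in code. A minimal, entirely hypothetical example of measuring the *propensity* of one failure mode (a model ignoring negation) across many perturbed inputs, rather than pass/fail on fixed unit tests; all names here are illustrative assumptions, not any real benchmark's API:

```python
# Hypothetical sketch: score a failure mode's propensity across many
# perturbed inputs, instead of asserting pass/fail on a fixed test set.

def negation_flip(prompt: str) -> str:
    """Perturb a prompt by negating it -- one illustrative failure probe."""
    return prompt.replace(" is ", " is not ")

def propensity(model, cases):
    """Fraction of cases where the model's answer wrongly survives negation."""
    failures = 0
    for prompt, answer in cases:
        # Giving the same answer to the negated prompt counts as a failure.
        if model(negation_flip(prompt)) == answer:
            failures += 1
    return failures / len(cases)

# A toy "model" that ignores its input entirely -- worst case for this probe.
toy_model = lambda p: "yes"
cases = [("the sky is blue", "yes"), ("grass is green", "yes")]
print(propensity(toy_model, cases))  # 1.0: fails on every probed input
```

A lower score on such a metric indicates reduced propensity for that failure mode, which is the "measure general improvement" property above; whether the metric generalizes beyond the specific perturbation is exactly the hard part.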

Think about an analogy to software security. Exploiting a SQL injection attack in insecure code is easy. Coming up with a set of unit tests that ensures an entire black box software system is free of SQL injection attacks is quite a bit more difficult. Red teaming vs blue teaming, except the blue team doesn't get source code in this case. So the security guarantee has to come from unit tests alone, not systematic design decisions. Just like in software security, knowing that you've systematically eliminated a problem is much more difficult than finding one instance of the problem.