GPT-4 and professional benchmarks: the wrong answer to the wrong question

(aisnakeoil.substack.com)


thwayunion ◴[21 Mar 23 13:28 UTC] No.35245821[source]▶

Absolutely correct.

We already know this from self-driving cars. Passing a driver's test was already possible in 2015 or so, but SDCs clearly aren't ready for L5 deployment even today.

There are also a lot of excellent examples of failure modes in object detection benchmarks.

Tests, such as driver's tests or standardized exams, are designed for humans. They make a lot of entirely implicit assumptions about failure modes and gaps in knowledge that are uniquely human. Automated systems work differently. They don't fail in the same way that humans fail, and therefore need different benchmarks.

Designing good benchmarks that probe GPT systems for common failure modes and weaknesses is actually quite difficult. Much more difficult than designing or training these systems, IME.

replies(12): >>35245981 #>>35246141 #>>35246208 #>>35246246 #>>35246355 #>>35246446 #>>35247376 #>>35249238 #>>35249439 #>>35250684 #>>35251205 #>>35252879 #

jstummbillig ◴[21 Mar 23 14:03 UTC] No.35246246[source]▶

>>35245821 #

> Designing good benchmarks that probe GPT systems for common failure modes and weaknesses is actually quite difficult. Much more difficult than designing or training these systems, IME.

What do you think is the difficulty?

replies(1): >>35246300 #

thwayunion ◴[21 Mar 23 14:07 UTC] No.35246300[source]▶

>>35246246 #

A good benchmark provides a strong quantitative or qualitative signal that a model has a specific capability, or does not have a specific flaw, within a given operating domain.

Each part of this is difficult: identifying and characterizing the operating domain, figuring out how to empirically characterize a general abstract capability, figuring out how to empirically characterize a specific type of flaw, and characterizing the degree of confidence that a benchmark result gives within the domain. To say nothing of the actual work of building the benchmark.
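
As a rough illustration of only the last piece (the degree of confidence a result gives), here is a minimal sketch that puts a Wilson score interval around a benchmark pass rate. It assumes the benchmark items are independent samples from the operating domain, which is exactly the assumption that is hard to justify in practice; the function name and the 95% default are illustrative choices, not part of any particular benchmark suite.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence).

    Assumes items are independent draws from the domain of interest.
    It captures sampling error only, not whether the items cover the domain.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

# Example: 90 of 100 items passed.
low, high = wilson_interval(90, 100)
print(f"pass rate 0.90, interval [{low:.3f}, {high:.3f}]")
```

Even with 100 items the interval is about 12 points wide, and that is before asking whether the items represent the domain at all.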

replies(1): >>35246375 #

1. jstummbillig ◴[21 Mar 23 14:13 UTC] No.35246375[source]▶

>>35246300 #

Sure, but how does this specifically concern GPT-like systems? Why not test them for concrete qualifications in the way we test humans, using the tests we already designed to test concrete qualifications in humans?

replies(3): >>35246479 #>>35246588 #>>35248793 #

2. sebzim4500 ◴[21 Mar 23 14:18 UTC] No.35246479[source]▶

>>35246375 (TP) #

The difference is the impact of contaminated datasets. Exam boards tend to reuse questions, either verbatim or slightly modified. This is not such a problem for assessing humans, because it is easier for a human to learn the material than to learn 25 years of prior exams. Clearly that is not the case for current LLMs.
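
One crude way to look for this kind of contamination is to measure how many word n-grams of an exam question also appear verbatim in a reference corpus. The sketch below is only an illustration under assumptions: `corpus_docs` stands in for whatever sample of training-like text you can actually inspect, the function names are made up for this example, and the choice of n=8 is arbitrary. It will miss the "slightly modified" questions mentioned above.

```python
import re

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(question: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the question's n-grams that occur verbatim in any corpus document."""
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0  # question shorter than n words: nothing to compare
    seen = set()
    for doc in corpus_docs:
        seen |= question_grams & ngrams(doc, n)
    return len(seen) / len(question_grams)

# Hypothetical data: one past-paper dump and two candidate questions.
corpus_docs = [
    "Past paper: A train leaves the station at nine travelling at sixty "
    "miles per hour toward the city. How long until it arrives?"
]
reused = "A train leaves the station at nine travelling at sixty miles per hour toward the city."
fresh = "A cyclist climbs a hill whose gradient doubles every hundred metres along the road."

print(overlap_fraction(reused, corpus_docs, n=5))
print(overlap_fraction(fresh, corpus_docs, n=5))
```

A high fraction suggests the question is not a clean test of the model; a low fraction proves little, since paraphrases and translations slip straight through.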

3. thwayunion ◴[21 Mar 23 14:24 UTC] No.35246588[source]▶

>>35246375 (TP) #

Again, because machines have different failure modes than humans.

4. simiones ◴[21 Mar 23 16:43 UTC] No.35248793[source]▶

>>35246375 (TP) #

To take a simplistic example: a human who can provide a long, well-motivated solution to a math problem that you re-use every three years likely understands the math behind it. An LLM providing the same solution is likely just copying it from the training set, and would be fully unable to solve a similar problem that did not appear in the training set.

Lots of exams are designed to prove certain knowledge given safe assumptions of the known limitations of humans, which are completely wrong for machines. The relative difficulty of rote memorization versus having an accurate domain model is perhaps the most obvious one, but there are others.

Also, the opposite problem will often exist - if the exam is provided in the wrong format to the AI, we may underestimate its abilities (i.e. a very similar prompt may elicit a significantly better response).

replies(2): >>35249704 #>>35251232 #

5. thwayunion ◴[21 Mar 23 17:44 UTC] No.35249704[source]▶

>>35248793 #

> Lots of exams are designed to prove certain knowledge given safe assumptions of the known limitations of humans, which are completely wrong for machines. The relative difficulty of rote memorization versus having an accurate domain model is perhaps the most obvious one, but there are others.

This paragraph is a gem. Well said.

6. jstummbillig ◴[21 Mar 23 19:24 UTC] No.35251232[source]▶

>>35248793 #

I don't think this is obvious at all. Sure, it's easy enough to make mechanistic arguments (after all, we don't even really understand most of the mechanics on either side, human and ai) but that doesn't mean it will matter in the slightest when we evaluate the outcome in regards to any metric we care about.

Could be tho, of course.

replies(1): >>35269474 #

7. thwayunion ◴[23 Mar 23 01:38 UTC] No.35269474{3}[source]▶

>>35251232 #

It's extremely obvious to anyone who works on real systems.

> (after all, we don't even really understand most of the mechanics on either side, human and ai)

We don't need mechanistic explanations to observe radical differences in behavior, and there are mechanistic explanations for some of these differences.

E.g., CNNs and the visual cortex. We really do understand some mechanisms, of both CNNs and the visual cortex, well enough to understand divergences in failure modes. Adversarial examples are one such divergence.

> Sure, it's easy enough to make mechanistic arguments, but that doesn't mean it will matter in the slightest when we evaluate the outcome in regards to any metric we care about.

I can't quite figure out what this sequence of tokens is supposed to mean.

Anyways, again, the failure modes of LLMs are obviously different than the failure modes of humans. Anyone who has spent even a trivial amount of time training both will instantly observe that this is true.
