GPT-4 and professional benchmarks: the wrong answer to the wrong question

(aisnakeoil.substack.com)

340 points agomez314 | 1 comments | 21 Mar 23 13:12 UTC

thwayunion ◴[21 Mar 23 13:28 UTC] No.35245821[source]▶

Absolutely correct.

We already know this from self-driving cars. Passing a driver's test was already possible in 2015 or so, but SDCs clearly aren't ready for L5 deployment even today.

There are also a lot of excellent examples of failure modes in object detection benchmarks.

Tests, such as driver's tests or standardized exams, are designed for humans. They make a lot of entirely implicit assumptions about failure modes and gaps in knowledge that are uniquely human. Automated systems work differently. They don't fail in the same way that humans fail, and therefore need different benchmarks.

Designing good benchmarks that probe GPT systems for common failure modes and weaknesses is actually quite difficult. Much more difficult than designing or training these systems, IME.

replies(12): >>35245981 #>>35246141 #>>35246208 #>>35246246 #>>35246355 #>>35246446 #>>35247376 #>>35249238 #>>35249439 #>>35250684 #>>35251205 #>>35252879 #

Waterluvian ◴[21 Mar 23 14:17 UTC] No.35246446[source]▶

>>35245821 #

On the topic of the driver's test analogy: I've known people who have passed the test and still said, "I don't yet feel ready to drive during rush hour or in downtown Toronto." Then at some point in the future they recognize that they are ready and wade into trickier situations.

I wonder how self-aware these systems can be? Could ChatGPT be expected to say things like, "I can pass a state bar exam but I'm not ready to be a lawyer because..."

replies(3): >>35246728 #>>35246735 #>>35246955 #

tsukikage ◴[21 Mar 23 14:32 UTC] No.35246735[source]▶

>>35246446 #

The problem ChatGPT and the other language models currently in the zeitgeist are trying to solve is, "given this sequence of symbols, what is a symbol that is likely to come next, as rated by some random on fiverr.com?"

Turns out that this is sufficient to autocomplete things like written tests.
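As a rough illustration of "what symbol is likely to come next", here is a toy bigram predictor. It is only a sketch: ChatGPT uses a large neural network plus human feedback, not frequency counts, and the names `train_bigram`, `predict_next`, and the tiny corpus are made up for this example.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count, for each token, which tokens follow it in the training text."""
    successors = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        successors[current][nxt] += 1
    return successors

def predict_next(successors, token):
    """Return the most frequent successor of `token`, or None if unseen."""
    if token not in successors:
        return None
    return successors[token].most_common(1)[0][0]

# Hypothetical training text, invented for illustration.
corpus = "i can pass the exam . i can pass the test . i can pass the exam .".split()
model = train_bigram(corpus)

print(predict_next(model, "the"))     # most common word seen after "the"
print(predict_next(model, "lawyer"))  # never seen in training, so None
```

The predictor completes "the" with whatever followed it most often in training. Nothing in it represents what an exam is, which is the gap between producing plausible text and having the capability the text describes.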

Such a system is also absolutely capable of coming up with sentences like "I can pass a state bar exam but I'm not ready to be a lawyer because..." - or, indeed, sentences with the opposite meaning.

It would, however, be a mistake to draw any conclusions about the system's actual capabilities and/or modes of failure from the things its outputs mean to the human reader; much the same way that if you have dice with a bunch of words on and you roll "I", "am", "sentient" in that order, this event is not yet evidence for the dice's sentience.

replies(2): >>35246804 #>>35259936 #

IIAOPSW ◴[22 Mar 23 12:19 UTC] No.35259936[source]▶

>>35246735 #

It is evidence, just not great evidence on its own. Now if you rolled the dice a few dozen times and the results came out outrageously skewed towards "I" "am" "sentient", maybe it's time to consider the possibility that the dice are sentient.
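The "weak evidence that accumulates" point can be sketched as a Bayesian update. Every number here is an assumption chosen for illustration: three six-sided word dice (so a fair roll spells the phrase with probability 1/216), an alternative hypothesis under which the phrase comes up 90% of the time, and a prior of one in a billion for that alternative. The function `posterior_prob` is hypothetical.

```python
import math

def posterior_prob(prior, n_rolls, n_hits, p_hit_fair=1 / 216, p_hit_alt=0.9):
    """Posterior probability of the 'not just chance' hypothesis.

    n_hits is how many of n_rolls spelled the phrase. The binomial
    coefficient is the same under both hypotheses, so it cancels.
    """
    misses = n_rolls - n_hits
    # Log likelihood ratio: alternative hypothesis vs. fair dice.
    log_lr = (n_hits * (math.log(p_hit_alt) - math.log(p_hit_fair))
              + misses * (math.log(1 - p_hit_alt) - math.log(1 - p_hit_fair)))
    log_odds = math.log(prior) - math.log(1 - prior) + log_lr
    # Numerically stable conversion from log-odds to probability.
    if log_odds >= 0:
        return 1 / (1 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1 + e)

prior = 1e-9  # assumed: the alternative starts out wildly implausible
print(posterior_prob(prior, n_rolls=1, n_hits=1))    # one lucky roll: still tiny
print(posterior_prob(prior, n_rolls=36, n_hits=30))  # heavily skewed: close to 1
```

Under these assumptions a single roll moves the odds by a factor of about 194, which barely dents a one-in-a-billion prior, while thirty hits in thirty-six rolls overwhelm it. The calculation only says the dice are not behaving like fair dice. It cannot distinguish sentience from, say, loaded dice.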
