GPT-4 and professional benchmarks: the wrong answer to the wrong question

(aisnakeoil.substack.com)

340 points agomez314 | 1 comments | 21 Mar 23 13:12 UTC

thwayunion ◴[21 Mar 23 13:28 UTC] No.35245821[source]▶

Absolutely correct.

We already know this about self-driving cars. Passing a driver's test was already possible in 2015 or so, but SDCs clearly aren't ready for L5 deployment even today.

There are also a lot of excellent examples of failure modes in object detection benchmarks.

Tests, such as driver's tests or standardized exams, are designed for humans. They make a lot of entirely implicit assumptions about failure modes and gaps in knowledge that are uniquely human. Automated systems work differently. They don't fail in the same way that humans fail, and therefore need different benchmarks.

Designing good benchmarks that probe GPT systems for common failure modes and weaknesses is actually quite difficult. Much more difficult than designing or training these systems, IME.

replies(12): >>35245981 #>>35246141 #>>35246208 #>>35246246 #>>35246355 #>>35246446 #>>35247376 #>>35249238 #>>35249439 #>>35250684 #>>35251205 #>>35252879 #

dcolkitt ◴[21 Mar 23 13:56 UTC] No.35246141[source]▶

>>35245821 #

I'd also add that almost all standardized tests are designed for introductory material across millions of people. That kind of information is likely to be highly represented in the training corpus. Whereas most jobs require highly specialized domain knowledge that's probably not well represented in the corpus, and probably too expansive to fit into the context window.

Therefore standardized tests are probably "easy mode" for GPT, and we shouldn't over-generalize its performance there to its ability to actually add economic value in economically useful jobs. Fine-tuning is maybe a possibility, but it's expensive and fragile, and I don't think it's likely that every single job is going to get a fine-tuned version of GPT.

replies(2): >>35246365 #>>35246438 #

kolbe ◴[21 Mar 23 14:16 UTC] No.35246438[source]▶

>>35246141 #

To add further, these parlor tricks are nothing new. Watson won Jeopardy in 2011, and never produced anything useful. Doing well on the SAT is just another sleight-of-hand trick to distract us from the fact that it doesn't really do anything beyond aggregating online information.

replies(1): >>35248521 #

WalterSear ◴[21 Mar 23 16:28 UTC] No.35248521[source]▶

>>35246438 #

The issue at hand is that a huge number of people make a living by aggregating online information. They might convey this to others via speech, but the 'human touch' isn't always adding anything to the interaction.
