thwayunion:
Absolutely correct.

We've already seen this play out with self-driving cars. Passing a driver's test was technically possible by 2015 or so, but SDCs clearly aren't ready for L5 deployment even today.

There are also a lot of excellent examples of failure modes in object detection benchmarks.

Tests, such as driver's tests or standardized exams, are designed for humans. They make a lot of entirely implicit assumptions about failure modes and gaps in knowledge that are uniquely human. Automated systems work differently. They don't fail in the same way that humans fail, and therefore need different benchmarks.

Designing good benchmarks that probe GPT systems for common failure modes and weaknesses is actually quite difficult. Much more difficult than designing or training these systems, IME.
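
To make that concrete: here's a minimal sketch of the kind of probe I mean, assuming a hypothetical query_model() wrapper around whatever system is under test. Human exams ask each question once; an LLM benchmark has to ask the same thing several ways, because answer instability under paraphrase is a failure mode human test design never had to account for.

    def query_model(prompt: str) -> str:
        """Hypothetical wrapper around the model under test."""
        raise NotImplementedError

    # Each inner list is one fact asked three different ways. A human
    # test-taker who knows the answer gives it regardless of phrasing;
    # LLMs can and do flip answers under rewording.
    PARAPHRASE_SETS = [
        [
            "What is the boiling point of water at sea level, in Celsius?",
            "At sea level, water boils at what Celsius temperature?",
            "In degrees Celsius, at what temperature does water boil at sea level?",
        ],
    ]

    def consistency_failures(paraphrase_sets):
        failures = []
        for prompts in paraphrase_sets:
            answers = [query_model(p).strip().lower() for p in prompts]
            if len(set(answers)) > 1:  # any disagreement is a failure case
                failures.append(list(zip(prompts, answers)))
        return failures

Consistency-under-paraphrase is only one axis, of course; the hard part is knowing which perturbations are worth enumerating in the first place.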

zer00eyz:
> good benchmarks ... failure modes and weaknesses is actually quite difficult. Much more difficult than designing or training these systems

Is it? Based on the restrictions placed on the systems we see today and the way people are breaking them, I would say that some failure modes are known.

brookst:
I think the hard / unknown part is how you know you’ve identified all of the failure modes that need to be tested.

Tests of humans have evolved over a long time and across a large sample, and humans may be more similar to one another than LLMs are, so human failure modes may be more universal.

But a very short history, small sample sizes, and diversity of architecture and training mean we really don't know how to test and measure LLMs. Yes, some failure modes are known, but how many are not?
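
One way to at least reason about "how many are not": borrow the capture-recapture logic ecologists use to estimate unseen species. A rough sketch, assuming you keep a log of which distinct failure mode each observed failure belongs to; the Chao1 estimator infers how many modes remain unseen from how many you've seen only once or twice. (Applying the species analogy to LLM failure modes is an assumption on my part, not established eval practice.)

    from collections import Counter

    def chao1_estimate(failure_log):
        """Chao1 lower bound on the total number of distinct failure
        modes, given a log of observed failures labeled by mode."""
        counts = Counter(failure_log)
        observed = len(counts)
        f1 = sum(1 for c in counts.values() if c == 1)  # modes seen once
        f2 = sum(1 for c in counts.values() if c == 2)  # modes seen twice
        if f2 == 0:
            return observed + f1 * (f1 - 1) / 2.0  # bias-corrected form
        return observed + (f1 * f1) / (2.0 * f2)

    # Illustrative log: 6 distinct modes, most seen only once, which is
    # exactly the regime where the estimate says many more remain unseen.
    log = ["hallucinated_citation", "prompt_injection", "unit_confusion",
           "prompt_injection", "off_by_one", "refusal_loop",
           "negation_flip", "prompt_injection"]
    print(chao1_estimate(log))  # -> 16.0, far more than the 6 observed

It's only a lower bound, and the sampling assumptions are shaky for adversarial users, but it captures the point: a sample dominated by singleton failure modes is strong evidence the inventory is far from complete.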

zer00eyz:
> Tests of humans have evolved over a long time and across a large sample, and humans may be more similar to one another than LLMs are, so human failure modes may be more universal.

Reading this, the idea that sociopaths and psychopaths can pass as "normal" springs to mind.

Is what an LLM is doing any different from what those people do?

https://medium.datadriveninvestor.com/the-best-worst-funnies...

For people, language is spoken before it is written... there is a lot of biology in the spoken word (visual and audio cues)... I think without these, these sorts of models are going to hit a wall pretty quickly.

brookst:
> Reading this, the idea that sociopaths and psychopaths can pass as "normal" springs to mind.

> Is what an LLM is doing any different from what those people do?

I think it's too big a question to have any meaning. Which sociopaths? Which LLMs? For what differences? It's like asking "is a car any different from an airplane?" Yes, obviously, in some ways. No, they are identical in others.