GPT-4 and professional benchmarks: the wrong answer to the wrong question

(aisnakeoil.substack.com)

340 points agomez314 | 3 comments | 21 Mar 23 13:12 UTC

jrochkind1 ◴[21 Mar 23 14:23 UTC] No.35246560[source]▶

Just realized that this is yet another example of the category of "imperfect measurement" that HN likes to talk about so much -- Goodhart's law and such.

The bar exam is a proxy for actually being a competent lawyer. But it's an imperfect proxy. It seems obvious that it will be possible for a computer program to target doing well on the bar exam, and to do well on it without being a competent lawyer. It may also be possible for a human to do that? But hard enough that it's still a reasonable proxy when the test takers are humans?

replies(2): >>35246726 #>>35247002 #

Verdex ◴[21 Mar 23 14:50 UTC] No.35247002[source]▶

>>35246560 #

Yeah, a good thought experiment (or possibly even a good experiment to just straight up run) is to try and decide what's the simplest computer program that could possibly pass the bar exam. And then ask people if they would feel comfortable with that program being their lawyer.

So the most obvious solution is to steal the answers and then have the computer blindly paste them into the test, although critics might say that this is 'obviously' not the same as what the language model is doing.

I suspect you could pass the bar exam with a bunch of regexes that have an associated formattable string and/or answer result. If someone actually did this, I wonder if people would still be excited about language model techniques passing the bar exam.
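To make the idea concrete, here is a minimal sketch of what "regexes with an associated formattable string and/or answer result" could look like. Everything in it is hypothetical: the `RULES` table, the `answer` function, the patterns, and the canned answers are invented for illustration, and nothing here has been tried against a real bar exam.

```python
import re
from typing import Optional

# Hypothetical rule table: each entry pairs a regex with either a
# formattable answer template or a fixed answer result.
# The patterns and answers are invented for illustration only.
RULES = [
    # Formattable string: the named group is substituted into the template.
    (re.compile(r"\bmay (?P<party>\w+) recover\b", re.IGNORECASE),
     "Whether {party} may recover depends on the elements of the claim."),
    # Fixed answer result: always pick the same multiple-choice option.
    (re.compile(r"\bwhich of the following\b", re.IGNORECASE),
     "C"),
]


def answer(question: str) -> Optional[str]:
    """Return the answer from the first matching rule, or None if no rule matches."""
    for pattern, template in RULES:
        match = pattern.search(question)
        if match:
            # groupdict() is empty for patterns without named groups,
            # so fixed answers pass through format() unchanged.
            return template.format(**match.groupdict())
    return None


print(answer("May Alice recover damages from Bob?"))
print(answer("Which of the following is correct?"))
print(answer("Draft a memo on the facts below."))
```

A program like this never models the facts of a question; it only matches the surface text, which is the point of the thought experiment.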

replies(1): >>35247946 #

1. jrochkind1 ◴[21 Mar 23 15:51 UTC] No.35247946[source]▶

>>35247002 #

I mean, stealing the answers (or even just the questions) would be cheating. Of course stolen answers would make it a lot easier for a human to pass too. Nobody is surprised that if you cheat then the exam is no longer a good proxy measure for professional competency! Nobody expects or intends it to be a good proxy measure for those who have stolen the questions.

I actually doubt you could write software to pass the bar exam with "a bunch of regexes that have an associated formattable string and/or answer result." I'm not even sure what that means, but I suspect you aren't familiar with bar exams. They are very hard even for humans who are in fact familiar with the material; they can contain "trick" questions and require thinking about edge cases, etc. They generally include both essay questions and multiple choice -- and it can be very tricky multiple choice.

Here's just one of the first samples I found googling; I have trouble imagining "regexes with associated answers" doing anything useful here: https://barexam.virginia.gov/pdf/essays/2022%20February%20VA...

Here are some multiple choice examples: https://www.ncbex.org/pdfviewer/?file=%2Fdmsdocument%2F17

I'd be shocked. But feel free to spend a couple years trying to prove me wrong!

replies(1): >>35253355 #

2. vageli ◴[21 Mar 23 22:10 UTC] No.35253355[source]▶

>>35247946 (TP) #

A bit off topic but I enjoyed reading the bar exam questions you linked and also found that Virginia publishes their answers. For those interested, you can view them by year (the answers also contain the question text): https://barexam.virginia.gov/bar/barsampleanswer.html

replies(1): >>35254714 #

3. jrochkind1 ◴[22 Mar 23 00:07 UTC] No.35254714[source]▶

>>35253355 #

It has been pointed out by many that computer programming thinking and lawyer thinking use similar kinds of mental approaches!
