340 points agomez314 | 2 comments
kybernetikos No.35246617
I gave ChatGPT the four-card logic puzzle, which lots of humans struggle with, and it got it exactly right, so I was very impressed. Then I realised that the formulation I'd given it (the same as in the original study) was almost certainly part of its training set.

I made extremely minor changes to the way the question was phrased and it failed badly, not just getting the answer wrong but falling into incoherence, claiming that T was a vowel or that 3 was an even number.

The sheer size of its training set can give a misleading impression of its reasoning capabilities. It can apply simple logic to situations, even ones it hasn't seen, but without its training data to lean on, its reasoning falls apart not far beyond the first couple of lectures of an introductory first-order logic course.

The fact that it can do logic at all is impressive to me, though. I'm interested to see how much deeper its genuine capability goes as more advanced models arrive.

replies(3): >>35246998 >>35251307 >>35252827
1. joenot443 No.35251307
I'd never heard of that puzzle; it seems like a great test for ChatGPT though. Wikipedia defines the problem as:

You are shown a set of four cards placed on a table, each of which has a number on one side and a colored patch on the other side. The visible faces of the cards show 3, 8, red and brown. Which card(s) must you turn over in order to test the truth of the proposition that if a card shows an even number on one face, then its opposite face is red?
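
For reference, the standard answer is to flip the 8 and the brown card. Here's a minimal Python sketch (mine, not from the thread) of the reasoning, with the card faces from the Wikipedia formulation above:

    # The rule "if a card shows an even number, its other face is red"
    # can only be falsified by an even number hiding a non-red back,
    # or (contrapositive) a non-red face hiding an even number.

    cards = [
        {"visible": "3", "kind": "number"},
        {"visible": "8", "kind": "number"},
        {"visible": "red", "kind": "color"},
        {"visible": "brown", "kind": "color"},
    ]

    def must_flip(card):
        if card["kind"] == "number":
            # Only an even number can falsify the rule.
            return int(card["visible"]) % 2 == 0
        # Only a non-red face can falsify it; a red back proves nothing.
        return card["visible"] != "red"

    print([c["visible"] for c in cards if must_flip(c)])  # ['8', 'brown']

The classic mistake is flipping the red card: a red back is consistent with the rule no matter what's on the other side, so it can never falsify it.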

replies(1): >>35359090
2. amai No.35359090
If it's in Wikipedia, ChatGPT has probably already seen and memorised it.