
340 points agomez314 | 3 comments | | HN request time: 0.002s | source
kybernetikos ◴[] No.35246617[source]
I gave ChatGPT the four cards logic puzzle, lots of humans struggle with it, but chatGPT got it exactly right, so I was very impressed. Then I realised that the formulation I'd given it (the same as the original study) was almost certainly part of its training set.

I made extremely minor changes to the way the question was phrased and it failed badly, not just getting the answer wrong but falling into incoherence, claiming that T was a vowel, or that 3 was an even number.

The sheer size of its training set can give a misleading impression of its reasoning capabilities. It can apply simple logic to situations, even situations it hasn't seen, but the logic can't get much beyond the first couple of lectures of an introductory first-order logic course before it starts to fall apart, unless it can lean on its training data.

The fact that it can do logic at all is impressive to me, though; I'm interested to see how much deeper its genuine capability goes as we get more advanced models.

replies(3): >>35246998 #>>35251307 #>>35252827 #
1. cmuguythrow ◴[] No.35252827[source]
Did you try this recently with GPT-4? Here's my attempt to replicate your test:

---Prompt---

I show you a set of four magisterial unicorns in a cute pink stable, each of them has a letter painted on one side of their body and a shape painted on the other. Right now you see all four unicorns, two of which are showing the shape side of their body and two of which show the letter side. You see that the unicorns read: 'A', 'S', Triangle, Square

Which unicorns do you need to spin around to test the validity of the following statement:

"if a unicorn shows a consonant on one side of its body, then the opposite side of its body is a triangle?"

---GPT-4---

To test the validity of the statement "if a unicorn shows a consonant on one side of its body, then the opposite side of its body is a triangle," you need to look for potential counterexamples to the rule. The two cases to check are:

* Unicorns with a consonant showing.

* Unicorns with a triangle showing.

You do not need to check unicorns with a vowel (like 'A') or a different shape (like Square), as they don't violate the statement.

So, you need to spin the following unicorns:

* The unicorn with the letter 'S' (to check if it has a triangle on the other side)

* The unicorn with the Triangle shape (to check if it has a consonant on the other side)

---

So it got the answer wrong, but in a way that I think a lot of humans would (you flip the square unicorn, not the triangle). Still seriously impressive in my book, and definitely not falling into incoherence.
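For reference, the correct selection ('S' and the square) falls out mechanically if you check which visible faces could hide a counterexample to "consonant implies triangle". A minimal Python sketch (the function and names here are my own, not from the thread):

```python
# Rule under test: "if a unicorn shows a consonant on one side,
# the other side is a triangle."
# A face is worth flipping only if its hidden side could falsify the rule.

SHAPES = {"triangle", "square"}
VOWELS = set("AEIOU")

def must_flip(face):
    """Return True if this visible face could conceal a counterexample."""
    if face in SHAPES:
        # A shape face can hide a violation only if it is NOT a triangle:
        # the hidden letter might be a consonant (the contrapositive case).
        return face != "triangle"
    # A letter face can hide a violation only if it is a consonant:
    # the hidden shape might not be a triangle.
    return face.upper() not in VOWELS

faces = ["A", "S", "triangle", "square"]
print([f for f in faces if must_flip(f)])  # ['S', 'square']
```

The triangle is the tempting wrong pick: the rule says nothing about what must be behind a triangle, so flipping it can neither confirm nor refute anything.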

replies(2): >>35258727 #>>35364739 #
2. kybernetikos ◴[] No.35258727[source]
Going from crazy nonsense to wrong but arguably human-level performance (80% of humans are bad at this task) is still a nice improvement. I'll have to give it some of my logic tests and see how it does.
3. kybernetikos ◴[] No.35364739[source]
I used a slight reworking and got a response similar to yours with GPT-4. I tried to prime it to think through the possibilities by giving it the context:

"This is a difficult problem that many people get wrong. Start by reminding yourself of basic logic rules. Then apply the logic rules to the unicorn situation, considering each unicorn in turn and understanding what it would mean for the rule if the unicorn is turned around. Only after doing that conclude with the unicorns that Tom should turn to have a chance of proving Paul wrong."

I gave it this instruction because of other articles I've read showing that forcing it to give the answer before the reasoning makes it wrong more often. It correctly identified that it should use the contrapositive, but still misapplied it, so I gave it that feedback:

"your third consideration is a misapplication of the contrapositive. Can you try that case again?"

It then turned out it hadn't generated a consideration of the last unicorn (it's possible I was being throttled), so I said:

"Consider Unicorn 4 with the contrapositive rule"

With those extra pieces of guidance it gave the right answer, and for the right reasons. While I was hoping for better, this is still a meaningful improvement over GPT-3.5's performance on the same prompt - its answer was so muddled I couldn't see how to coach it.