I made extremely minor changes to the way the question was phrased and it failed badly: not only did it get the answer wrong, it fell into incoherence, claiming that T was a vowel or that 3 was an even number.
The sheer size of its training set can give a misleading impression of its reasoning capabilities. It can apply simple logic to situations, even ones it hasn't seen before, but once it can't lean on its training data, that logic falls apart not far beyond the first couple of lectures of an introductory first-order logic course.
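To be concrete about the level I mean, it's roughly universal instantiation plus modus ponens, something like the sketch below (the Vowel/Letter predicates are just made-up illustrations, not prompts I actually used):

    -- minimal sketch in Lean of "week one" first-order reasoning:
    -- from "every vowel is a letter" and "a is a vowel", conclude "a is a letter"
    example (α : Type) (Vowel Letter : α → Prop) (a : α)
        (h₁ : ∀ x, Vowel x → Letter x) (h₂ : Vowel a) : Letter a :=
      h₁ a h₂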
The fact that it can do logic at all is impressive to me, though, and I'm interested to see how much deeper its genuine capability goes as more advanced models arrive.