129 points jxmorris12 | 1 comment

Imnimo ◴[] No.43131684[source]
I have become a little more skeptical of LLM "reasoning" after DeepSeek (and now Grok) let us see the raw outputs. Obviously we can't deny the benchmark numbers - models do get the answer right more often given thinking time, and thinking does let them solve really hard benchmarks. Sometimes the thoughts are scattered and inefficient but do eventually hit on the solution. Other times, the models seem to fall into the kind of trap LeCun described.

Here are some examples from playing with Grok 3. My test query was, "What is the name of a Magic: The Gathering card that has all five vowels in it, each occurring exactly once, and the vowels appear in alphabetical order?" The motivation here is that this seems like a hard question to just one-shot, but given sufficient ability to keep recalling different card names, it's very easy to solve by guess-and-check. (For those interested, valid answers include "Scavenging Ghoul", "Angelic Chorus" and others)
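
For reference, the property itself is trivial to verify mechanically. A minimal Python sketch of the check (the function name is mine):

    def vowels_in_order_once(name):
        # Pull out the vowels in the order they appear, ignoring case.
        vowels = [c for c in name.lower() if c in "aeiou"]
        # Exactly a, e, i, o, u -- each once, already in alphabetical order.
        return vowels == list("aeiou")

    print(vowels_in_order_once("Scavenging Ghoul"))       # True
    print(vowels_in_order_once("Angelic Chorus"))         # True
    print(vowels_in_order_once("Abian, Luvion Usurper"))  # False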

In one attempt, Grok 3 spends 10 minutes (!!) repeatedly checking whether "Abian, Luvion Usurper" satisfies the criteria. It'll list out the vowels, conclude it doesn't match, and then go, "Wait, but let's think differently. Maybe the card is "Abian, Luvion Usurper," but no", and just produce variants of that thinking. Counting occurrences of the word "Abian" suggests it tested this theory 800 times before eventually timing out (or otherwise breaking), presumably just because the site got overloaded.
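
(The tally was nothing fancy - plain string counting over the saved trace, assuming you've dumped the thinking output to a file; the filename here is hypothetical:)

    trace = open("grok_trace.txt").read()  # hypothetical dump of the raw thinking output
    print(trace.count("Abian"))            # ~800 in the run described above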

In a second attempt, it decides to check "Our Market Research Shows That Players Like Really Long Card Names So We Made this Card to Have the Absolute Longest Card Name Ever Elemental" (this is a real card from a joke set). It attempts to write out the vowels:

>but let's check its vowels: O, U, A, E, E, A, E, A, E, I, E, A, E, O, A, E, A, E, O, A, E, E, E, A, E, O, A, E, E, E, A, E, O, A, E, E, E, A, E, O, A, E ...

It continues like this for about 600 more vowels, before emitting a stray Russian fragment ("продуктив", the stem of "productive") and breaking out:

>...E, O, A, E, E, E, A, E, O, A, E, E, E, A, E, O продуктив

These two examples seem like the sort of failures LeCun conjectured. The model gets into a cycle of self-reinforcing, unproductive behavior. Every time it checks Abian, or emits another "AEEEAO", it becomes even more probable that the next tokens will be the same.
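
You can get a feel for that feedback loop with a deliberately crude toy - this is frequency-based copying from the context, not how a transformer actually works, but it shows how repetition compounds under greedy decoding:

    from collections import Counter

    def next_token_probs(context):
        # Toy stand-in for a model that copies from its own output:
        # the next-token distribution is just the empirical frequency
        # of tokens already present in the context.
        counts = Counter(context)
        return {tok: n / len(context) for tok, n in counts.items()}

    context = ["Maybe", "the", "card", "is", "Abian", "but", "no",
               "wait", "maybe", "Abian"]
    for _ in range(6):
        probs = next_token_probs(context)
        context.append(max(probs, key=probs.get))  # greedy decoding
    print(context[-6:])  # ['Abian', 'Abian', 'Abian', 'Abian', 'Abian', 'Abian']

Once "Abian" is the most frequent token, every greedy step makes it more frequent still, and the loop never exits.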

replies(5): >>43132153 #>>43132334 #>>43132654 #>>43133212 #>>43188190 #
dunefox ◴[] No.43133212[source]
I don't know whether treating a model as a database is really a good measure.
replies(1): >>43133298 #
Imnimo ◴[] No.43133298[source]
Yeah, I'm not so much interested in "can you think of the right card name from among thousands?" I just want to see that it can produce a thinking procedure that makes sense. If it ends up not being able to recall the right name despite following a good guess-and-check process, I'd still consider that a satisfactory result.

And to the models' credit, they do start off with a valid guess-and-check process. They list cards, write out the vowels, and check whether each name fits the criteria. But eventually they tend to go off the rails in a way that is worrying.
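
For what it's worth, the sane version of that procedure is just a filter over a card list rather than sampling guesses from memory. A sketch, assuming the names sit one per line in a local file (the filename and the source of the list are hypothetical - e.g. extracted from a Scryfall bulk-data dump):

    def vowels_in_order_once(name):
        vowels = [c for c in name.lower() if c in "aeiou"]
        return vowels == list("aeiou")

    # card_names.txt: one card name per line (hypothetical file)
    with open("card_names.txt") as f:
        for line in f:
            name = line.strip()
            if vowels_in_order_once(name):
                print(name)  # e.g. "Scavenging Ghoul", "Angelic Chorus"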