
688 points crescit_eundo
PaulHoule No.42141647
Maybe the one that plays chess well is calling out to a real chess engine.
singularity2001 No.42141726
this possibility is discussed in the article and deemed unlikely
probably_wrong No.42141854
Note: the possibility is not mentioned in the article but rather in the comments [1]. I had to click a bit to see it.

The fact that the one closed-source model is the only one that plays well seems to me like a clear case of the interface doing some of the work. If you ask ChatGPT to count to 10,000 (something most LLMs can't do, for known reasons), you get an answer that's clearly pre-programmed. I'm sure the same is happening here (and with many, many other tasks) - the author argues against it by saying "but why isn't it better?", which doesn't seem like the best argument: I can imagine that typical ChatGPT users enjoy the product more if they have a chance to win once in a while.

[1] https://dynomight.substack.com/p/chess/comment/77190852

refulgentis No.42142803
What do you mean LLMs can't count to 10,000 for known reasons?

Separately, if you can show that OpenAI is serving pre-canned responses in some instances instead of running inference, you will get a ton of attention if you write it up.

I'm not saying this in an aggro tone; it's a genuinely interesting subject to me, because I initially wrote off LLMs on the assumption that exactly this was going on.* Then I spent the last couple of years laughing at myself for thinking they would do that. I would be some mix of fascinated and horrified to see it come full circle.

* I can't remember what, exactly; it was as far back as 2018. But someone argued that OpenAI was patching in individual answers because scaling was dead and they had no answers, way before ChatGPT.

probably_wrong No.42145200
When it comes to counting, LLMs have a couple of issues.

First, tokenization: the tokenization of 1229 is not guaranteed to be [1,2,2,9] but it could very well be [12,29] and the "+1" operation could easily generate tokens [123,0] depending on frequencies in your corpus. This constant shifting in tokens makes it really hard to learn rules for "+1" ([9,9] +1 is not [9,10]). This is also why LLMs tend to fail at tasks like "how many letters does this word have?": https://news.ycombinator.com/item?id=41058318
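
Here's a minimal sketch of how to see this for yourself, using OpenAI's tiktoken tokenizer (the exact splits depend on the vocabulary, so treat the output as illustrative rather than as what any particular model sees):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era vocabulary
    for n in [999, 1000, 1229, 1230]:
        ids = enc.encode(str(n))
        pieces = [enc.decode([i]) for i in ids]
        print(n, ids, pieces)  # watch how the pieces shift under "+1"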

Second, you need your network to understand that "+1" is worth learning. Writing "+1" as a combination of sigmoids, products, and additions over normalized floating-point values (hello, loss of precision) is not trivial without degrading a chunk of your network, and what for? After all, math is not in the domain of language and, since we're not training a math model here, your loss function may miss it entirely.
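
A toy illustration of the precision point, using plain float32 arithmetic rather than a trained network (my example, not something from the article):

    import numpy as np

    x = np.float32(2**24)          # 16777216: the end of the range float32 counts exactly
    print(x + np.float32(1) == x)  # True -- 16777217 isn't representable, so the "+1" vanishes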

And finally there's statistics: the three-legged-dog problem is figuring out from corpora that a dog has four legs when no one ever writes "the four-legged dog" (because it's obvious), yet every reference to an unusual dog includes that description. So if people write "1+1 equals 3" satirically, your network may pick that up as fact. And how often has your network seen the result of "6372 + 1"?

But you don't have to take my word for it - take an open LLM and ask it to generate integers between 7824 and 9954. I'm not optimistic that it will make it through without hallucinations.
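
If you want to try it, here's a rough sketch of the experiment, assuming llama-cpp-python and a local GGUF model (the model path and prompt wording are placeholders; the final check saves you from eyeballing a thousand lines):

    from llama_cpp import Llama  # pip install llama-cpp-python

    llm = Llama(model_path="llama-3.1-8b-instruct-q4.gguf", n_ctx=8192)
    prompt = "List every integer from 7824 to 8948, one per line, nothing else.\n"
    out = llm(prompt, max_tokens=6000, temperature=0.0)
    text = out["choices"][0]["text"]

    # Verify the whole sequence programmatically
    nums = [int(s) for s in text.split() if s.isdigit()]
    print(nums == list(range(7824, 8949)))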

refulgentis No.42154471
> But you don't have to take my word for it - take an open LLM and ask it to generate integers between 7824 and 9954.

Been excited to try this all day; finally got around to it, and Llama 3.1 8B did it. It's my app built on llama.cpp, no shenanigans: temp 0, top-p 100, 4-bit quantization, model name in the screenshot [^1].

I did 7824 to 8948; it protested more at 9954, which made me reconsider whether I'd want to read that many numbers to double-check :) and I figured x + 1024 is isomorphic to the original case of you trying it on OpenAI and wondering whether the answer was really the result of inference.

My prior was that of course it would do this; it's a sequence. I understand, as you correctly note, the need for token healing in cases where, e.g., notation in an equation would otherwise prevent the "correct" digit. But I don't see any reason why it'd mess up a sequential list of integers.
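
For anyone who hasn't run into token healing: the idea, roughly, is to back up the last token of the prompt and constrain the first generated token to ones whose text starts with the characters you removed. A conceptual sketch (my illustration, not any library's actual API):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def heal(prompt):
        # Returns (trimmed prompt ids, allowed first tokens) for constrained decoding.
        ids = enc.encode(prompt)
        *head, last = ids
        suffix = enc.decode([last])  # the text the model must re-produce
        allowed = []
        for t in range(enc.n_vocab):
            try:
                piece = enc.decode_single_token_bytes(t).decode("utf-8", "replace")
            except KeyError:  # some ids in the range are unassigned
                continue
            if piece.startswith(suffix):
                allowed.append(t)
        return head, allowed  # feed head to the model, mask logits down to allowed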

In general, as long as we're on the topic: I find the handwaving people do about tokenization being a problem a bit silly. I'd definitely caution against using the post you linked as a citation; it reads like a rote repetition of the idea that tokenization causes problems, an idea that spreads like a game of telephone.

It's also a perfect example of the weakness of the genre: just because the model sees [5077, 5068, 5938] instead of "strawberry" doesn't mean it can't infer that 5077 = "st" = 0 rs, 5068 = "raw" = 1 r, 5938 = "berry" = 2 rs. In fact, it infers things from broken-up subsequences all the time -- it's how it works! If single-character tokenization got you free math / counting reliability, we'd very quickly switch to it.
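
A quick sanity check of that claim on a real vocabulary (cl100k here; those ids came from a different tokenizer, so the splits won't match exactly):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    pieces = [enc.decode([i]) for i in enc.encode("strawberry")]
    print(pieces)                             # a handful of sub-word pieces
    print(sum(p.count("r") for p in pieces))  # 3 -- the letter info survives the split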

(not saying you're advocating for the argument or that you're misinformed; just speaking colloquially, like I would with a friend over a beer)

[^1]: https://imgur.com/a/vEvu2GD