A non-anthropomorphized view of LLMs

(addxorrol.blogspot.com)

barrkel ◴[06 Jul 25 23:14 UTC] No.44485012[source]▶

The problem with viewing LLMs as just sequence generators, and malbehaviour as bad sequences, is that it simplifies too much. LLMs have hidden state not necessarily directly reflected in the tokens being produced and it is possible for LLMs to output tokens in opposition to this hidden state to achieve longer term outcomes (or predictions, if you prefer).

Is it too anthropomorphic to say that this is a lie? To say that the hidden state and its long term predictions amount to a kind of goal? Maybe it is. But we then need a bunch of new words which have almost 1:1 correspondence to concepts from human agency and behavior to describe the processes that LLMs simulate to minimize prediction loss.

Reasoning by analogy is always shaky, so coining those new words probably wouldn't be so bad. But it would also amount to impenetrable jargon, and it would be an uphill struggle to promulgate.

Instead, we use the anthropomorphic terminology, and then find ways to classify LLM behavior in human concept space. They are very defective humans, so it's still a bit misleading, but at least jargon is reduced.

replies(7): >>44485190 #>>44485198 #>>44485223 #>>44486284 #>>44487390 #>>44489939 #>>44490075 #

d3m0t3p ◴[06 Jul 25 23:46 UTC] No.44485223[source]▶

>>44485012 #

Do they? LLMs embed the token sequence N^{L} into R^{LxD}; we have some attention, and the output is also R^{LxD}; then we apply a projection to the vocabulary and get R^{LxV}, so for each token we get a likelihood over the vocabulary. In the attention, you can have multi-head attention (or whatever version is fancy: GQA, MLA) and therefore multiple representations, but each one is always tied to a token. I would argue that there is no hidden state independent of a token.
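The shape bookkeeping above can be followed in a minimal sketch. Everything here is an assumption for illustration: the sizes `L`, `D`, `V`, the random weights, and `causal_mix`, which is a plain running average standing in for real attention. It only shows that every intermediate row is indexed by a token position.

```python
import math
import random

L, D, V = 4, 3, 5  # sequence length, model width, vocabulary size (all made up)
rng = random.Random(0)
embedding = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(V)]    # V x D
unembedding = [[rng.uniform(-1, 1) for _ in range(V)] for _ in range(D)]  # D x V

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mix(hidden):
    """Stand-in for attention: position t averages positions 0..t."""
    mixed = []
    for t in range(len(hidden)):
        mixed.append([sum(hidden[s][d] for s in range(t + 1)) / (t + 1)
                      for d in range(D)])
    return mixed

def forward(token_ids):
    hidden = [embedding[i] for i in token_ids]      # N^L -> R^{LxD}
    hidden = causal_mix(hidden)                     # R^{LxD} -> R^{LxD}
    logits = [[sum(h[d] * unembedding[d][v] for d in range(D))
               for v in range(V)] for h in hidden]  # R^{LxD} -> R^{LxV}
    return hidden, [softmax(row) for row in logits]

hidden, probs = forward([0, 3, 1, 4])
```

Here `hidden` has one row of width `D` per input token, and `probs` has one distribution over the `V` vocabulary entries per input token.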

Whereas LSTMs or structured state space models, for example, have a state that is updated and not tied to a specific item in the sequence.

I would argue that his text is easily understandable except for the notation of the function. Explaining that you can compute a probability based on previous words is understandable by everyone, without having to resort to anthropomorphic terminology.

replies(1): >>44485294 #

barrkel ◴[06 Jul 25 23:56 UTC] No.44485294[source]▶

>>44485223 #

There is hidden state as plain as day merely in the fact that logits for token prediction exist. The selected token doesn't give you information about how probable other tokens were. That information, that state which is recalculated in autoregression, is hidden. It's not exposed. You can't see it in the text produced by the model.
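A small sketch of the point about logits, with a made-up three-word vocabulary and made-up logit values: two very different distributions can emit the same token, so the emitted text alone does not reveal how probable the alternatives were. Greedy selection is assumed here purely to keep the example deterministic.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_token(logits, vocab):
    """Return the most probable token and the full (normally unseen) distribution."""
    probs = softmax(logits)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs

VOCAB = ["yes", "no", "maybe"]          # hypothetical vocabulary
confident_logits = [5.0, 0.0, 0.0]      # hypothetical: nearly all mass on "yes"
hesitant_logits = [1.0, 0.9, 0.8]       # hypothetical: almost a three-way tie

token_a, probs_a = greedy_token(confident_logits, VOCAB)
token_b, probs_b = greedy_token(hesitant_logits, VOCAB)
```

Both calls emit the same token, while `probs_a` and `probs_b` differ sharply; only the token reaches the output text.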

There is plenty of state not visible when an LLM starts a sentence that only becomes somewhat visible when it completes the sentence. The LLM has a plan, if you will, for how the sentence might end, and you don't get to see an instance of that plan unless you run autoregression far enough to get those tokens.

Similarly, it has a plan for paragraphs, for whole responses, for interactive dialogues, plans that include likely responses by the user.

replies(2): >>44485385 #>>44485919 #

1. gpm ◴[07 Jul 25 01:44 UTC] No.44485919[source]▶

>>44485294 #

The LLM does not "have" a plan.

Arguably there's reason to believe it comes up with a plan when it is computing token probabilities, but it does not store it between tokens. I.e. it doesn't possess or "have" it. It simply comes up with a plan, emits a token, and throws away all its intermediate thoughts (including any plan) to start again from scratch on the next token.

replies(4): >>44485976 #>>44486317 #>>44488268 #>>44488470 #

2. NiloCK ◴[07 Jul 25 01:53 UTC] No.44485976[source]▶

>>44485919 (TP) #

I don't think that the comment above you made any suggestion that the plan is persisted between token generations. I'm pretty sure you described exactly what they intended.

replies(2): >>44486020 #>>44488767 #

3. gpm ◴[07 Jul 25 02:00 UTC] No.44486020[source]▶

>>44485976 #

I agree. I'm suggesting that the language they are using is unintentionally misleading, not that they are factually wrong.

4. lostmsu ◴[07 Jul 25 02:55 UTC] No.44486317[source]▶

>>44485919 (TP) #

This is wrong: intermediate activations are preserved when going forward.

replies(1): >>44488134 #

5. ACCount36 ◴[07 Jul 25 08:49 UTC] No.44488134[source]▶

>>44486317 #

Within a single forward pass, but not from one emitted token to another.

replies(1): >>44490852 #

6. yorwba ◴[07 Jul 25 09:11 UTC] No.44488268[source]▶

>>44485919 (TP) #

It's true that the last layer's output for a given input token only affects the corresponding output token and is discarded afterwards. But the penultimate layer's output affects the computation of the last layer for all future tokens, so it is not discarded, but stored (in the KV cache). Similarly for the antepenultimate layer affecting the penultimate layer and so on.

So there's plenty of space in intermediate layers to store a plan between tokens without starting from scratch every time.
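A toy sketch of that layering, assuming a hypothetical two-layer decoder with scalar hidden states and made-up weights (`ToyDecoder`, `attend`, and the constants are illustrative, not any real model). Each layer's keys and values are computed from the previous layer's output and kept for later tokens; only the final layer's output is used once.

```python
import math

def attend(query, keys, values):
    """Scalar softmax attention over all cached positions."""
    scores = [query * k for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

class ToyDecoder:
    """Hypothetical decoder with 1-dimensional hidden states."""

    def __init__(self, num_layers=2):
        self.num_layers = num_layers
        # One (keys, values) pair of lists per layer: this is the KV cache.
        self.cache = [([], []) for _ in range(num_layers)]

    def step(self, x):
        """Process one new token embedding x, reading earlier tokens from the cache."""
        h = x
        for layer in range(self.num_layers):
            keys, values = self.cache[layer]
            # Derived from this layer's input, i.e. the previous layer's output,
            # and stored for every future token.
            keys.append(0.5 * h)
            values.append(h + 0.1 * layer)
            h = h + attend(h, keys, values)  # residual connection plus attention
        return h  # final-layer output: feeds this token's logits only

decoder = ToyDecoder()
outputs = [decoder.step(x) for x in [1.0, -0.5, 2.0]]
alone = ToyDecoder().step(2.0)  # same last token, but no earlier context
```

After three tokens every layer's cache holds three entries, and `outputs[2]` differs from `alone` because the third token read the stored intermediate results of the first two.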

7. barrkel ◴[07 Jul 25 09:41 UTC] No.44488470[source]▶

>>44485919 (TP) #

I believe saying the LLM has a plan is a useful anthropomorphism for the fact that it does have hidden state that predicts future tokens, and this state conditions the tokens it produces earlier in the stream.

replies(2): >>44490837 #>>44492198 #

8. gugagore ◴[07 Jul 25 10:36 UTC] No.44488767[source]▶

>>44485976 #

The concept of "state" conveys two related ideas.

- the information sufficient to evolve the system. The state of a pendulum is its position and velocity (or momentum). If you take a single picture of a pendulum, you do not have a representation that lets you make predictions.

- information that is persisted through time. A stateful protocol is one where you need to know the history of the messages to understand what will happen next. (Or, analytically, it's enough to keep track of the sufficient state.) A procedure with some hidden state isn't a pure function. You can make it a pure function by making the state explicit.
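Both senses can be sketched with the pendulum itself. This is a minimal illustration under assumptions: `pendulum_step`, `StatefulPendulum`, the time step, and the `g_over_l` constant are made up, and the integrator is a simple semi-implicit Euler step.

```python
import math

def pendulum_step(state, dt=0.01, g_over_l=9.81):
    """Pure function: the full state (angle, velocity) goes in, the next state comes out."""
    angle, velocity = state
    velocity = velocity - g_over_l * math.sin(angle) * dt
    angle = angle + velocity * dt
    return (angle, velocity)

class StatefulPendulum:
    """Same dynamics, but the state is hidden inside the object and persists between calls."""

    def __init__(self, angle, velocity):
        self._state = (angle, velocity)

    def tick(self):
        self._state = pendulum_step(self._state)
        return self._state[0]  # only the angle is exposed, like a single picture

hidden = StatefulPendulum(0.5, 0.0)
angle_1 = hidden.tick()
angle_2 = hidden.tick()  # same call, different result: not a pure function

s0 = (0.5, 0.0)
s1 = pendulum_step(s0)
s2 = pendulum_step(s1)   # explicit state: same input always gives the same output
```

`tick()` returns a different angle each time because of the state it keeps, while `pendulum_step` reproduces the same trajectory once the state is passed in explicitly.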

9. ◴[07 Jul 25 14:38 UTC] No.44490837[source]▶

>>44488470 #

10. andy12_ ◴[07 Jul 25 14:39 UTC] No.44490852{3}[source]▶

>>44488134 #

What? No. The intermediate hidden states are preserved from one token to another. A token that is 100k tokens into the future will be able to look into the information of the present token's hidden state through the attention mechanism. This is why the KV cache is so big.

replies(1): >>44498567 #

11. godshatter ◴[07 Jul 25 16:50 UTC] No.44492198[source]▶

>>44488470 #

Are the devs behind the models adding their own state somehow? Do they have code that figures out a plan and use the LLM on pieces of it and stitch them together? If they do, then there is a plan, it's just not output from a magical black box. Unless they are using a neural net to figure out what the plan should be first, I guess.

I know nothing about how things work at that level, so these might not even be reasonable questions.

12. ACCount36 ◴[08 Jul 25 09:55 UTC] No.44498567{4}[source]▶

>>44490852 #

KV cache is just that: a cache.

The inference logic of an LLM remains the same. There is no difference in outcomes between recalculating everything and caching. The only difference is in the amount of memory and computation required.
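A sketch of that equivalence, assuming a hypothetical single attention layer over scalar embeddings (`project_key`, `project_value`, and the weights are made up). Both decoders produce identical outputs; they differ only in how many projections they compute.

```python
import math

def project_key(x):
    return 0.5 * x  # made-up key projection

def project_value(x):
    return x        # made-up value projection

def attend(query, keys, values):
    scores = [query * k for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

def decode_without_cache(xs):
    """Recompute keys and values for the whole prefix at every step."""
    outputs, projected_positions = [], 0
    for t in range(1, len(xs) + 1):
        prefix = xs[:t]
        keys = [project_key(x) for x in prefix]
        values = [project_value(x) for x in prefix]
        projected_positions += t
        outputs.append(prefix[-1] + attend(prefix[-1], keys, values))
    return outputs, projected_positions

def decode_with_cache(xs):
    """Project each position once and keep the result."""
    keys, values, outputs, projected_positions = [], [], [], 0
    for x in xs:
        keys.append(project_key(x))
        values.append(project_value(x))
        projected_positions += 1
        outputs.append(x + attend(x, keys, values))
    return outputs, projected_positions

tokens = [1.0, -0.5, 2.0, 0.3]
slow_outputs, slow_work = decode_without_cache(tokens)
fast_outputs, fast_work = decode_with_cache(tokens)
```

For four tokens the uncached decoder projects ten positions in total and the cached one projects four, with the same outputs either way.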

replies(1): >>44501203 #

13. andy12_ ◴[08 Jul 25 15:57 UTC] No.44501203{5}[source]▶

>>44498567 #

The same can be said about any recurrent network. To predict the token n+1 you could recalculate the hidden state up to token n, or reuse the hidden state of token n from the previous forward pass. The only difference is the amount of memory and computation.

The thing is that, fundamentally, an auto-regressive transformer is a model whose state grows linearly with each token without compression, which is what gives it (theoretical) perfect recall.
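The contrast can be sketched with two toy models, both hypothetical: a scalar recurrent cell (`rnn_step`, with a made-up weight) whose state stays one number, and a key/value list that gains one entry per token. The recurrent state can be reused or recomputed with the same result, as argued above, but only the growing cache still contains each token's contribution verbatim.

```python
import math

def rnn_step(hidden, x):
    """Made-up recurrent cell: squashes the whole history into one float."""
    return math.tanh(0.5 * hidden + x)

def rnn_state_from_scratch(xs):
    hidden = 0.0
    for x in xs:
        hidden = rnn_step(hidden, x)
    return hidden

def run(tokens):
    hidden = 0.0    # recurrent state: fixed size
    kv_cache = []   # transformer-style state: one entry per token
    rnn_sizes, cache_sizes, reuse_matches = [], [], []
    for t, x in enumerate(tokens, start=1):
        hidden = rnn_step(hidden, x)     # reuse the previous state
        kv_cache.append((0.5 * x, x))    # append, never compress
        rnn_sizes.append(1)
        cache_sizes.append(len(kv_cache))
        reuse_matches.append(hidden == rnn_state_from_scratch(tokens[:t]))
    return hidden, kv_cache, rnn_sizes, cache_sizes, reuse_matches

tokens = [0.2, -1.0, 0.7, 0.3]
hidden, kv_cache, rnn_sizes, cache_sizes, reuse_matches = run(tokens)
```

The recurrent state stays at size one while the cache grows by one entry per token, and the first token's value is still readable unchanged from `kv_cache[0]`.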
