A non-anthropomorphized view of LLMs

(addxorrol.blogspot.com)
475 points by zdw | 83 comments
1. barrkel ◴[] No.44485012[source]
The problem with viewing LLMs as just sequence generators, and misbehaviour as bad sequences, is that it simplifies too much. LLMs have hidden state not necessarily directly reflected in the tokens being produced, and it is possible for LLMs to output tokens in opposition to this hidden state to achieve longer-term outcomes (or predictions, if you prefer).

Is it too anthropomorphic to say that this is a lie? To say that the hidden state and its long term predictions amount to a kind of goal? Maybe it is. But we then need a bunch of new words which have almost 1:1 correspondence to concepts from human agency and behavior to describe the processes that LLMs simulate to minimize prediction loss.

Reasoning by analogy is always shaky, so coining those new words probably wouldn't be such a bad idea. But the result would amount to impenetrable jargon, and it would be an uphill struggle to promulgate.

Instead, we use the anthropomorphic terminology, and then find ways to classify LLM behavior in human concept space. They are very defective humans, so it's still a bit misleading, but at least jargon is reduced.

replies(7): >>44485190 #>>44485198 #>>44485223 #>>44486284 #>>44487390 #>>44489939 #>>44490075 #
2. gugagore ◴[] No.44485190[source]
I'm not sure what you mean by "hidden state". If you set aside chain of thought, memories, system prompts, etc. and the interfaces that don't show them, there is no hidden state.

These LLMs are almost always, to my knowledge, autoregressive models, not recurrent models (Mamba is a notable exception).
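
To make the distinction I'm drawing concrete, here's a toy numpy sketch (invented weights, not any real architecture): the recurrent version carries a hidden vector between steps; the autoregressive version recomputes everything from the visible tokens at each step.

    import numpy as np

    rng = np.random.default_rng(0)
    V, D = 10, 4                        # toy vocab size and hidden size
    E = rng.normal(size=(V, D))         # embedding table
    W = rng.normal(size=(D, D))         # mixing / recurrence weights
    U = rng.normal(size=(D, V))         # projection back to the vocabulary

    def generate_recurrent(first_token, n):
        """RNN-style: a hidden vector h is carried from step to step."""
        h, tok, out = np.zeros(D), first_token, []
        for _ in range(n):
            h = np.tanh(h @ W + E[tok])     # h persists between tokens: that is hidden state
            tok = int(np.argmax(h @ U))
            out.append(tok)
        return out

    def generate_autoregressive(prompt, n):
        """Transformer-style: each step is recomputed from the visible tokens alone."""
        toks = list(prompt)
        for _ in range(n):
            x = np.tanh(E[toks] @ W)        # stand-in for the attention/MLP stack
            logits = x[-1] @ U              # only the last position picks the next token
            toks.append(int(np.argmax(logits)))
        return toks[len(prompt):]           # nothing besides the tokens survives a step

    print(generate_recurrent(3, 5), generate_autoregressive([3], 5))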

replies(3): >>44485271 #>>44485298 #>>44485311 #
3. cmiles74 ◴[] No.44485198[source]
IMHO, anthropomorphization of LLMs is happening because it's perceived as good marketing by the big corporate vendors.

People are excited about the technology and it's easy to use the terminology the vendor is using. At that point I think it gets kind of self-fulfilling. Kind of like the meme about how to pronounce GIF.

replies(6): >>44485304 #>>44485383 #>>44486029 #>>44486290 #>>44487414 #>>44487524 #
4. d3m0t3p ◴[] No.44485223[source]
Do they? An LLM embeds the token sequence from N^L into R^{LxD}, applies attention (the output is still in R^{LxD}), and then projects onto the vocabulary to get R^{LxV}, i.e. a likelihood over the vocabulary for each token. The attention can be multi-head (or whatever variant is fancy right now: GQA, MLA) and therefore give multiple representations, but each one is always tied to a token. I would argue that there is no hidden state independent of a token.

Whereas an LSTM, or a structured state-space model for example, has a state that is updated and not tied to a specific item in the sequence.

I would also argue that the article's text is easily understandable except for the function notation; explaining that you can compute a probability based on the previous words is something everyone can follow without having to resort to anthropomorphic terminology.
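
A shape-only numpy sketch of that flow (random numbers, nothing model-specific), just to make the dimensions explicit:

    import numpy as np

    L, D, V = 6, 8, 50                   # toy sequence length, model dim, vocab size
    rng = np.random.default_rng(0)
    tokens = rng.integers(0, V, size=L)  # an element of N^L

    E = rng.normal(size=(V, D))          # embedding: N^L -> R^{LxD}
    x = E[tokens]                        # shape (L, D)

    # stand-in for an attention block: rows are mixed, but the output stays one row per token
    A = rng.normal(size=(L, L))
    A = np.exp(A) / np.exp(A).sum(axis=-1, keepdims=True)
    x = A @ x                            # still (L, D)

    W_out = rng.normal(size=(D, V))      # projection onto the vocabulary
    logits = x @ W_out                   # (L, V): a likelihood over the vocab for each token
    assert logits.shape == (L, V)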

replies(1): >>44485294 #
5. barrkel ◴[] No.44485271[source]
Hidden state in the form of attention-head activations, intermediate layer activations and so on. Logically, in autoregression these are recalculated every time you run the sequence to predict the next token. The point is, the entire NN state isn't output for each token. There is lots of hidden state that goes into selecting that token, and the token isn't a full representation of that information.
replies(2): >>44485334 #>>44485360 #
6. barrkel ◴[] No.44485294[source]
There is hidden state as plain as day merely in the fact that logits for token prediction exist. The selected token doesn't give you information about how probable other tokens were. That information, that state which is recalculated in autoregression, is hidden. It's not exposed. You can't see it in the text produced by the model.

There is plenty of state not visible when an LLM starts a sentence that only becomes somewhat visible when it completes the sentence. The LLM has a plan, if you will, for how the sentence might end, and you don't get to see an instance of that plan unless you run autoregression far enough to get those tokens.

Similarly, it has a plan for paragraphs, for whole responses, for interactive dialogues, plans that include likely responses by the user.
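
To make the logits point concrete, a tiny sketch with made-up numbers: the full distribution exists at every step, but only the sampled index ever shows up in the text.

    import numpy as np

    # Invented logits for one next-token step, e.g. over ["dog", "cat", "the", "a"]
    logits = np.array([2.1, 1.9, 0.3, -1.0])
    probs = np.exp(logits) / np.exp(logits).sum()   # the full distribution

    rng = np.random.default_rng(0)
    chosen = rng.choice(len(probs), p=probs)        # the one token that reaches the text

    print(probs)    # roughly [0.49, 0.40, 0.08, 0.02] -- none of this is visible in the output
    print(chosen)   # only this index survives into the visible sequence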

replies(2): >>44485385 #>>44485919 #
7. 8note ◴[] No.44485298[source]
Do LLMs consider future tokens when making next-token predictions?

E.g. pick 'the' as the next token because there's a strong probability of 'planet' as the token after?

Is it only past state that influences the choice of 'the'? Or is the model predicting many tokens in advance and only returning the one in the output?

If it does predict many, I'd consider that state hidden in the model weights.

replies(2): >>44485341 #>>44485956 #
8. Angostura ◴[] No.44485304[source]
IMHO it happens for the same reason we see shapes in clouds. The human mind, through millions of years, has evolved to equate and conflate the ability to generate cogent verbal or written output with intelligence. It's an instinct to equate the two, and an extraordinarily difficult instinct to break. LLMs are optimised for the one job that will make us confuse them for being intelligent.
replies(2): >>44485539 #>>44494579 #
9. halJordan ◴[] No.44485311[source]
If you don't know, that's not necessarily anyone's fault, but why are you dunking into the conversation? The hidden state is a foundational part of a transformer's implementation. And since we're not allowed to use metaphors because they're too anthropomorphic, you're just going to have to go learn the math.
replies(3): >>44485442 #>>44485457 #>>44485542 #
10. gugagore ◴[] No.44485334{3}[source]
That's not what "state" means, typically. The "state of mind" you're in affects the words you say in response to something.

Intermediate activations isn't "state". The tokens that have already been generated, along with the fixed weights, is the only data that affects the next tokens.

replies(2): >>44485915 #>>44488490 #
11. patcon ◴[] No.44485341{3}[source]
I think recent Anthropic work showed that they "plan" future tokens in advance in an emergent way:

https://www.anthropic.com/research/tracing-thoughts-language...

replies(1): >>44485455 #
12. brookst ◴[] No.44485360{3}[source]
State typically means something carried between interactions. By this definition a simple for loop has “hidden state” in its counter.
replies(1): >>44485945 #
13. brookst ◴[] No.44485383[source]
Nobody cares about what’s perceived as good marketing. People care about what resonates with the target market.

But yes, anthropomorphising LLMs is inevitable because they feel like an entity. People treat stuffed animals like creatures with feelings and personality; LLMs are far closer than that.

replies(3): >>44485423 #>>44485584 #>>44485837 #
14. 8note ◴[] No.44485385{3}[source]
This sounds like a fun research area. Do LLMs have plans about future tokens?

How do we get 100 tokens of completion, and not just one output layer at a time?

Are there papers you've read that you can share that support the hypothesis? Versus the hypothesis that the LLM doesn't have ideas about future tokens when it's predicting the next one?

replies(2): >>44485495 #>>44485505 #
15. cmiles74 ◴[] No.44485423{3}[source]
Alright, let’s agree that good marketing resonates with the target market. ;-)
replies(1): >>44485456 #
16. markerz ◴[] No.44485442{3}[source]
I don't think your response is very productive, and I find that my understanding of LLMs aligns with the person you're calling out. We could both be wrong, but I'm grateful that someone else spoke up to say that it doesn't match their mental model, and we would all love to learn a more correct way of thinking about LLMs.

Telling us to just go and learn the math is a little hurtful and doesn't really get me any closer to learning the math. It comes across as gatekeeping.

17. 8note ◴[] No.44485455{4}[source]
oo thanks!
18. brookst ◴[] No.44485456{4}[source]
I 1000% agree. It’s a vicious, evolutionary, and self-selecting process.

It takes great marketing to actually have any character and intent at all.

19. tbrownaw ◴[] No.44485457{3}[source]
The comment you are replying to is not claiming ignorance of how models work. It is saying that the author does know how they work, and they do not contain anything that can properly be described as "hidden state". The claimed confusion is over how the term "hidden state" is being used, on the basis that it is not being used correctly.
20. Zee2 ◴[] No.44485495{4}[source]
This research has been done, it was a core pillar of the recent Anthropic paper on token planning and interpretability.

https://www.anthropic.com/research/tracing-thoughts-language...

See the section "Does Claude plan its rhymes?".

21. XenophileJKO ◴[] No.44485505{4}[source]
Lol... Try building systems off them and you will very quickly learn concretely that they "plan".

It may not be as evident now as it was with earlier models. The models will fabricate the preconditions needed to output the final answer they "wanted".

I ran into this when using quasi least-to-most style structured output.

22. gugagore ◴[] No.44485542{3}[source]
Do you appreciate a difference between an autoregressive model and a recurrent model?

The "transformer" part isn't under question. It's the "hidden state" part.

23. DrillShopper ◴[] No.44485584{3}[source]
> People treat stuffed animals like creatures with feelings and personality; LLMs are far closer than that.

Children do, sometimes, but it's a huge sign of immaturity when adults, let alone tech workers, do it.

I had a professor at university who would yell at us if/when we personified/anthropomorphized the tech, and I have the same urge when people ask me "What does <insert LLM name here> think?".

24. roywiggins ◴[] No.44485837{3}[source]
The chat interface was a choice, though a natural one. Before they'd RLHFed it into chatting, when it was just GPT-3 offering completions, 1) not very many people used it and 2) it was harder to anthropomorphize.
25. NiloCK ◴[] No.44485915{4}[source]
Plus a randomness seed.

The 'hidden state' being referred to here is essentially the "what might have been" had the dice rolls gone differently (e.g., been seeded differently).

replies(1): >>44488536 #
26. gpm ◴[] No.44485919{3}[source]
The LLM does not "have" a plan.

Arguably there's reason to believe it comes up with a plan when it is computing token probabilities, but it does not store that plan between tokens. I.e. it doesn't possess or "have" it. It simply comes up with a plan, emits a token, and entirely throws away all its intermediate thoughts (including any plan), starting again from scratch on the next token.

replies(4): >>44485976 #>>44486317 #>>44488268 #>>44488470 #
27. ChadNauseam ◴[] No.44485945{4}[source]
Hidden layer is a term of art in machine learning / neural network research. See https://en.wikipedia.org/wiki/Hidden_layer . Somehow this term mutated into "hidden state", which in informal contexts does seem to be used quite often the way the grandparent comment used it.
replies(1): >>44486332 #
28. NiloCK ◴[] No.44485956{3}[source]
The most obvious case of this is in terms of `an apple` vs `a pear`. LLMs never get the a-an distinction wrong, because their internal state 'knows' the word that'll come next.
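
A back-of-the-envelope sketch (invented probabilities): to score "an" highly, the model already has to be putting most of its next-noun mass on vowel-initial words.

    # Made-up numbers: the article's probability is effectively a marginal over the
    # nouns the model is likely to emit next, so getting a/an right implies
    # something about what follows.
    continuations = {"apple": 0.6, "pear": 0.3, "orange": 0.1}

    p_an = sum(p for word, p in continuations.items() if word[0] in "aeiou")
    p_a = 1.0 - p_an
    print({"an": round(p_an, 2), "a": round(p_a, 2)})   # {'an': 0.7, 'a': 0.3}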
replies(1): >>44486790 #
29. NiloCK ◴[] No.44485976{4}[source]
I don't think that the comment above you made any suggestion that the plan is persisted between token generations. I'm pretty sure you described exactly what they intended.
replies(2): >>44486020 #>>44488767 #
30. gpm ◴[] No.44486020{5}[source]
I agree. I'm suggesting that the language they are using is unintentionally misleading, not that they are factually wrong.
31. sothatsit ◴[] No.44486029[source]
I think anthropomorphizing LLMs is useful, not just a marketing tactic. A lot of intuitions about how humans think map pretty well to LLMs, and it is much easier to build intuitions about how LLMs work by building upon our intuitions about how humans think than by trying to build your intuitions from scratch.

Would this question be clear for a human? If so, it is probably clear for an LLM. Did I provide enough context for a human to diagnose the problem? Then an LLM will probably have a better chance of diagnosing the problem. Would a human find the structure of this document confusing? An LLM would likely perform poorly when reading it as well.

Re-applying human intuitions to LLMs is a good starting point to gaining intuition about how to work with LLMs. Conversely, understanding sequences of tokens and probability spaces doesn't give you much intuition about how you should phrase questions to get good responses from LLMs. The technical reality doesn't explain the emergent behaviour very well.

I don't think this is mutually exclusive with what the author is talking about either. There are some ways that people think about LLMs where I think the anthropomorphization really breaks down. I think the author says it nicely:

> The moment that people ascribe properties such as "consciousness" or "ethics" or "values" or "morals" to these learnt mappings is where I tend to get lost.

replies(2): >>44487443 #>>44494411 #
32. positron26 ◴[] No.44486284[source]
> Is it too anthropomorphic to say that this is a lie?

Yes. Current LLMs can only introspect from output tokens. You need hidden reasoning that is within the black box, self-knowing, intent, and motive to lie.

I rather think accusing an LLM of lying is like accusing a mousetrap of being a murderer.

When models have online learning, complex internal states, and reflection, I might consider one to have consciousness and to be capable of lying. It will need to manifest behaviors that can only emerge from the properties I listed.

I've seen similar arguments where people assert that LLMs cannot "grasp" what they are talking about. I strongly suspect a high degree of overlap between those willing to anthropomorphize error bars as lies and those declining to award LLMs "grasping". Which is it? Can it think or can it not? (Objectively, SoTA models today cannot yet.) The willingness to waffle and pivot around whichever perspective damns the machine completely betrays the lack of honesty in such conversations.

replies(1): >>44486303 #
33. positron26 ◴[] No.44486290[source]
> because it's perceived as good marketing

We are making user interfaces. Good user interfaces are intuitive and purport to be things that users are familiar with, such as people. Any alternative explanation of such a versatile interface will be met with blank stares. Users with no technical expertise would come to their own conclusions, helped in no way by telling the user not to treat the chat bot as a chat bot.

34. lostmsu ◴[] No.44486303[source]
> Current LLMs can only introspect from output tokens

The only interpretation of this statement I can come up with is plain wrong. There's no reason an LLM shouldn't be able to introspect without any output tokens. As the GP correctly says, most of the processing in LLMs happens over hidden states. Output tokens are just an artefact for our convenience, which also happens to be the way the hidden-state processing is trained.

replies(3): >>44486324 #>>44487399 #>>44487619 #
35. lostmsu ◴[] No.44486317{4}[source]
This is wrong, intermediate activations are preserved when going forward.
replies(1): >>44488134 #
36. positron26 ◴[] No.44486324{3}[source]
There are no recurrent paths besides tokens. How may I introspect something if it is not an input? I may not.
replies(3): >>44487610 #>>44488622 #>>44488738 #
37. lostmsu ◴[] No.44486332{5}[source]
It makes sense in the LLM context because the processing of these is time-sequential in the LLM's internal time.
38. 3eb7988a1663 ◴[] No.44486790{4}[source]
If I give an LLM a fragment of text that starts with, "The fruit they ate was an <TOKEN>", regardless of any plan, the grammatically correct answer is going to force a noun starting with a vowel. How do you disentangle the grammar from planning?

Going to be a lot more "an apple" in the corpus than "an pear"

39. viccis ◴[] No.44487390[source]
I think that the hidden state is really just at work improving the model's estimation of the joint probability over tokens. And the assumption here, which failed miserably in the early 20th century in the work of the logical positivists, is that if you can so expertly estimate that joint probability of language, then you will be able to understand "knowledge." But there's no well-grounded reason to believe that, and plenty of reasons (see: the downfall of logical positivism) to think that language is an imperfect representation of knowledge. In other words, what humans do when we think is more complicated than just learning semiotic patterns and regurgitating them. Philosophical skeptics like Hume thought so, but most epistemology writing after that had better answers for how we know things.
replies(1): >>44488934 #
40. delusional ◴[] No.44487399{3}[source]
> Output tokens are just an artefact for our convenience

That's nonsense. The hidden layers are specifically constructed to increase the probability that the model picks the right next word. Without the output/token generation stage the hidden layers are meaningless. Just empty noise.

It is fundamentally an algorithm for generating text. If you take the text away it's just a bunch of fmadds. A mute person can still think, an LLM without output tokens can do nothing.

replies(1): >>44503614 #
41. mikojan ◴[] No.44487414[source]
True, but researchers also want to believe they are studying intelligence, not just some approximation to it.
42. otabdeveloper4 ◴[] No.44487443{3}[source]
You think it's useful because Big Corp sold you that lie.

Wait till the disillusionment sets in.

replies(1): >>44488342 #
43. Marazan ◴[] No.44487524[source]
Anthropomorphisation happens because humans are absolutely terrible at evaluating systems that give conversational text output.

ELIZA fooled many people into thinking it was conscious, and it wasn't even trying to do that.

44. throw310822 ◴[] No.44487610{4}[source]
Introspection doesn't have to be recurrent. It can happen during the generation of a single token.
45. Marazan ◴[] No.44487619{3}[source]
"Hidden layers" are not "hidden state".

Saying so is just unbelievably confusing.

46. ACCount36 ◴[] No.44488134{5}[source]
Within a single forward pass, but not from one emitted token to another.
replies(1): >>44490852 #
47. yorwba ◴[] No.44488268{4}[source]
It's true that the last layer's output for a given input token only affects the corresponding output token and is discarded afterwards. But the penultimate layer's output affects the computation of the last layer for all future tokens, so it is not discarded, but stored (in the KV cache). Similarly for the antepenultimate layer affecting the penultimate layer and so on.

So there's plenty of space in intermediate layers to store a plan between tokens without starting from scratch every time.
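
A stripped-down, single-head sketch of that mechanism (random weights, not any particular implementation): the K/V vectors computed at earlier positions are kept and read by later positions.

    import numpy as np

    rng = np.random.default_rng(0)
    D = 8
    Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))
    kv_cache = {"K": [], "V": []}            # grows by one entry per processed token

    def attend(h_new):
        """One attention step for the newest position, reusing cached K/V from earlier tokens."""
        kv_cache["K"].append(h_new @ Wk)     # this position's K/V stick around...
        kv_cache["V"].append(h_new @ Wv)
        K = np.stack(kv_cache["K"])
        V = np.stack(kv_cache["V"])
        q = h_new @ Wq
        w = np.exp(q @ K.T / np.sqrt(D))
        w = w / w.sum()
        return w @ V                         # ...so every later token can read them

    for _ in range(5):
        out = attend(rng.normal(size=D))
    print(len(kv_cache["K"]))                # 5: intermediate representations persist across tokens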

48. sothatsit ◴[] No.44488342{4}[source]
No, I think it's useful because it is useful, and I've made use of it a number of times.
49. barrkel ◴[] No.44488470{4}[source]
I believe saying the LLM has a plan is a useful anthropomorphism for the fact that it does have hidden state that predicts future tokens, and this state conditions the tokens it produces earlier in the stream.
replies(2): >>44490837 #>>44492198 #
50. barrkel ◴[] No.44488490{4}[source]
Sure it's state. It logically evolves stepwise per token generation. It encapsulates the LLM's understanding of the text so far so it can predict the next token. That it is merely a fixed function of other data isn't interesting or useful to say.

All deterministic programs are fixed functions of program code, inputs and computation steps, but we don't say that they don't have state. It's not a useful distinction for communicating among humans.

replies(1): >>44488841 #
51. barrkel ◴[] No.44488536{5}[source]
No, that's not quite what I mean. I used the logits in another reply to point out that there is data specific to the generation process that is not available from the tokens, but there's also the network activations adding up to that state.

Processing tokens is a bit like ticks in a CPU, where the model weights are the program code, and tokens are both input and output. The computation that occurs logically retains concepts and plans over multiple token generation steps.

That it is fully deterministic is no more interesting than saying a variable in a single threaded program is not state because you can recompute its value by replaying the program with the same inputs. It seems to me that this uninteresting distinction is the GP's issue.

52. barrkel ◴[] No.44488622{4}[source]
The recurrence comes from replaying tokens during autoregression.

It's as if you have a variable in a deterministic programming language, only you have to replay the entire history of the program's computation and input to get the next state of the machine (program counter + memory + registers).

Producing a token for an LLM is analogous to a tick of the clock for a CPU. It's the crank handle that drives the process.

53. hackinthebochs ◴[] No.44488738{4}[source]
Important attention heads or layers within an LLM can be repeated, giving you an "unrolled" recursion.
replies(1): >>44488792 #
54. gugagore ◴[] No.44488767{5}[source]
The concept of "state" conveys two related ideas.

- the sufficient amount of information to do evolution of the system. The state of a pendulum is its position and velocity (or momentum). If you take a single picture of a pendulum, you do not have a representation that lets you make predictions.

- information that is persisted through time. A stateful protocol is one where you need to know the history of the messages to understand what will happen next. (Or, analytically, it's enough to keep track of the sufficient state.) A procedure with some hidden state isn't a pure function. You can make it a pure function by making the state explicit.
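
For the pendulum, a minimal sketch of the first sense of state (angle plus angular velocity is enough to step the system forward; a snapshot of the angle alone is not):

    import math

    def step(theta, omega, dt=0.01, g=9.81, length=1.0):
        """(theta, omega) is sufficient state: given both, the future evolution follows."""
        omega += -(g / length) * math.sin(theta) * dt
        theta += omega * dt
        return theta, omega

    theta, omega = 0.5, 0.0      # a single photo (theta alone) would not let you predict
    for _ in range(1000):
        theta, omega = step(theta, omega)
    print(theta, omega)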

55. positron26 ◴[] No.44488792{5}[source]
An unrolled loop in a feed-forward network is still just that. The computation is a DAG.
replies(1): >>44488860 #
56. gugagore ◴[] No.44488841{5}[source]
I'll say it once more: I think it is useful to distinguish between autoregressive and recurrent architectures. A clear way to make that distinction is to agree that the recurrent architecture has hidden state, while the autoregressive one does not. A recurrent model has some point in a space that "encapsulates its understanding". This space is "hidden" in the sense that it doesn't correspond to text tokens or any other output. This space is "state" in the sense that it is sufficient to summarize the history of the inputs for the sake of predicting the next output.

When you use "hidden state" the way you are using it, I am left wondering how you make a distinction between autoregressive and recurrent architectures.

replies(2): >>44488974 #>>44489098 #
57. hackinthebochs ◴[] No.44488860{6}[source]
But the function of an unrolled recursion is the same as a recursive function with bounded depth, as long as the number of unrolled steps matches. The point is that whatever function recursion is supposed to provide can plausibly be present in LLMs.
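
A trivial sketch of that equivalence (toy function, nothing LLM-specific):

    def layer(x):
        """Stand-in for one repeated block."""
        return 2 * x + 1

    def recursive(x, depth):
        return x if depth == 0 else recursive(layer(x), depth - 1)

    def unrolled(x):
        return layer(layer(layer(x)))      # three copies "unrolled" into the network

    assert recursive(5, 3) == unrolled(5)  # identical as long as the depths match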
replies(1): >>44489307 #
58. FeepingCreature ◴[] No.44488934[source]
There are many theories that are true but not trivially true. That is, they take a statement that seems true and derive from it a very simple model, which is then often disproven. In those cases however, just because the trivial model was disproven doesn't mean the theory was, though it may lose some of its luster by requiring more complexity.
59. FeepingCreature ◴[] No.44488974{6}[source]
The words "hidden" and "state" have commonsense meanings. If recurrent architectures want a term for their particular way of storing hidden state they can make up one that isn't ambiguous imo.

"Transformers do not have hidden state" is, as we can clearly see from this thread, far more misleading than the opposite.

60. gugagore ◴[] No.44489098{6}[source]
I'll also point out what is most important part from your original message:

> LLMs have hidden state not necessarily directly reflected in the tokens being produced, and it is possible for LLMs to output tokens in opposition to this hidden state to achieve longer-term outcomes (or predictions, if you prefer).

But what does it mean for an LLM to output a token in opposition to its hidden state? If there's a longer-term goal, it either needs to be verbalized in the output stream, or somehow reconstructed from the prompt on each token.

There’s some work (a link would be great) that disentangles whether chain-of-thought helps because it gives the model more FLOPs to process, or because it makes its subgoals explicit—e.g., by outputting “Okay, let’s reason through this step by step...” versus just "...." What they find is that even placeholder tokens like "..." can help.

That seems to imply some notion of evolving hidden state! I see how that comes in!

But crucially, in autoregressive models, this state isn’t persisted across time. Each token is generated afresh, based only on the visible history. The model’s internal (hidden) layers are certainly rich and structured and "non verbal".

But any nefarious intention or conclusion has to be arrived at on every forward pass.

replies(2): >>44489774 #>>44504493 #
61. positron26 ◴[] No.44489307{7}[source]
And then during the next token, all of that bounded depth is thrown away except for the token of output.

You're fixating on the pseudo-computation within a single token pass. This is very limited compared to actual hidden state retention and the introspection that would enable if we knew how to train it and do online learning already.

The "reasoning" hack would not be a realistic implementation choice if the models had hidden state and could ruminate on it without showing us output.

replies(1): >>44489453 #
62. hackinthebochs ◴[] No.44489453{8}[source]
Sure. But notice "ruminate" is different than introspect, which was what your original comment was about.
63. inciampati ◴[] No.44489774{7}[source]
You're correct, the distinction matters. Autoregressive models have no hidden state between tokens, just the visible sequence. Every forward pass starts fresh from the tokens alone. But that's precisely why they need chain-of-thought: they're using the output sequence itself as their working memory. It's computationally universal but absurdly inefficient, like having amnesia between every word and needing to re-read everything you've written. https://thinks.lol/2025/01/memory-makes-computation-universa...
64. derbOac ◴[] No.44489939[source]
Maybe it's just because so much of my work for so long has focused on models with hidden states, but this is a fairly classical feature of some statistical models. One of the widely used LLM textbooks even started with latent variable models; LLMs are just latent variable models, but on a totally different scale, both in terms of the number of parameters and in model complexity. The scale is apparently important, but seeing them as another type of latent variable model sort of dehumanizes them for me.

Latent variable or hidden state models have their own history of being seen as spooky or mysterious though; in some ways the way LLMs are anthropomorphized is an extension of that.

I guess I don't have a problem with anthropomorphizing LLMs at some level, because some features of them find natural analogies in cognitive science and other areas of psychology, and abstraction is useful or even necessary in communicating and modeling complex systems. However, I do think anthropomorphizing leads to a lot of hype and tends to implicitly shut down thinking of them mechanistically, as a mathematical object that can be probed and characterized — it can lead to a kind of "ghost in the machine" discourse and an exaggeration of their utility, even if it is impressive at times.

65. tdullien ◴[] No.44490075[source]
Author of the original article here. What hidden state are you referring to? For most LLMs the context is the state, and there is no "hidden" state. Could you explain what you mean? (Apologies if I can't see it directly)
replies(3): >>44490361 #>>44496337 #>>44504559 #
66. lukeschlather ◴[] No.44490361[source]
Yes, strictly speaking, the model itself is stateless, but there are 600B parameters of state machine for frontier models that define which token to pick next. And that state machine is both incomprehensibly large and also of a similar magnitude in size to a human brain. (Probably, I'll grant it's possible it's smaller, but it's still quite large.)

I think my issue with the "don't anthropomorphize" is that it's unclear to me that the main difference between a human and an LLM isn't simply the inability for the LLM to rewrite its own model weights on the fly. (And I say "simply" but there's obviously nothing simple about it, and it might be possible already with current hardware, we just don't know how to do it.)

Even if we decide it is clearly different, this is still an incredibly large and dynamic system. "Stateless" or not, there's an incredible amount of state that is not comprehensible to me.

replies(3): >>44490546 #>>44490762 #>>44491161 #
67. tdullien ◴[] No.44490546{3}[source]
Fair, there is a lot that is incomprehensible to all of us. I wouldn't call it "state" as it's fixed, but that is a rather subtle point.

That said, would you anthropomorphize a meteorological simulation just because it contains lots and lots of constants that you don't understand well?

I'm pretty sure that recurrent dynamical systems pretty quickly become universal computers, but we are treating those that generate human language differently from others, and I don't quite see the difference.

replies(1): >>44495813 #
68. jazzyjackson ◴[] No.44490762{3}[source]
FWIW the number of parameters in an LLM is in the same ballpark as the number of neurons in a human brain (roughly 80B), but neurons are not weights; they are kind of a neural net unto themselves: stateful, adaptive, self-modifying, with a good variety of neurotransmitters (and their chemical analogs) aside from just voltage.

It's fun to think about just how fantastic a brain is, and how much wattage and data-center scale we're throwing around trying to approximate its behavior. Mega-efficient and mega-dense. I'm bearish on AGI simply from an internetworking standpoint: the speed of light is hard to beat, and until you can fit 80 billion interconnected cores in half a cubic foot you're just not going to get close to the responsiveness of reacting to the world in real time as biology manages to do. But that's a whole other matter. I just wanted to pick apart that magnitude of parameters is not an altogether meaningful comparison :)

69. ◴[] No.44490837{5}[source]
70. andy12_ ◴[] No.44490852{6}[source]
What? No. The intermediate hidden states are preserved from one token to another. A token that is 100k tokens into the future will be able to look into the information of the present token's hidden state through the attention mechanism. This is why the KV cache is so big.
replies(1): >>44498567 #
71. jibal ◴[] No.44491161{3}[source]
> it's unclear to me that the main difference between a human and an LLM isn't simply the inability for the LLM to rewrite its own model weights on the fly.

This is "simply" an acknowledgement of extreme ignorance of how human brains work.

72. godshatter ◴[] No.44492198{5}[source]
Are the devs behind the models adding their own state somehow? Do they have code that figures out a plan and use the LLM on pieces of it and stitch them together? If they do, then there is a plan, it's just not output from a magical black box. Unless they are using a neural net to figure out what the plan should be first, I guess.

I know nothing about how things work at that level, so these might not even be reasonable questions.

73. cmiles74 ◴[] No.44494411{3}[source]
Take a look at the judge’s ruling in this Anthropic case:

https://news.ycombinator.com/item?id=44488331

Here’s a quote from the ruling:

“First, Authors argue that using works to train Claude’s underlying LLMs was like using works to train any person to read and write, so Authors should be able to exclude Anthropic from this use (Opp. 16). But Authors cannot rightly exclude anyone from using their works for training or learning as such. Everyone reads texts, too, then writes new texts. They may need to pay for getting their hands on a text in the first instance. But to make anyone pay specifically for the use of a book each time they read it, each time they recall it from memory, each time they later draw upon it when writing new things in new ways would be unthinkable. For centuries, we have read and re-read books. We have admired, memorized, and internalized their sweeping themes, their substantive points, and their stylistic solutions to recurring writing problems.”

They literally compare an LLM learning to a person learning and conflate the two. Anthropic will likely win this case because of this anthropomorphization.

replies(1): >>44496167 #
74. ◴[] No.44494579{3}[source]
75. lukeschlather ◴[] No.44495813{4}[source]
Meteorological simulations don't contain detailed state machines that are intended to encode how a human would behave in a specific situation.

And if it were just language, I would say, sure maybe this is more limited. But it seems like tensors can do a lot more than that. Poorly, but that may primarily be a hardware limitation. It also might be something about the way they work, but not something terribly different from what they are doing.

Also, I might talk about a meteorological simulation in terms of whatever it was intended to simulate.

76. sothatsit ◴[] No.44496167{4}[source]
> First, Authors argue that using works to train Claude’s underlying LLMs was like using works to train any person to read and write, so Authors should be able to exclude Anthropic from this use (Opp. 16).

It sounds like the Authors were the ones who brought this argument, not Anthropic? In which case, it seems like a big blunder on their part.

77. BoorishBears ◴[] No.44496337[source]
You wrote this article and you're not familiar with hidden states?
replies(1): >>44497665 #
78. tdullien ◴[] No.44497665{3}[source]
I am not aware that an LLM contains any.
79. ACCount36 ◴[] No.44498567{7}[source]
KV cache is just that: a cache.

The inference logic of an LLM remains the same. There is no difference in outcomes between recalculating everything and caching. The only difference is in the amount of memory and computation required to do it.

replies(1): >>44501203 #
80. andy12_ ◴[] No.44501203{8}[source]
The same can be said about any recurrent network. To predict the token n+1 you could recalculate the hidden state up to token n, or reuse the hidden state of token n from the previous forward pass. The only difference is the amount of memory and computation.

The thing is that, fundamentally, an auto-regressive transformer is a model whose state grows linearly with each token without compression, which is what bestows them with (theoretical) perfect recall.

81. Tarq0n ◴[] No.44503614{4}[source]
I think that's almost completely backwards. The input and output layers just convert between natural language and embeddings, i.e. they shift the format of the language. But operating on the embeddings is where meaning (locations in vector space) is transformed.
82. barrkel ◴[] No.44504493{7}[source]
The LLM can predict that it may lie, and when it sees tokens which are contrary to some correspondence with reality as it "understands" it, it may predict that the lie continues. It doesn't necessarily need to predict that it will reveal the lie. You can, after all, stop autoregressively producing tokens at any point, and the LLM may elect to produce an end-of-sequence token without revealing the lie.

Goals, such as they are, are essentially programs, or simulations, the LLM runs that help it predict (generate) future tokens.

Anyway, the whole original article is a rejection of anthropomorphism. I think the anthropomorphism is useful, but you still need to think of LLMs as deeply defective minds. And I totally reject the idea that they have intrinsic moral weight or consciousness or anything close to that.

83. barrkel ◴[] No.44504559[source]
Yes, the context (along with the model weights) is the source data from which the hidden state is calculated, in an analogous way to how inputs and CPU ticks (along with program code) are how variables in a deterministic program get their values.

There's loads of state in the LLM that doesn't come out in the tokens it selects. The tokens are just the very top layer, and even then, you get to see just one selection from the possible tokens.

If you wish to anthropomorphize, that state - the set of activations, all the calculations that add up to the logits that determine the probability of the token to select, the whole lot of it - is what the model is "thinking". But all you get to see is one selected token.

Then, during autoregression, we run the program again, but one more tick of the CPU clock. Variables get updated a bit more. The chosen token from the previous pass conditions the next token prediction - the hidden state evolves its thinking one more step.

If you just look at the tokens being selected, you're missing this machinery. And the machinery is there. It's a program being ticked by generating tokens autoregressively. It has state which doesn't directly show up in tokens, it just informs which tokens to select. And the tokens it selects don't necessarily reflect the correspondences with perceived reality that the model is maintaining in that state. That's what I meant by talking about a lie.

We need a vocabulary to talk about this machinery. The machinery is learned, and it runs programs, effectively, that help the LLM reduce loss when predicting tokens. Since the tokens it's predicting come from human minds, the programs it's running are (broken, lossy, not very good) simulations of processes that seem to run inside human minds.

The simulations are pretty decent for producing grammatically correct text, for emulating tone and style, and so on. They're okay-ish for representing concepts. They're poor for representing very specific facts. But the overall point is they are simulations, and they have some analogous correspondence with human behavior, such that words we use to describe human behaviour are useful and practical.

They're not true, I'm not claiming that. But they're useful for talking about these weird defective minds we call LLMs.