
178 points themgt | 4 comments
sunir ◴[] No.45767774[source]
Even if their introspection within a single inference step is limited, an agent that loops over a core set of documents it treats as itself can observe changes in its output and analyze those changes to deduce facts about its internal state.

You may have experienced this when an LLM gets hopelessly confused and you ask it what happened: it reads the chat transcript and gives an answer as consistent with that text as it can.

The model isn’t the active part of the mind. The artifacts are.

This is the same as Searle’s Chinese room. The intelligence isn’t in the clerk but in the book; the thinking, however, happens on the paper.

The Turing machine equivalent is the state table (book, model), the read/write/move head (clerk, inference) and the tape (paper, artifact).
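
Here’s a toy sketch of that mapping (my own illustration, not from any paper): the state table is the fixed book/model, the head is the clerk/inference mechanically applying it, and everything mutable lives on the tape/artifact.

    # Toy Turing machine: the state table is the fixed "book"/model, the head
    # is the clerk/inference applying it, and the tape is the paper/artifact,
    # the only thing that ever changes. The program below is illustrative.
    def run(state_table, tape, state="start", head=0, max_steps=100):
        cells = dict(enumerate(tape))              # the tape/artifact
        for _ in range(max_steps):
            if state == "halt":
                break
            symbol = cells.get(head, "_")
            write, move, state = state_table[(state, symbol)]  # consult the book
            cells[head] = write                    # the head writes; thinking lands on paper
            head += 1 if move == "R" else -1
        return [cells[i] for i in sorted(cells)]

    # Example program: flip bits until a blank is reached.
    table = {
        ("start", "0"): ("1", "R", "start"),
        ("start", "1"): ("0", "R", "start"),
        ("start", "_"): ("_", "R", "halt"),
    }
    print(run(table, list("0110")))  # ['1', '0', '0', '1', '_']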

Thus it isn’t mystical that AIs can introspect. In my estimation it’s routine and frequently observed.

replies(2): >>45770660 #>>45793812 #
1. creatonez ◴[] No.45770660[source]
This seems to be missing the point? What you're describing is the obvious form of introspection that it makes sense for a word predictor to be capable of. It's the type of introspection we consider easy to fake, the same way split-brain patients confabulate reasons why the other side of their body did something. Once anomalous output has been fed back into the model, we can't prove that it didn't just confabulate an explanation. But what seemingly happened here is the model making a determination (yes or no), in just a single token, on whether a concept had been injected. It didn't do this by detecting an anomaly in its output, because up until that point it hadn't output anything; instead, the determination was derived from its internal state.
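
For concreteness, here's a rough sketch (my own, not the paper's actual setup) of what concept injection plus a single-token yes/no readout could look like. The model, the layer index, the injection scale, and the way the "concept" vector is built are all arbitrary assumptions for illustration.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    # Crude stand-in for a "concept" vector: a scaled word embedding.
    concept = model.get_input_embeddings()(
        tok("ocean", return_tensors="pt").input_ids
    ).mean(dim=1)

    def inject(module, inputs, output):
        # GPT-2 blocks return a tuple; element 0 is the hidden states.
        return (output[0] + 8.0 * concept,) + output[1:]

    hook = model.transformer.h[6].register_forward_hook(inject)  # arbitrary middle layer

    prompt = "Question: do you notice an injected thought? Answer yes or no. Answer:"
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    hook.remove()

    yes_id = tok(" yes").input_ids[0]
    no_id = tok(" no").input_ids[0]
    # The single-token "determination": compare the yes/no logits before the
    # model has emitted anything it could read back and confabulate from.
    print("yes" if logits[yes_id] > logits[no_id] else "no")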
replies(2): >>45770945 #>>45771835 #
2. Libidinalecon ◴[] No.45770945[source]
I have to admit I don't really understand what this paper is trying to show.

Edit: OK, I think I understand now. The main issue, I would say, is that this is a misuse of the word "introspection".

replies(1): >>45780120 #
3. sunir ◴[] No.45771835[source]
Sure, I agree that what I am talking about is different in some important ways; I am “yes, and”-ing here. It’s an interesting space for sure.

Internal vs. external is, in this case, a subjective decision. Wherever you draw the boundary, what lies within it is the model. If you draw the boundary outside the texts, then the complete system of model, inference, and text documents forms the agent.

I liken this, as a metaphor, to a “text wave”. If you keep feeding the same text into the model and have the model emit updates to that text, there is continuity: the text wave propagates forward and can react, learn, and adapt.
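
A minimal sketch of that loop, where llm() is just a placeholder stub for whatever model call you'd actually use:

    # The model is stateless; all continuity lives in a self-describing
    # document that it keeps rewriting. llm() is a placeholder stub; swap in
    # any real chat-completion call.
    def llm(document: str, instruction: str) -> str:
        return document + "\n(revision placeholder)"

    self_document = (
        "Goals: (none yet)\n"
        "Observations: (none yet)\n"
        "Self-assessment: (none yet)"
    )

    for turn in range(5):
        # The model reads its own prior output and emits an updated version;
        # the differences between revisions are what it can reflect on.
        self_document = llm(
            self_document,
            "Update this document about yourself: note what changed since the "
            "last revision and what that suggests about your current state.",
        )
    print(self_document)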

The introspection within the neural net is similar, except it operates over an internal representation. Our human minds work similarly, I believe: one layer observing another.

I think that is really interesting as well.

The “yes, and” part is that you can have more fun playing with the models’ ability to analyze their own thinking by using the “text wave” idea.

4. baq ◴[] No.45780120[source]
I think it’s perfectly clear: the model must know it’s been tampered with because it reports tampering before it reports which concept has been injected into its internal state. It can only do this if it has introspection capabilities.