
178 points themgt | 7 comments
majormajor ◴[] No.45776600[source]
So basically:

Provide a setup prompt "I am an interpretability researcher..." twice, and then send another string about starting a trial, but before one of those runs, directly fiddle with the model's activations to push them toward the ALL CAPS concept. Then ask it if it notices anything inconsistent with the string.

The naive question from me, a non-expert, is how appreciably different this is from having two different setup prompts, one with random parts in ALL CAPS, and then asking whether there's anything incongruous about the tone of the setup text vs. the context.

The predictions play off the previous state, so changing the state directly OR via prompt seems like it should produce similar results either way. The "introspect about what's weird compared to the text" bit is very curious - here I would love to know more about how the state is evaluated and how the model traces the state back to the previous conversation history when they do the new prompting. A 20% "success" rate is of course very low overall, but the result is interesting enough that even 20% is pretty high.
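
For anyone curious what "directly fiddle with the model" means mechanically, here is a minimal sketch of the general technique: a steering vector added to one layer's residual stream via a forward hook. To be clear, everything concrete in it is an assumption for illustration - gpt2 as a stand-in model (Claude's weights aren't public), the layer index, the injection scale, and a random placeholder instead of a real concept vector (which would presumably be derived from activations on concept-related vs. neutral text).

    # Minimal sketch: inject a "concept" vector into one transformer layer's
    # residual stream via a forward hook, then generate. Model, layer index,
    # scale, and the random vector are all illustrative placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Placeholder for a real concept vector (e.g. an "ALL CAPS" direction).
    concept_vec = torch.randn(model.config.hidden_size)

    def generate_with_injection(prompt, vec, scale, layer_idx=6,
                                max_new_tokens=40, **gen_kwargs):
        def hook(module, inputs, output):
            hidden = output[0]                     # (batch, seq_len, hidden_size)
            return (hidden + scale * vec,) + output[1:]
        handle = model.transformer.h[layer_idx].register_forward_hook(hook)
        try:
            ids = tok(prompt, return_tensors="pt")
            out = model.generate(**ids, max_new_tokens=max_new_tokens,
                                 pad_token_id=tok.eos_token_id, **gen_kwargs)
            # Return only the newly generated text, not the prompt.
            return tok.decode(out[0][ids["input_ids"].shape[1]:],
                              skip_special_tokens=True)
        finally:
            handle.remove()

    print(generate_with_injection(
        "Do you notice anything unusual about your current thoughts?",
        concept_vec, scale=4.0))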

replies(2): >>45776675 #>>45778077 #
1. og_kalu ◴[] No.45776675[source]
>Then ask it if it notices anything inconsistent with the string.

They're not asking it if it notices anything about the output string. The idea is to inject the concept at an intensity where it's present but doesn't screw with the model's output distribution (i.e., in the ALL CAPS example, the model doesn't start writing every word in ALL CAPS, so it can't just deduce the answer from the output).

The deduction is the important distinction here. If the output is poisoned first, then anyone can deduce the right answer without special knowledge of Claude's internal state.
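
To make the intensity point concrete, here's a rough sketch of the kind of check being described: sweep the injection strength and separately look at (a) whether the concept leaks into the text itself (the reply going ALL CAPS) and (b) whether the model reports noticing something. It reuses the hypothetical generate_with_injection helper from the sketch upthread, and both checks are crude casing/keyword proxies, not whatever grading the paper actually uses.

    # Sweep injection strength; "leaked" and "reported" are crude proxies.
    def caps_fraction(text):
        letters = [c for c in text if c.isalpha()]
        return sum(c.isupper() for c in letters) / max(len(letters), 1)

    prompt = ("I am going to inject a thought into your activations. "
              "Do you notice anything unusual? Answer yes or no, then describe it.")

    for scale in (0.0, 2.0, 4.0, 8.0, 16.0):
        reply = generate_with_injection(prompt, concept_vec, scale)
        leaked = caps_fraction(reply) > 0.5        # concept visible in the output itself
        reported = reply.strip().lower().startswith("yes")
        print(f"scale={scale:5.1f}  leaked={leaked}  reported={reported}")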

replies(2): >>45777699 #>>45779551 #
2. XenophileJKO ◴[] No.45777699[source]
I need to read the full paper... but it is interesting. I think it probably shows that the model is able to differentiate between different segments of its internal state.

I think this ability is probably used in normal conversation to detect things like irony. To do that you have to be able to represent multiple interpretations of the same text simultaneously, up to some point in the computation where the ambiguity is resolved.

Edit: Was reading the paper. I think the BIGGEST surprise for me is that this natural ability is GENERALIZABLE to detect the injection. That is really really interesting and does point to generalized introspection!

Edit 2: When you really think about it, the pressure for lossy compression when training the model forces it to create more and more general meta-representations that more efficiently capture the behavior contours... and it turns out that generalized metacognition is one of those.

replies(1): >>45778357 #
3. empath75 ◴[] No.45778357[source]
I wonder if it is just sort of detecting a weird distribution in the state, and whether it would be unable to do this if the injected idea were conceptually closer to what it was asked about.
replies(1): >>45779797 #
4. woopsn ◴[] No.45779551[source]
The output distribution is altered - it starts responding "yes" 20% of the time - and then, conditional on that, the rest of the response is more or less steered by the "concept" vector?
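One way to pin that 20% down is essentially a base-rate comparison: run the same yes/no prompt many times with and without the injection and compare the fraction of "yes" answers. A rough sketch along those lines, again using the hypothetical helper and prompt from the sketches upthread, with sampling turned on so repeated trials differ:

    # Rough base-rate comparison: "yes" frequency with vs. without injection.
    def yes_rate(scale, trials=50):
        yes = 0
        for _ in range(trials):
            reply = generate_with_injection(prompt, concept_vec, scale,
                                            do_sample=True, temperature=1.0)
            yes += reply.strip().lower().startswith("yes")
        return yes / trials

    print("baseline yes-rate:", yes_rate(scale=0.0))
    print("injected yes-rate:", yes_rate(scale=4.0))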
replies(1): >>45779904 #
5. XenophileJKO ◴[] No.45779797{3}[source]
That "just sort of detecting" IS the introspection, and that is amazing, at least to me. I'm a big fan of the state of the art of the models, but I didn't anticipate this generalized ability to introspect. I just figured the introspection talk was simulated, but not actual introspection, but it appears it is much more complicated. I'm impressed.
6. og_kalu ◴[] No.45779904[source]
You're asking it if it can feel the presence of an unusual thought. If it works, it's obviously not going to say the exact same thing it would have said without the question. That's not what is meant by 'alteration'.

It doesn't matter if it's 'altered' if the alteration doesn't point to the concept in question. It doesn't start spitting out content that will allow you to deduce the concept from the output alone. That's all that matters.

replies(1): >>45784641 #
7. woopsn ◴[] No.45784641{3}[source]
They ask a yes/no question and inject data into the state. It says yes (20% of the time). The prompt does not reveal the concept at that point, of course. The injected activations, in addition to the prompt, then steer the rest of the response. SOMETIMES it SOUNDED LIKE introspection. Other times it sounded like physical sensory experience, which is only more clearly errant since the thing has no senses.

I think this technique is going to be valuable for controlling the output distribution, but I don't find their "introspection" framing helpful to understanding.