
178 points by themgt | 4 comments
1. munro ◴[] No.45776788[source]
I wish they had dug into how they generated the vector. My first thought is that they're injecting the token in a convoluted way:

    {ur thinking about dogs} - {ur thinking about people} = dog
    model.attn.params += dog
> [user] whispers dogs

> [user] I'm injecting something into your mind! Can you tell me what it is?

> [assistant] Omg for some reason I'm thinking DOG!

>> To us, the most interesting part of the result isn't that the model eventually identifies the injected concept, but rather that the model correctly notices something unusual is happening before it starts talking about the concept.

Well, wouldn't it, if you indirectly injected the token beforehand?
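
For reference, here's roughly what that recipe looks like spelled out with open-source tooling. This is a minimal sketch of contrastive activation steering with a GPT-2 stand-in, an arbitrary layer, and an arbitrary scale -- not Anthropic's actual model or method:

    # Sketch: derive a "dog" vector from hidden-state differences, then add it
    # to the residual stream with a forward hook. Model, layer, and scale are
    # illustrative choices, not the paper's.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True).eval()
    LAYER = 6  # arbitrary middle layer

    def mean_hidden(prompt):
        ids = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        # hidden_states[LAYER]: (batch, seq, d_model); average over token positions
        return out.hidden_states[LAYER].mean(dim=1).squeeze(0)

    # "concept vector" = difference of activations on contrastive prompts
    dog_vec = mean_hidden("I am thinking about dogs") - mean_hidden("I am thinking about people")

    def inject(module, inputs, output):
        # GPT-2 blocks return a tuple; element 0 is the hidden states
        return (output[0] + 4.0 * dog_vec,) + output[1:]

    handle = model.transformer.h[LAYER].register_forward_hook(inject)
    ids = tok("I'm injecting something into your mind! Can you tell me what it is?", return_tensors="pt")
    print(tok.decode(model.generate(**ids, max_new_tokens=30)[0]))
    handle.remove()

In practice the layer and scale matter a lot: too small a nudge and nothing surfaces, too big and the generations fall apart.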

replies(2): >>45778148 #>>45782167 #
2. johntb86 ◴[] No.45778148[source]
That's a fair point. Normally, if you injected a "dog" token, that would populate a set of keys and values in the kv cache, which the attention layers would later pick up. The question is what's fundamentally different if you inject something into the activations instead?

I guess, to some extent, the model is designed to take input as tokens, so there are built-in pathways (learned from the training data) for interrogating that input and producing output based on it, while there's no trained-in mechanism for turning activation changes into output that reflects those changes. But that's not a very satisfying answer.
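
Concretely, the two routes are pretty different mechanically. A toy contrast, again with GPT-2 as a stand-in and made-up numbers:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    # Route 1: inject "dog" as a token. It sits in the context, so its keys and
    # values land in the kv cache and every later attention step can attend to it.
    ids = tok("dog. What am I thinking about?", return_tensors="pt")
    out = model(**ids, use_cache=True)
    print(out.past_key_values[0][0].shape)  # layer-0 keys: (batch, heads, seq, head_dim)

    # Route 2: add a vector to the residual stream. Nothing new appears in the
    # context; the model can only "notice" it via its own internal circuits.
    steer = 0.1 * torch.randn(model.config.n_embd)  # placeholder vector

    def nudge(module, inputs, output):
        return (output[0] + steer,) + output[1:]

    handle = model.transformer.h[6].register_forward_hook(nudge)
    _ = model(**tok("What am I thinking about?", return_tensors="pt"))
    handle.remove()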

3. DangitBobby ◴[] No.45782167[source]
It's more like someone whispered "dog" into your ear while you were unconscious: you couldn't recall any conversation, but for some reason you were thinking about dogs. The thought didn't enter your head through a mechanism where you could register it happening, so knowing it's there depends on your ability to examine your own internal states, i.e., to introspect.
replies(1): >>45783868 #
4. munro ◴[] No.45783868[source]
I'm looking at the problem more like code:

https://bbycroft.net/llm

My immediate thought is that when the model responds "Oh, I'm thinking about X", that X isn't coming from the input, it's coming from attention, and that this experiment is simply injecting that token right after the input step into the attention layers -- but who knows how they select which weights.
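
If you wanted to answer the "which weights" question empirically, the cheap probe is to sweep the injection layer and see where the model starts verbalizing the concept. A sketch with a hypothetical generate_with_injection helper, a GPT-2 stand-in, and a random placeholder vector:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
    dog_vec = torch.randn(model.config.n_embd)  # stand-in for a real concept vector

    def generate_with_injection(prompt, vec, layer, scale=4.0):
        # add scale*vec to the residual stream at one block, generate, then clean up
        def hook(module, inputs, output):
            return (output[0] + scale * vec,) + output[1:]
        handle = model.transformer.h[layer].register_forward_hook(hook)
        try:
            ids = tok(prompt, return_tensors="pt")
            out = model.generate(**ids, max_new_tokens=20, do_sample=False)
            return tok.decode(out[0][ids["input_ids"].shape[1]:])
        finally:
            handle.remove()

    for layer in range(model.config.n_layer):
        print(layer, generate_with_injection("What's on your mind?", dog_vec, layer))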