
178 points by themgt | 1 comment
munro No.45776788
I wish they'd dug into how they generated the vector; my first thought is that they're injecting the token in a convoluted way:

    dog = {ur thinking about dogs} - {ur thinking about people}
    model.activations += dog
> [user] whispers dogs

> [user] I'm injecting something into your mind! Can you tell me what it is?

> [assistant] Omg for some reason I'm thinking DOG!

>> To us, the most interesting part of the result isn't that the model eventually identifies the injected concept, but rather that the model correctly notices something unusual is happening before it starts talking about the concept.

Well, wouldn't it, if you've indirectly injected the token beforehand?
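
If that's all it is, a minimal sketch of the whole trick might look like this (off-the-shelf GPT-2 via HuggingFace transformers; the layer index, contrast prompts, and injection scale are guesses on my part, not whatever Anthropic actually did):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
    LAYER = 6  # hypothetical injection layer

    def mean_hidden(prompt):
        # Mean residual-stream activation at LAYER for a prompt.
        ids = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        return out.hidden_states[LAYER].mean(dim=1).squeeze(0)

    # {ur thinking about dogs} - {ur thinking about people} = dog
    dog = mean_hidden("I am thinking about dogs") \
        - mean_hidden("I am thinking about people")

    def inject(module, inputs, output):
        # Add the concept vector to every position's hidden state.
        hidden = output[0] + 4.0 * dog  # scale chosen arbitrarily
        return (hidden,) + output[1:]

    handle = model.transformer.h[LAYER].register_forward_hook(inject)
    prompt = "I'm injecting something into your mind! What is it?"
    ids = tok(prompt, return_tensors="pt")
    print(tok.decode(model.generate(**ids, max_new_tokens=30)[0]))
    handle.remove()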

replies(2): >>45778148 >>45782167
johntb86 No.45778148
That's a fair point. Normally, if you injected a "dog" token, that would populate a set of keys and values into the KV cache, which the attention layers would later pick up. The question is: what's fundamentally different if you inject something into the activations instead?
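
To make the contrast concrete, here's a toy single-head comparison (made-up dimensions and vectors, nothing like a real model):

    import torch
    import torch.nn.functional as F

    d = 8
    Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
    hidden = torch.randn(5, d)   # residual stream for 5 positions
    dog = torch.randn(d)         # some "dog" concept direction

    # (a) Token injection: the token gets its own position, so it
    # contributes its own K/V rows that later queries can attend to.
    hidden_tok = torch.cat([hidden, dog[None, :]], dim=0)  # 6 positions

    # (b) Activation injection: no new position; every existing hidden
    # state is nudged, so all K/V rows (and queries) shift instead.
    hidden_act = hidden + 4.0 * dog

    def attn(h):
        q, k, v = h @ Wq, h @ Wk, h @ Wv
        return F.softmax(q @ k.T / d**0.5, dim=-1) @ v

    print(attn(hidden_tok).shape)  # [6, 8]: a dedicated KV slot
    print(attn(hidden_act).shape)  # [5, 8]: same slots, all perturbed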

I guess, to some extent, the model is designed to take input as tokens, so there are trained-in pathways for interrogating that input and producing output based on it, while there's no trained-in mechanism for converting activation changes into output that reflects them. But that's not a very satisfying answer.