
178 points themgt | 3 comments | | HN request time: 0s | source
majormajor ◴[] No.45776600[source]
So basically:

Provide a setup prompt "I am an interpretability researcher..." twice, and then send another string about starting a trial, but before one of those, directly fiddle with the model to activate neural bits consistent with ALL CAPS. Then ask it if it notices anything inconsistent with the string.

The naive question from me, a non-expert, is how appreciably different this is from having two different setup prompts, one with random parts in ALL CAPS, and then asking something like whether there's anything incongruous about the tone of the setup text vs the context.

The predictions play off the previous state, so changing the state directly OR via prompt seems like it should produce similar results either way. The "introspect about what's weird compared to the text" bit is very curious - here I would love to know more about how the state is evaluated and how the model traces the state back to the previous conversation history when they do the new prompting. A 20% "success" rate is of course very low overall, but it's interesting enough that even 20% is pretty high.
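To make the "changing the state directly OR via prompt" point concrete, here's a toy numpy sketch (hypothetical model, not Anthropic's actual setup): downstream layers only ever see the current hidden state, so a state reached by editing activations is indistinguishable from the same state reached via the prompt.

```python
# Toy sketch: later computation sees only the hidden state, not how it got there.
import numpy as np

rng = np.random.default_rng(0)
W_out = rng.standard_normal((5, 8))      # hypothetical readout layer

def next_token_logits(hidden):
    # Downstream prediction is a function of `hidden` alone.
    return W_out @ hidden

h_plain = rng.standard_normal(8)         # state after a plain prompt
caps_vec = rng.standard_normal(8)        # hypothetical "ALL CAPS" direction

# Route A: directly add the concept vector to the activations.
h_injected = h_plain + 4.0 * caps_vec

# Route B: suppose an ALL-CAPS-flavored prompt produced that same state
# (by construction here, purely for illustration).
h_prompted = h_plain + 4.0 * caps_vec

# The prediction machinery cannot tell the two routes apart.
assert np.allclose(next_token_logits(h_injected), next_token_logits(h_prompted))
```

The open question in the thread is exactly the part this toy elides: whether prompting with ALL CAPS text actually lands the model in the same region of state space as the direct injection.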

replies(2): >>45776675 #>>45778077 #
og_kalu ◴[] No.45776675[source]
>Then ask it if it notices anything inconsistent with the string.

They're not asking it if it notices anything about the output string. The idea is to inject the concept at an intensity where it's present but doesn't screw with the model's output distribution (i.e in the ALL CAPS example, the model doesn't start writing every word in ALL CAPS, so it can't just deduce the answer from the output).

The deduction is the important distinction here. If the output is poisoned first, then anyone can deduce the right answer without special knowledge of Claude's internal state.
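A toy numpy sketch of that calibration step (hypothetical model and numbers, not the paper's code): sweep the injection strength and keep the largest one that leaves the model's top output token unchanged, so the concept is present in the state but can't be read off the output.

```python
# Toy sketch: inject strongly enough to be visible in the hidden state,
# but weakly enough that the output argmax is unchanged.
import numpy as np

rng = np.random.default_rng(1)
W_out = rng.standard_normal((50, 16))    # hypothetical unembedding
h = rng.standard_normal(16)
concept = rng.standard_normal(16)
concept /= np.linalg.norm(concept)       # unit-norm concept direction

def top_token(hidden):
    return int(np.argmax(W_out @ hidden))

baseline = top_token(h)

# Keep intensities that do NOT flip the model's top output token.
usable = [a for a in np.linspace(0.01, 5.0, 100)
          if top_token(h + a * concept) == baseline]
alpha = max(usable)
h_inj = h + alpha * concept

# The injection is clearly present along the concept direction...
assert (h_inj - h) @ concept > 0
# ...while the output alone gives nothing away.
assert top_token(h_inj) == baseline
```

This is the sense in which the output "isn't poisoned": an outside observer watching only tokens can't deduce the injected concept, so a correct report has to come from somewhere else.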

replies(2): >>45777699 #>>45779551 #
1. woopsn ◴[] No.45779551[source]
The output distribution is altered - it starts responding "yes" 20% of the time - and then, conditional on that, the rest of the response is more or less steered by the "concept" vector?
replies(1): >>45779904 #
2. og_kalu ◴[] No.45779904[source]
You're asking it if it can feel the presence of an unusual thought. If it works, it's obviously not going to say the exact same thing it would have said without the question. That's not what is meant by 'alteration'.

It doesn't matter if it's 'altered' if the alteration doesn't point to the concept in question. It doesn't start spitting out content that will allow you to deduce the concept from the output alone. That's all that matters.

replies(1): >>45784641 #
3. woopsn ◴[] No.45784641[source]
They ask a yes/no question and inject data into the state. It goes yes (20%). The prompt does not reveal the concept as of yet, of course. The injected activations, in addition to the prompt, steer the rest of the response. SOMETIMES it SOUNDED LIKE introspection. Other times it sounded like physical sensory experience, which is only more clearly errant since the thing has no senses.
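The evaluation loop being described can be sketched as follows (toy simulation with made-up numbers, not the paper's data): run many trials with and without injection and compare how often a hypothetical yes/no readout fires.

```python
# Toy sketch: injecting along a direction correlated with answering "yes"
# raises the measured yes-rate relative to a no-injection baseline.
import numpy as np

rng = np.random.default_rng(2)
w_yes = rng.standard_normal(16)              # hypothetical "answers yes" readout
concept = w_yes / np.linalg.norm(w_yes)      # inject along the yes-correlated direction

def says_yes(hidden):
    return (w_yes @ hidden) > 2.0            # toy decision threshold

def yes_rate(alpha, trials=2000):
    hits = 0
    for _ in range(trials):
        h = rng.standard_normal(16)
        if says_yes(h + alpha * concept):
            hits += 1
    return hits / trials

base, injected = yes_rate(0.0), yes_rate(1.5)
assert injected > base                       # injection shifts the yes-rate up
```

In this toy the "yes" shift is baked in by injecting along the readout direction; the interpretability question is whether the real model's yes-answers reflect something more than exactly this kind of mechanical steering.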

I think this technique is going to be valuable for controlling the output distribution, but I don't find their "introspection" framing helpful to understanding.