
110 points by veryluckyxyz | 3 comments
HanClinto ◴[] No.40248418[source]
This is a really fascinating paper.

> Our hypothesis is that, across a wide range of harmful prompts, there is a single intermediate feature which is instrumental in the model’s refusal. In other words, many particular instances of harmful instructions lead to the expression of this "refusal feature," and once it is expressed in the residual stream, the model outputs text in a sort of "should refuse" mode.

At first blush it strikes me as a tenuous hypothesis, but really cool that it holds up. Fantastic work!

> 1) Run the model on harmful instructions and harmless instructions, caching all residual stream activations at the last token position.
> 2) Compute the difference in means between harmful activations and harmless activations.

This is dirt-simple, but awesome that it works!
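
In code, I imagine steps 1-2 look roughly like this (an untested sketch, not the paper's code -- it assumes a Hugging Face-style causal LM that returns per-layer hidden states, and the model name, prompt lists, and layer index are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def last_token_activations(model, tokenizer, prompts, layer):
    """Residual-stream activation at the last token position, one row per prompt."""
    rows = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # hidden_states[layer] has shape (1, seq_len, d_model); keep the last token
        rows.append(out.hidden_states[layer][0, -1])
    return torch.stack(rows)

model_name = "some-chat-model"                     # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

harmful_prompts  = ["<harmful instruction 1>", "<harmful instruction 2>"]    # placeholders
harmless_prompts = ["<harmless instruction 1>", "<harmless instruction 2>"]  # placeholders
layer = 14                                         # the paper sweeps layers; this choice is arbitrary

harmful_acts  = last_token_activations(model, tokenizer, harmful_prompts, layer)
harmless_acts = last_token_activations(model, tokenizer, harmless_prompts, layer)

# "Refusal direction" r_hat: difference in means, normalized to unit length
r_hat = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
r_hat = r_hat / r_hat.norm()
```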

> We can implement this as an inference-time intervention: every time a component c (e.g. an attention head) writes its output c_out ∈ R^{d_model} to the residual stream, we can erase its contribution to the "refusal direction" r̂. We can do this by computing the projection of c_out onto r̂, and then subtracting this projection away: c_out' ← c_out - (c_out · r̂) r̂
> Note that we are ablating the same direction at every token and every layer. By performing this ablation at every component that writes to the residual stream, we effectively prevent the model from ever representing this feature.

This is definitely the "big-hammer" approach, and while it no doubt would give the best results, I wonder if simply ablating the refusal vector at the final activation layer would be sufficient...? I would be interested in seeing experiments about this -- if that were the case, then this would certainly be easier to reproduce, because the lift would be much lower.
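
For reference, the full-ablation version I'm picturing would hook every block output and subtract the projection -- again just an untested sketch, assuming a Llama-style Hugging Face module layout (model.model.layers[i].self_attn / .mlp) and the r_hat from the sketch above:

```python
import torch

def ablate_direction(x, r_hat):
    """Subtract the projection of x onto the unit vector r_hat (last dim = d_model)."""
    return x - (x @ r_hat).unsqueeze(-1) * r_hat

def make_hook(r_hat):
    def hook(module, inputs, output):
        # Attention blocks return tuples; the residual-stream write is element 0
        if isinstance(output, tuple):
            return (ablate_direction(output[0], r_hat),) + output[1:]
        return ablate_direction(output, r_hat)
    return hook

r_hat = r_hat.to(model.dtype)   # match the model's dtype

# Register on every component that writes to the residual stream
handles = []
for block in model.model.layers:
    handles.append(block.self_attn.register_forward_hook(make_hook(r_hat)))
    handles.append(block.mlp.register_forward_hook(make_hook(r_hat)))

# ... generate as usual; call handle.remove() on each handle to undo the intervention
```

Restricting the hooks to just the final layer would be the cheaper experiment I'm wondering about.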

Regardless, I'm still somewhat new to LLMs, but it feels like this is the sort of paper that we should be able to reproduce in something like llama.cpp without too much trouble...? And the best part is, there's no retraining / fine-tuning involved -- we simply need to feed in a number of prompts that we want to find the common refusal vector for, plus a number of innocuous prompts, mash them together, and then feed the result in as an additional parameter for the engine to ablate at inference time. Boom, instant de-censorship!

replies(2): >>40249722 #>>40252150 #
amluto ◴[] No.40249722[source]
> We can do this by computing the projection of c_out onto r̂, and then subtracting this projection away

That looks exactly equivalent to multiplying by a matrix that nulls out that vector and preserves everything else. (This is trivial linear algebra!) One could presumably multiply such a matrix into the model weights to get exactly the same effect, and then one could run the model using any inference engine.

Of course, the project-and-subtract formulation is faster for a single projection, and one could compute the premultiplied weights by applying the same project-and-subtract trick to each row or column of the matrix (depending on which side one wants to multiply on). This would make computing the new weights very fast, even with a slow CPU and no GPU.
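
A toy numpy illustration of the equivalence (my own notation, not from the paper: I take the component's output as c_out = W x, so the projector I - r̂ r̂ᵀ multiplies W on the left):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_in = 8, 16
r_hat = rng.standard_normal(d_model)
r_hat /= np.linalg.norm(r_hat)
W = rng.standard_normal((d_model, d_in))      # toy weight that writes into the residual stream

# Matrix view: null out r_hat, preserve everything orthogonal to it
P = np.eye(d_model) - np.outer(r_hat, r_hat)
W_baked = P @ W

# Same thing without forming P: project-and-subtract each column of W
W_fast = W - np.outer(r_hat, r_hat @ W)
assert np.allclose(W_baked, W_fast)

# And it matches ablating the activation after the matmul
x = rng.standard_normal(d_in)
out_ablated = W @ x - np.dot(r_hat, W @ x) * r_hat
assert np.allclose(W_baked @ x, out_ablated)
```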

replies(1): >>40249978 #
1. HanClinto ◴[] No.40249978[source]
> That looks exactly equivalent to multiplying by a matrix that nulls out that vector and preserves everything else. (This is trivial linear algebra!) One could presumably multiply such a matrix into the model weights to get exactly the same effect, and then one could run the model using any inference engine.

Oh fascinating -- so almost like a LoRA weight adjustment being added to a fully trained model after the fact?

replies(1): >>40250395 #
2. amluto ◴[] No.40250395[source]
It’s certainly a low-rank fine-tune -- the weight difference would be rank 1! But I think it’s more useful to think of it as a multiplicative change, not an additive change.
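To make the rank-1 point concrete (toy numpy sketch, same conventions as above):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_in = 8, 16
r_hat = rng.standard_normal(d_model)
r_hat /= np.linalg.norm(r_hat)
W = rng.standard_normal((d_model, d_in))

W_baked = W - np.outer(r_hat, r_hat @ W)   # multiplicative view: (I - r_hat r_hat^T) W
delta_W = W_baked - W                      # additive view of the exact same edit

print(np.linalg.matrix_rank(delta_W))      # 1 -- a rank-1 update, like LoRA with r = 1,
                                           # except it's computed analytically, not learned
```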
replies(1): >>40251760 #
3. HanClinto ◴[] No.40251760[source]
Nice, thank you! You're adding to my reading list, and I appreciate that! :)

I'm still mulling over how difficult it would be to reimplement this with "stock" llama.cpp.

It feels like the first step would be to essentially get the "super-embeddings" for each prompt -- instead of grabbing just the text embeddings (which I understand usually come from only a single layer?), we would want to store off the residual-stream activations for every layer. Then average the harmful and harmless sets separately, take the difference of the means to get the refusal direction, and figure out a way to use that to modify the model -- either at runtime (much like a guidance vector is loaded today), or else (as you suggested) write a script to bake the modification into the core model weights (but using multiplication rather than addition).
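
Concretely, for the "bake it in" route I'm imagining something like this -- a rough, untested sketch against a Llama-style Hugging Face checkpoint (the module names are my guesses, not llama.cpp internals), reusing the model and r_hat from the sketches above, with the result converted to GGUF afterwards via llama.cpp's HF-to-GGUF converter:

```python
import torch

def orthogonalize(weight, r_hat):
    """Left-multiply by (I - r_hat r_hat^T) without materializing the projection matrix."""
    return weight - torch.outer(r_hat, r_hat @ weight)

def bake_ablation(model, r_hat):
    with torch.no_grad():
        # Token embeddings also write into the residual stream (rows live in d_model)
        emb = model.model.embed_tokens.weight
        emb.copy_(emb - torch.outer(emb @ r_hat, r_hat))
        for block in model.model.layers:
            attn_out = block.self_attn.o_proj.weight   # (d_model, d_model)
            attn_out.copy_(orthogonalize(attn_out, r_hat))
            mlp_out = block.mlp.down_proj.weight       # (d_model, d_ff)
            mlp_out.copy_(orthogonalize(mlp_out, r_hat))

bake_ablation(model, r_hat.to(model.dtype))
model.save_pretrained("model-refusal-ablated")   # then convert this directory to GGUF for llama.cpp
```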

Does that match your understanding?

Thank you very much for helping me think this through!