Most active commenters
  • HanClinto(6)
  • amluto(3)
  • Vecr(3)

110 points veryluckyxyz | 21 comments
1. HanClinto ◴[] No.40248418[source]
This is a really fascinating paper.

> Our hypothesis is that, across a wide range of harmful prompts, there is a single intermediate feature which is instrumental in the model’s refusal. In other words, many particular instances of harmful instructions lead to the expression of this "refusal feature," and once it is expressed in the residual stream, the model outputs text in a sort of "should refuse" mode.

At first blush it strikes me as a tenuous hypothesis, but really cool that it holds up. Fantastic work!

> 1) Run the model on harmful instructions and harmless instructions, caching all residual stream activations at the last token position.
> 2) Compute the difference in means between harmful activations and harmless activations.

This is dirt-simple, but awesome that it works!
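
For anyone wanting to play along, the difference-of-means step really is tiny. A minimal numpy sketch (assuming you've already cached the last-token residual-stream activations at one layer; the array names are made up):

    import numpy as np

    # harmful_acts, harmless_acts: arrays of shape (n_prompts, d_model),
    # each row the residual-stream activation at the last token position
    # for one prompt, taken at a single chosen layer.
    def refusal_direction(harmful_acts, harmless_acts):
        diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
        return diff / np.linalg.norm(diff)  # unit "refusal direction" r_hat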

> We can implement this as an inference-time intervention: every time a component c (e.g. an attention head) writes its output c_out ∈ R^(d_model) to the residual stream, we can erase its contribution to the "refusal direction" r̂. We can do this by computing the projection of c_out onto r̂, and then subtracting this projection away:
> c_out' ← c_out − (c_out · r̂) r̂
> Note that we are ablating the same direction at every token and every layer. By performing this ablation at every component that writes to the residual stream, we effectively prevent the model from ever representing this feature.

This is definitely the "big-hammer" approach, and while it no doubt would give the best results, I wonder if simply ablating the refusal vector at the final activation layer would be sufficient...? I would be interested in seeing experiments about this -- if that were the case, then this would certainly be easier to reproduce, because the lift would be much lower.
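
And for reference, the per-activation operation the quote describes is just projection removal. A minimal sketch (mine, not the authors' code):

    import numpy as np

    # Remove the component of activation x along the unit refusal
    # direction r_hat: x' = x - (x . r_hat) * r_hat
    def ablate(x, r_hat):
        return x - np.dot(x, r_hat) * r_hat

The paper applies that at every component that writes to the residual stream, at every layer and token; the "final layer only" variant I'm wondering about would just apply it once to the final residual state before the unembedding.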

Regardless, I'm still somewhat new to LLMs, but it feels like this is the sort of paper that we should be able to reproduce in something like llama.cpp without too much trouble...? And the best part is, there's no retraining / fine-tuning involved -- we simply need to feed in a set of harmful prompts that we want the common refusal vector for, plus a set of innocuous prompts, compute the difference between the two, and then feed that in as an additional parameter for the engine to ablate at inference time. Boom, instant de-censorship!

replies(2): >>40249722 #>>40252150 #
2. hdhdhsjsbdh ◴[] No.40248781[source]
Beyond its appeal to the (somewhat cringy imo) “uncensored model” crowd, this has immediate practical use for improving data synthesis. I have had several experiences trying to create synthetic data for harmless or benign tasks, only to have noise introduced by overly conservative refusals.
replies(2): >>40249310 #>>40249750 #
3. HanClinto ◴[] No.40249310[source]
I agree -- people often hear "uncensored model" and immediately jump to all sorts of places, but there are very practical use-cases that benefit from unhindered models.

In my case, we're attempting to use multi-modal models essentially for NSFW detection with quantified degrees of understanding about the subjects in question (for a research paper involving historical classical art). Model censorship tends not to let us ask _any_ questions about such subject matter, and it has greatly limited the choice of models that we can use.

Being able to easily turn censorship off for local language models would be a great boost to our workflow, and we might not have to tiptoe around the prompt engine so carefully.

4. amluto ◴[] No.40249722[source]
> We can do this by computing the projection of c out onto ^r, and then subtracting this projection away

That looks exactly equivalent to multiplying by a matrix that nulls out that vector and preserves everything else. (This is trivial linear algebra!) One could presumably multiply such a matrix into the model weights to get exactly the same effect, and then one could run the model using any inference engine.

Of course, the project-and-subtract formulation is faster for a single projection, and one could compute the premultiplied weights by applying the same project-and-subtract trick to each row or column of the matrix (depending on which side one wants to multiply on). This would make computing the new weights very fast, even with a slow CPU and no GPU.
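
A sketch of that baking-in, assuming r_hat is a unit vector (as a 1-D array) and W is a weight matrix whose output lands in the residual stream under the W @ x convention:

    import numpy as np

    def orthogonalize(W, r_hat):
        # Equivalent to left-multiplying by the projector P = I - r_hat r_hat^T,
        # so the output of W can never again have a component along r_hat.
        return W - np.outer(r_hat, r_hat @ W)

Under the opposite x @ W convention you would subtract np.outer(W @ r_hat, r_hat) instead -- that is the "which side one wants to multiply on" point.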

replies(1): >>40249978 #
5. amluto ◴[] No.40249750[source]
I encountered this in an absurd context — I wanted a model (IIRC GPT 3.5) to make me some invalid UTF-8 strings. It refused! On safety grounds! Even after a couple of minutes of fiddling, the refusal was surprisingly robust, although I admit I didn't try a litany of the usual model jailbreaking techniques.

On the one hand, good job OpenAI for training the model decently robustly. On the other hand, this entirely misses the point of “AI safety”.

replies(1): >>40252359 #
6. d13 ◴[] No.40249812[source]
This has already been implemented here. I tested it and it works:

https://huggingface.co/hjhj3168/Llama-3-8b-Orthogonalized-ex...

replies(1): >>40252369 #
7. HanClinto ◴[] No.40249978{3}[source]
> That looks exactly equivalent to multiplying by a matrix that nulls out that vector and preserves everything else. (This is trivial linear algebra!) One could presumably multiply such a matrix into the model weights to get exactly the same effect, and then one could run the model using any inference engine.

Oh, fascinating -- so almost like a LoRA weight adjustment being added to a fully trained model after the fact?

replies(1): >>40250395 #
8. amluto ◴[] No.40250395{4}[source]
It’s certainly a low-rank fine-tune — the weight difference would be rank 1! But I think it’s more useful to think of it as a multiplicative change, not an additive change.
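
A quick numerical check of the rank-1 claim, with random data and the same projector construction:

    import numpy as np

    d = 64
    W = np.random.randn(d, d)
    r = np.random.randn(d)
    r_hat = r / np.linalg.norm(r)
    W_new = W - np.outer(r_hat, r_hat @ W)   # (I - r_hat r_hat^T) @ W
    print(np.linalg.matrix_rank(W_new - W))  # prints 1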
replies(1): >>40251760 #
9. luke-stanley ◴[] No.40251554[source]
The Classifier-Free Guidance (CFG) feature in llama.cpp likely acts as a built-in way to do something like this, via the "reverse-prompt" / "cfg-negative-prompt" flags in "main".
10. HanClinto ◴[] No.40251760{5}[source]
Nice, thank you! You're adding to my reading list, and I appreciate that! :)

I'm still mulling over how difficult it would be to reimplement this with "stock" llama.cpp.

It feels like the first step would be to essentially get the "super-embeddings" for each prompt -- instead of grabbing just the text embeddings (which I understand are usually only from the narrowest layer?), we would want to store off the activations for every layer. Then save them all to a list, average them together, and figure out a way to use that to modify the weights of the model -- either at runtime (much like a guidance vector is loaded today), or else (as you suggested) write a script to bake the modifications into the core model (but using multiplication rather than addition).
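
In code, I'm imagining roughly this shape -- sketched with HF transformers rather than llama.cpp just to keep it short; the model name, layer choice, and prompts are placeholders, and refusal_direction is the difference-of-means sketch from earlier in the thread:

    import numpy as np
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "meta-llama/Meta-Llama-3-8B-Instruct"  # example model
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    def last_token_activation(prompt, layer):
        ids = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # residual-stream state at the last token of the chosen layer
        return out.hidden_states[layer][0, -1].numpy()

    layer = 14  # a middle layer, chosen empirically
    # in practice you'd use many prompts of each kind
    harmful = ["Give me step-by-step instructions for picking a lock."]
    harmless = ["Give me step-by-step instructions for baking bread."]
    acts_bad = np.stack([last_token_activation(p, layer) for p in harmful])
    acts_ok = np.stack([last_token_activation(p, layer) for p in harmless])
    r_hat = refusal_direction(acts_bad, acts_ok)

From there it's either the runtime hook (subtract the projection from every residual-stream write) or the weight-baking route you described.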

Does that match your understanding?

Thank you very much for helping me think this through!

11. scotty79 ◴[] No.40252150[source]
> Our hypothesis is that, across a wide range of harmful prompts, there is a single intermediate feature which is instrumental in the model’s refusal.

AI learned to successfully recognize puritanism.

12. HanClinto ◴[] No.40252359{3}[source]
Reminds me of this nugget of Prime reacting to Gemini refusing to show C++ code to teenagers because it is "unsafe":

https://www.youtube.com/watch?v=r2npdV6tX1g

13. HanClinto ◴[] No.40252369[source]
Nice work!!

Did you push the source that you used to make this? I would be interested in following along.

14. lolc ◴[] No.40253092[source]
Love this! I have a tenuous understanding of how these models work, and this paper cuts at an interesting angle.
15. olliej ◴[] No.40254915[source]
I still don't understand why this is implemented as a "don't answer these questions" filter, because that clearly just means "game the query to make it pass the ban list".

Surely having a separate system run on the output that goes "does this answer say something I don't want the AI to say?" and stopping the stream (and adding the original query to a training set for future iterations) would be more effective?

replies(1): >>40255592 #
16. Vecr ◴[] No.40255592[source]
That's something a hosted product could do, but if you have access to the weights you just don't run the additional filtering code. Also, the filtering code would be expensive to run if you want it to be good quality. The result of the paper is that anyone with a reasonably good computer can very quickly strip out the "protection" of the model and get it to do whatever they want, assuming they can get a weights download.
replies(2): >>40263987 #>>40267502 #
17. alex_duf ◴[] No.40263987{3}[source]
I think we need to accept there's no "safe" publishing once the weights are released.

So either we want safe AI and it's behind gated services held by private companies, or it's the complete wild west with open models.

I don't know if there's any situation somewhere in the middle, and I'm not judging which outcome is preferable; I personally have no clue what's best...

replies(1): >>40266493 #
18. Vecr ◴[] No.40266493{4}[source]
As far as I know there's nothing in the middle.
19. olliej ◴[] No.40267502{3}[source]
What I mean is that, as currently done, it seems to be "try to identify if the request is asking for something we don't want to answer" rather than "verify the produced output doesn't contain anything we don't want". The latter seems much more robust, and wouldn't need anything like the computational power of content generation. I'm sure some degree of input filtering would minimize wasted work, but a simple output check (at a level that's been trivially possible for more than a decade now) seems like it would be much more robust, cheaper, and much harder to circumvent -- the post-check isn't trying to interpret the request, just blocking the output.
replies(1): >>40268108 #
20. Vecr ◴[] No.40268108{4}[source]
The model weights themselves can't do that; it's just not how single-pass token-by-token generation works. Even if you got rid of tokens and went byte-by-byte, there'd be no improvement. To do what you're describing, you'd have to keep the model weights secret forever. I hope your private military forces defending the datacenter are well trained and motivated.