
110 points veryluckyxyz | 1 comment
HanClinto ◴[] No.40248418[source]
This is a really fascinating paper.

> Our hypothesis is that, across a wide range of harmful prompts, there is a single intermediate feature which is instrumental in the model’s refusal. In other words, many particular instances of harmful instructions lead to the expression of this "refusal feature," and once it is expressed in the residual stream, the model outputs text in a sort of "should refuse" mode.

At first blush it strikes me as a tenuous hypothesis, but really cool that it holds up. Fantastic work!

> 1) Run the model on harmful instructions and harmless instructions, caching all residual stream activations at the last token position
> 2) Compute the difference in means between harmful activations and harmless activations.

This is dirt-simple, but awesome that it works!
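
Something like the following rough PyTorch sketch of that step, I think -- assuming the last-token residual activations have already been cached into two tensors (my own illustration, not the paper's code):

    import torch

    def refusal_direction(harmful_acts: torch.Tensor,
                          harmless_acts: torch.Tensor) -> torch.Tensor:
        # harmful_acts:  [n_harmful,  d_model] last-token residual activations
        # harmless_acts: [n_harmless, d_model] activations at the same layer/position
        diff = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
        return diff / diff.norm()  # unit-norm direction r_hat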

> We can implement this as an inference-time intervention: every time a component c (e.g. an attention head) writes its output c_out ∈ ℝ^{d_model} to the residual stream, we can erase its contribution to the "refusal direction" r̂. We can do this by computing the projection of c_out onto r̂, and then subtracting this projection away: c_out' ← c_out - (c_out · r̂) r̂
> Note that we are ablating the same direction at every token and every layer. By performing this ablation at every component that writes the residual stream, we effectively prevent the model from ever representing this feature.

This is definitely the "big-hammer" approach, and while it no doubt would give the best results, I wonder if simply ablating the refusal vector at the final activation layer would be sufficient...? I would be interested in seeing experiments about this -- if that were the case, then this would certainly be easier to reproduce, because the lift would be much lower.
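
For the intervention itself, the projection-and-subtract step is just a couple of lines; wiring it into the model is the fiddly part. A hypothetical sketch (module names and hook outputs vary by implementation, so treat this as pseudocode-ish):

    import torch

    def ablate_direction(x: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
        # Remove the component of x along r_hat: x - (x . r_hat) * r_hat
        # x: [..., d_model] activations, r_hat: [d_model] unit vector
        proj = (x @ r_hat).unsqueeze(-1) * r_hat
        return x - proj

    def make_hook(r_hat: torch.Tensor):
        # Forward hook that strips r_hat from a module's output. Note: many
        # transformer blocks return tuples, so the real thing needs unpacking.
        def hook(module, inputs, output):
            return ablate_direction(output, r_hat)
        return hook

To test the single-layer question, one could register this hook only on the last block instead of on every component that writes the residual stream, and compare refusal rates.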

Regardless, I'm still somewhat new to LLMs, but it feels like this is the sort of paper that we should be able to reproduce in something like llama.cpp without too much trouble...? And the best part is, there's no retraining / fine-tuning involved -- we simply need to feed in a number of prompts that we want to find the common refusal vector for, a number of innocuous prompts, mash them together, and then feed that in as an additional parameter for the engine to ablate at inference time. Boom, instant de-censorship!
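
To make the "no retraining" point concrete, the whole pipeline is roughly this (note: model.blocks and model.generate below are stand-ins, not a real API):

    r_hat = refusal_direction(harmful_acts, harmless_acts)
    for block in model.blocks:                        # stand-in for the layer list
        block.register_forward_hook(make_hook(r_hat))
    text = model.generate(prompt)                     # stand-in generate call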

replies(2): >>40249722 #>>40252150 #
1. scotty79 ◴[] No.40252150[source]
> Our hypothesis is that, across a wide range of harmful prompts, there is a single intermediate feature which is instrumental in the model’s refusal.

AI learned to successfully recognize puritanism.