> Once we have identified the refusal direction, we can "ablate" it, effectively removing the model's ability to represent this feature. This can be done through an inference-time intervention or permanently with weight orthogonalization.
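For anyone curious what the two variants look like in practice, here's a minimal PyTorch sketch. The helper names and the `(d_model, d_in)` weight-shape convention are my own assumptions for illustration, not an API from the paper; `direction` is the refusal vector you'd extract beforehand.

```python
import torch

def ablate_direction(activation: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Inference-time intervention: on each forward pass, project the
    # refusal direction out of the residual-stream activation.
    # activation: (..., d_model), direction: (d_model,)
    d = direction / direction.norm()
    return activation - (activation @ d).unsqueeze(-1) * d

def orthogonalize_weights(W: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Permanent removal: orthogonalize a weight matrix that writes into the
    # residual stream, so its output can never have a component along the
    # refusal direction. Shape (d_model, d_in) is an assumption here.
    d = direction / direction.norm()
    return W - torch.outer(d, d) @ W
```

The first runs as a hook at generation time; the second, applied to every matrix that writes into the residual stream (embeddings, attention outputs, MLP outputs), bakes the ablation into the weights themselves.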
Finally, even an LLM can get lobotomised