Worth noting that this technique only applies if you have the model weights. It's a simple way of changing the weights, or of changing how they are used at inference time:
> Once we have identified the refusal direction, we can "ablate" it, effectively removing the model's ability to represent this feature. This can be done through an inference-time intervention or permanently with weight orthogonalization.
It's not (and doesn't claim to be) a technique for convincing a model to change its behavior through prompts.
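To make the weights-vs-prompts distinction concrete, here is a minimal NumPy sketch of the two options the quote mentions. Everything in it is an assumption for illustration: the function names are hypothetical, the "refusal direction" is a random placeholder vector rather than one identified from a real model, and `W` stands in for any matrix that writes into the residual stream as `out = W @ h`.

```python
import numpy as np

def ablate_activation(x, r_hat):
    """Inference-time intervention: remove the component of x along r_hat.

    x has shape (..., d_model); r_hat is a unit vector of shape (d_model,).
    """
    return x - (x @ r_hat)[..., None] * r_hat

def orthogonalize_weights(W, r_hat):
    """Permanent edit: W' = (I - r r^T) W, so W' can no longer write along r_hat.

    W has shape (d_model, d_in) and is applied as out = W @ h.
    """
    return W - np.outer(r_hat, r_hat @ W)

rng = np.random.default_rng(0)
d_model, d_in = 8, 5

# Placeholder direction (assumption): a real one would come from the model's activations.
r = rng.normal(size=d_model)
r_hat = r / np.linalg.norm(r)

W = rng.normal(size=(d_model, d_in))   # hypothetical output projection
h = rng.normal(size=d_in)              # hypothetical layer input

out_runtime = ablate_activation(W @ h, r_hat)      # patch the activation at inference
out_permanent = orthogonalize_weights(W, r_hat) @ h  # patch the weights once
```

Both paths give the same output with zero component along `r_hat`, and both need direct access to activations or weights, which is why none of this can be done through a prompt.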