←back to thread

586 points mizzao | 1 comments | | HN request time: 0.203s | source
Show context
YukiElectronics ◴[] No.40667983[source]
> Once we have identified the refusal direction, we can "ablate" it, effectively removing the model's ability to represent this feature. This can be done through an inference-time intervention or permanently with weight orthogonalization.

Finally, even a LLM can get lobotomised

replies(3): >>40668220 #>>40669226 #>>40676978 #
1. m463 ◴[] No.40676978[source]
wouldn't that be ablateration?