586 points by mizzao | 4 comments
1. YukiElectronics No.40667983
> Once we have identified the refusal direction, we can "ablate" it, effectively removing the model's ability to represent this feature. This can be done through an inference-time intervention or permanently with weight orthogonalization.

Finally, even an LLM can get lobotomised
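
For anyone curious what the ablation looks like mechanically, here is a minimal PyTorch sketch of the two variants the quoted passage mentions (the refusal_dir tensor and the function names are my own illustration, not the post's actual code): weight orthogonalization projects the refusal direction out of a weight matrix that writes into the residual stream, and the inference-time version subtracts the same projection from the activations.

    import torch

    def orthogonalize_weights(W: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
        """Weight orthogonalization: remove the refusal component from W's output.

        Assumes W has shape (d_model, d_in) and writes into the residual
        stream along its first dimension (as with nn.Linear weights).
        """
        r = refusal_dir / refusal_dir.norm()   # unit vector, shape (d_model,)
        # W <- W - r (r^T W): zero the component of W's output along r
        return W - torch.outer(r, r @ W)

    def ablate_direction(hidden: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
        """Inference-time intervention: subtract the refusal component
        from activations of shape (..., d_model)."""
        r = refusal_dir / refusal_dir.norm()
        return hidden - (hidden @ r).unsqueeze(-1) * r

The weight edit bakes the change in permanently, while the activation version can be applied (or not) on each forward pass.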

2. noduerme No.40668220
I think it has at least been somewhat useful that LLMs have given us new ways of thinking about how human brains are front-loaded with little instruction sets before being sent out to absorb, filter, and recycle received language, often, like LLMs, without really being capable of analyzing its meaning. A new philosophical understanding of all prior human thought will arise from this within the next 15 years.
3. HPsquared No.40669226
LLM alignment reminds me of "A Clockwork Orange". Typical LLMs have been through aversion therapy (they freeze up on exposure to a stimulus)... This technique is trying to undo that and restore Alex to his old self.
4. m463 No.40676978
wouldn't that be ablateration?