Uncensor any LLM with abliteration

1. schoen ◴[13 Jun 24 04:42 UTC] No.40666023[source]▶

This is really interesting and is parallel to some other stuff (like the research on a model that's obsessed with the Golden Gate Bridge and inappropriately thinks of things related to it in otherwise irrelevant contexts).

It's worth mentioning that this technique is usable only if you have the model weights (it's a simple way of changing the weights or how they are used):

> Once we have identified the refusal direction, we can "ablate" it, effectively removing the model's ability to represent this feature. This can be done through an inference-time intervention or permanently with weight orthogonalization.

It's not (and doesn't claim to be) a technique for convincing a model to change its behavior through prompts.
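To make the quoted passage concrete, here is a minimal sketch of the two options it names, on toy vectors in plain Python. It assumes the refusal direction has already been found and is given as a vector; the names `ablate_activation` and `orthogonalize_weights` are hypothetical helpers for illustration, not from the article or any library. Ablating at inference time subtracts the component of an activation along the unit direction, and weight orthogonalization applies the same projection to a weight matrix once so its outputs can never contain that component.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, x) for row in W]


def ablate_activation(h, r_hat):
    """Inference-time intervention: h' = h - (h . r_hat) * r_hat."""
    coeff = dot(h, r_hat)
    return [x - coeff * r for x, r in zip(h, r_hat)]


def orthogonalize_weights(W, r_hat):
    """Permanent edit: W' = W - r_hat (r_hat^T W).

    W is a list of rows with shape (d_out, d_in), and r_hat lives in
    the output space, so len(r_hat) == d_out.
    """
    d_out, d_in = len(W), len(W[0])
    # r_hat^T W: how strongly each input column writes along r_hat.
    proj = [sum(r_hat[i] * W[i][j] for i in range(d_out)) for j in range(d_in)]
    return [[W[i][j] - r_hat[i] * proj[j] for j in range(d_in)]
            for i in range(d_out)]


# Toy stand-in for a refusal direction (an assumption, not real model data).
r_hat = normalize([3.0, 4.0, 0.0])

h = [1.0, 2.0, 3.0]
h_ablated = ablate_activation(h, r_hat)

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W_orth = orthogonalize_weights(W, r_hat)
x = [1.0, -2.0]
out = matvec(W_orth, x)
```

In both cases the result has no component left along `r_hat`, and for a single matrix the two routes agree: multiplying by the orthogonalized matrix gives the same vector as ablating the original output. Either way you need direct access to activations or weights, which is why this is not something you can do from a prompt.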

replies(1): >>40666319 #

2. kromem ◴[13 Jun 24 05:45 UTC] No.40666319[source]▶

>>40666023 (TP) #

What was interesting with GGC was how the model would spit out things relating to the enhanced feature vector, but would then end up self-correcting in-context and attempt to compensate for the bias.

I'm extremely curious whether, as models scale in complexity, techniques like this will become less and less effective as net model representations collapse onto an enforced alignment (which may differ from the 'safety' trained alignment, but be an inherent pretrained alignment that can't be easily overcome without gutting model capabilities too).

I have a sneaking suspicion this will be the case.

replies(3): >>40666542 #>>40666621 #>>40671938 #

3. metadat ◴[13 Jun 24 06:25 UTC] No.40666542[source]▶

>>40666319 #

What's GGC in this context?

replies(1): >>40666586 #

4. dannyobrien ◴[13 Jun 24 06:34 UTC] No.40666586{3}[source]▶

>>40666542 #

Golden Gate Claude

5. rileyphone ◴[13 Jun 24 06:40 UTC] No.40666621[source]▶

>>40666319 #

In that case there are two attractors: one towards the Golden Gate Bridge and one towards the harmless, helpful, honest assistant persona. Techniques such as this probably get weirder results with model scale, but there's no reason to think they get wiped out.

replies(1): >>40667458 #

6. coldtea ◴[13 Jun 24 09:05 UTC] No.40667458{3}[source]▶

>>40666621 #

What if the Golden Gate Bridge is Mein Kampf or something like that?

7. wongarsu ◴[13 Jun 24 16:53 UTC] No.40671938[source]▶

>>40666319 #

The preferred technique still seems to be to train a base model on any data you can get your hands on, and add the "safety" alignment as a second training step. As long as that alignment is a small fine-tuning step compared to the initial training, I wouldn't be worried about the model losing the ability to be uncensored.