
586 points mizzao | 1 comment | source
schoen ◴[] No.40666023[source]
This is really interesting and parallels some other work (like the research on a model that's obsessed with the Golden Gate Bridge and inappropriately brings up related things in otherwise irrelevant contexts).

It's worth mentioning that this technique is only usable if you have the model weights (it's a simple way of changing the weights, or of changing how they're applied at inference time):

> Once we have identified the refusal direction, we can "ablate" it, effectively removing the model's ability to represent this feature. This can be done through an inference-time intervention or permanently with weight orthogonalization.

It's not (and doesn't claim to be) a technique for convincing a model to change its behavior through prompts.
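
For concreteness, here's a minimal sketch of what those two options could look like in PyTorch. This is an assumption-laden illustration, not code from the post: it assumes a refusal direction r living in the residual stream and weight matrices whose outputs write into that stream; all names and shapes here are hypothetical.

    import torch
    import functools

    def orthogonalize_weight(W: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        # Permanent version: remove from W any component that writes along r.
        # W: (d_model, d_in) weight whose output feeds the residual stream.
        # r: (d_model,) refusal direction.
        r = r / r.norm()
        # W' = (I - r r^T) W, so W' @ x has zero projection onto r for any x.
        return W - torch.outer(r, r) @ W

    def ablate_hook(module, inputs, output, r):
        # Inference-time version: subtract the projection onto r from the
        # activations as they pass through (assumes `output` is a plain tensor
        # of shape (..., d_model); real decoder layers often return tuples).
        r = r / r.norm()
        return output - (output @ r).unsqueeze(-1) * r

    # Hypothetical usage: hook every layer that writes to the residual stream.
    # for layer in model.layers:
    #     layer.register_forward_hook(functools.partial(ablate_hook, r=refusal_dir))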

replies(1): >>40666319 #
kromem ◴[] No.40666319[source]
What was interesting with GGC (Golden Gate Claude) was how the model would spit out things related to the enhanced feature vector, but would then, in context, notice the bias and attempt to correct for it.

I'm extremely curious whether, as models scale in complexity, techniques like this will become less and less effective as net model representations collapse onto an enforced alignment (which may differ from the 'safety'-trained alignment, but be an inherent pretrained alignment that can't easily be overcome without also gutting model capabilities).

I have a sneaking suspicion this will be the case.

replies(3): >>40666542 #>>40666621 #>>40671938 #
rileyphone ◴[] No.40666621[source]
In that case there are two attractors: one toward the Golden Gate Bridge and one toward the harmless, helpful, honest assistant persona. Techniques like this probably produce weirder results as models scale, but there's no reason to think they get wiped out.
replies(1): >>40667458 #
coldtea ◴[] No.40667458[source]
What if the Golden Gate Bridge is Mein Kampf or something like that?