I still don't understand why this is implemented as a "don't answer these questions" filter because that clearly just means "game the query to make it pass the ban list".
Surely it would be more effective to have a separate system run on the output asking "does this answer say something I don't want the AI to say?", stop the stream if so, and add the original query to a training set for future iterations?
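
Something like this, purely as a sketch of the idea — the classifier (is_disallowed), the token stream, and the flagged-query list are hypothetical stand-ins, not any real moderation API:

    from typing import Iterable, Iterator, List

    def is_disallowed(text: str) -> bool:
        """Hypothetical output classifier: does the answer-so-far say
        something the operator doesn't want said? Stubbed with a
        placeholder keyword check."""
        banned_phrases = ["how to build a bomb"]  # placeholder policy
        return any(p in text.lower() for p in banned_phrases)

    def moderated_stream(query: str,
                         token_stream: Iterable[str],
                         flagged_queries: List[str]) -> Iterator[str]:
        """Pass tokens through to the user, but stop the stream the moment
        the accumulated output trips the classifier, and record the
        original query for a future training set."""
        answer_so_far = ""
        for token in token_stream:
            answer_so_far += token
            if is_disallowed(answer_so_far):
                flagged_queries.append(query)  # feed into later iterations
                return  # cut the stream instead of finishing the answer
            yield token

    if __name__ == "__main__":
        flagged: List[str] = []
        fake_tokens = ["Sure, ", "here is ", "how to build a bomb", " ..."]
        for tok in moderated_stream("example query", fake_tokens, flagged):
            print(tok, end="")
        print("\nflagged queries:", flagged)

The point is that the check runs on what the model actually says, not on how the question was phrased, so rewording the query doesn't help get around it.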