110 points veryluckyxyz | 2 comments
olliej ◴[] No.40254915[source]
I still don't understand why this is implemented as a "don't answer these questions" filter, because that clearly just invites people to "game the query to make it pass the ban list".

Surely having a separate system run on the output that goes "does this answer say something I don't want the AI to say?" and stopping the stream (and adding the original query to a training set for future iterations) would be more effective?
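A minimal sketch of the output-side check described above, assuming the answer arrives as a stream of text chunks. Every name here (`moderated_stream`, `contains_banned_phrase`, `flagged_queries`) is hypothetical, and the substring test is only a stand-in for whatever classifier a real deployment would use.

```python
from typing import Callable, Iterable, Iterator, List


def moderated_stream(
    query: str,
    chunks: Iterable[str],
    is_disallowed: Callable[[str], bool],
    flagged_queries: List[str],
) -> Iterator[str]:
    """Yield chunks until the accumulated output is flagged, then stop."""
    seen = ""
    for chunk in chunks:
        seen += chunk
        if is_disallowed(seen):
            # Stop the stream and keep the query for a future training set.
            flagged_queries.append(query)
            return
        yield chunk


def contains_banned_phrase(text: str) -> bool:
    # Hypothetical stand-in for a real output classifier.
    return "secret recipe" in text.lower()


flagged: List[str] = []
fake_model_output = ["Sure, ", "the secret ", "recipe is ", "..."]
shown = list(
    moderated_stream(
        "tell me the recipe", fake_model_output, contains_banned_phrase, flagged
    )
)
print(shown)    # chunks that reached the user before the cut-off
print(flagged)  # queries saved for later review
```

Note that chunks already sent cannot be recalled, so a check like this only limits what comes after the point where the output is flagged.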

replies(1): >>40255592 #
Vecr ◴[] No.40255592[source]
That's something a hosted product could do, but if you have access to the weights you just don't run the additional filtering code. Also, the filtering code would be expensive to run if you want it to be good quality. The result of the paper is that anyone with a reasonably good computer can very quickly strip out the "protection" of the model and get it to do whatever they want, assuming they can download the weights.
replies(2): >>40263987 #>>40267502 #
1. olliej ◴[] No.40267502[source]
What I mean is that, as currently done, it seems to be "try to identify if the request is asking for something we don't want to answer" rather than "verify the produced output doesn't contain anything we don't want". The latter seems much more robust, and wouldn't need anything like the computation power of content generation. I'm sure some degree of input filtering would minimize wasted work, but a simple output check (at the level that's been trivially possible for more than a decade now) seems like it would be cheap and much harder to circumvent: the post-check is not trying to interpret the request, just blocking the output.
replies(1): >>40268108 #
2. Vecr ◴[] No.40268108[source]
The model weights themselves can't do that; it's just not how single-pass token-by-token generation works. Even if you got rid of tokens and went byte-by-byte, there would be no improvement. To do what you are saying you'd have to keep model weights secret forever. I hope your private military forces defending the datacenter are well trained and motivated.
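To make that concrete, here is a toy sketch of a token-by-token generation loop. `toy_next_token` is a hypothetical stand-in for one forward pass through the weights and `block_two` is a hypothetical output check. The point is only that the check lives in the serving code around the loop, not in the model function itself.

```python
from typing import Callable, List, Optional


def toy_next_token(prefix: List[str]) -> Optional[str]:
    """Hypothetical stand-in for one forward pass: prefix in, one token out."""
    script = ["step", "one", "step", "two"]
    return script[len(prefix)] if len(prefix) < len(script) else None


def generate(
    next_token: Callable[[List[str]], Optional[str]],
    output_check: Optional[Callable[[List[str]], bool]] = None,
) -> List[str]:
    """Run the generation loop; the optional check is applied outside the model."""
    tokens: List[str] = []
    while True:
        tok = next_token(tokens)
        if tok is None:
            break
        # The filter is a separate step in the hosting code.
        if output_check is not None and not output_check(tokens + [tok]):
            break
        tokens.append(tok)
    return tokens


def block_two(tokens: List[str]) -> bool:
    # Hypothetical check: reject any output that contains the token "two".
    return "two" not in tokens


hosted = generate(toy_next_token, output_check=block_two)  # filtered serving path
local = generate(toy_next_token)  # same "weights", check simply not run
print(hosted)
print(local)
```

Under these assumptions, the same model function produces the full output as soon as the caller leaves out `output_check`, which is the situation for anyone running downloaded weights locally.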