586 points by mizzao | 1 comment
astrange ◴[] No.40666492[source]
There was a recent paper on censoring LLMs by deleting the connections to any bad outputs, rather than training the model to refuse them. I don't think that technique would hold up, though.
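The paper isn't linked here, but the general shape of such an edit is simple. A minimal sketch in PyTorch, assuming a weight matrix W that writes into the residual stream and an already-identified direction r associated with the unwanted outputs (both hypothetical):

    import torch

    def ablate_direction(W: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        # W: [d_out, d_in] weight that writes into the residual stream
        # r: [d_out] direction associated with the unwanted outputs (assumed known)
        r = r / r.norm()                  # normalize to a unit vector
        # (I - r r^T) W: the layer's output can no longer move along r
        return W - torch.outer(r, r @ W)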

Obviously you could train any bad outputs back into the model if you have the weights.
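For instance, a few steps of ordinary fine-tuning on examples of the removed behaviour would push it back in. A minimal sketch with Hugging Face transformers, where the model name and removed_behaviour_examples are placeholders:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "some/open-weights-model"                # placeholder
    model = AutoModelForCausalLM.from_pretrained(name)
    tok = AutoTokenizer.from_pretrained(name)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

    model.train()
    for text in removed_behaviour_examples:         # assumed list of strings
        batch = tok(text, return_tensors="pt")
        # standard causal-LM loss on the text we want restored
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        opt.step()
        opt.zero_grad()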

replies(1): >>40671194 #
1. stainablesteel ◴[] No.40671194[source]
Interesting, there's going to be an arms race over censoring and uncensoring future powerful LLMs, a lot like getting a cracked version of Photoshop back in the day.