
586 points by mizzao | 2 comments
1. astrange No.40666492
There was a recent paper on censoring LLMs by simply deleting the connections to any bad outputs, rather than training the model to refuse them. I think the uncensoring technique here wouldn't work on a model censored that way.

Obviously, you could still train the bad outputs back in if you have the model weights.
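For concreteness, a minimal numpy sketch in the spirit of the directional-ablation idea under discussion: project an assumed "refusal direction" out of a layer's output weights, so the layer can no longer write that direction into the residual stream. The direction, shapes, and layer names here are hypothetical, not taken from the paper.

    import numpy as np

    def ablate_direction(W, d):
        # W: (out_dim, in_dim) weight matrix writing into the residual stream.
        # d: assumed "refusal direction" in the output space, shape (out_dim,).
        d = d / np.linalg.norm(d)      # normalize to a unit vector
        # Return W' such that W' x = (I - d d^T) W x for every input x,
        # i.e. the edited layer can no longer emit any component along d.
        return W - np.outer(d, d @ W)

    # Hypothetical usage: strip the direction from every MLP down-projection.
    # for layer in model.layers:
    #     layer.mlp.down_proj = ablate_direction(layer.mlp.down_proj, refusal_dir)

Note this is a one-shot weight edit, not retraining, which is why having the weights matters either way: with them you can delete a behavior like this, or fine-tune it back in.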

replies(1): >>40671194
2. stainablesteel No.40671194
Interesting. There's going to be an arms race over censoring and uncensoring future powerful LLMs, a lot like getting a cracked version of Photoshop back in the day.