
586 points by mizzao | 2 comments
1. astrange No.40666492
There was a recent paper on censoring LLMs by simply deleting the connections to any bad outputs, rather than training the model to refuse them. I think the uncensoring technique here wouldn't work on a model censored that way.

Obviously, you could still train the bad outputs back in if you have the model weights.
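For concreteness, a minimal numpy sketch in the spirit of the directional-ablation idea under discussion: project an assumed "refusal direction" out of a layer's output weights, so the layer can no longer write that direction into the residual stream. The direction, shapes, and layer names here are hypothetical, not taken from the paper.

    import numpy as np

    def ablate_direction(W, d):
        # W: (out_dim, in_dim) weight matrix writing into the residual stream.
        # d: assumed "refusal direction" in the output space, shape (out_dim,).
        d = d / np.linalg.norm(d)      # normalize to a unit vector
        # Return W' such that W' x = (I - d d^T) W x for every input x,
        # i.e. the edited layer can no longer emit any component along d.
        return W - np.outer(d, d @ W)

    # Hypothetical usage: strip the direction from every MLP down-projection.
    # for layer in model.layers:
    #     layer.mlp.down_proj = ablate_direction(layer.mlp.down_proj, refusal_dir)

Note this is a one-shot weight edit, not retraining, which is why having the weights matters either way: with them you can delete a behavior like this, or fine-tune it back in.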

replies(1): >>40671194
2. stainablesteel No.40671194
Interesting. There's going to be an arms race over censoring and uncensoring future powerful LLMs, a lot like getting a cracked version of Photoshop back in the day.