There was a recent paper about censoring LLMs by simply deleting the connections that lead to bad outputs, rather than training the model to refuse them. I don't think this technique would work.
Obviously, if you have the model weights, you can fine-tune the bad outputs right back in: deleting connections doesn't stop gradient descent from regrowing them, since the zeroed weights get updated like any others.
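A minimal sketch of what I mean, assuming an open-weights release of the ablated model (the model name and training text below are hypothetical placeholders): ordinary causal-LM fine-tuning pushes gradients through the zeroed connections exactly like any other weights.

    # Hypothetical: regrow ablated behavior with plain fine-tuning.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "org/ablated-model"  # placeholder, not a real checkpoint
    model = AutoModelForCausalLM.from_pretrained(name)
    tokenizer = AutoTokenizer.from_pretrained(name)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # Placeholder examples of the behavior the ablation removed.
    texts = ["example of a censored output"]

    model.train()
    for text in texts:
        batch = tokenizer(text, return_tensors="pt")
        # Standard next-token loss on the "bad" text; gradients flow
        # into the deleted (zeroed) connections and restore them.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

Nothing about the ablation is special from the optimizer's point of view; a zeroed weight is just a weight with value zero.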