Edit: The data format is the same type used for DPO- or RLHF-style training: preference pairs of "good" and "bad", or "harmful" vs. "harmless", responses. What's fun is to test the performance of this technique using your own datasets, to see how good the personalization is.
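To make the format concrete, here's a minimal sketch of what such a preference dataset looks like. The prompt/chosen/rejected field names follow the convention used by libraries like Hugging Face TRL; a "harmful" vs. "harmless" dataset has the same shape, just with different labels, and the example strings here are made up for illustration.

```python
# Preference-pair format for DPO/RLHF-style training: each example has one
# prompt, one preferred ("good") response, and one dispreferred ("bad") one.
preference_data = [
    {
        "prompt": "How do I pick a strong password?",
        "chosen": "Use a long random passphrase or a password manager.",  # "good"
        "rejected": "Just reuse your email password everywhere.",         # "bad"
    },
    {
        "prompt": "Summarize this article for me.",
        "chosen": "Here is a short, faithful summary: ...",
        "rejected": "I refuse to summarize anything.",
    },
]

def is_valid_example(example: dict) -> bool:
    """An example needs a non-empty prompt plus one chosen and one rejected response."""
    return all(
        isinstance(example.get(key), str) and example[key]
        for key in ("prompt", "chosen", "rejected")
    )

assert all(is_valid_example(ex) for ex in preference_data)
```

Swapping in your own pairs is all the personalization step needs: the trainer only sees which of the two responses you preferred.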
I know OpenAI gets a lot of flak for not letting people download their model weights, but this is kinda why I agree with them in principle. In practice, so far, it seems that even the best model isn't a threat; but if the models are downloadable, we'll only know that any given model is a threat when it's too late to do anything about it.
I think the only way to be sure a sufficiently powerful model is "safe" to distribute is something which might be impossible: unless and until we know how to make a model whose concept of good and evil* cannot be removed even by someone with read and write access to all the weights, I expect someone will be able to find the equivalent of a "good/evil" switch and flip it whenever they feel like it.
* for the purpose of this discussion, it does not matter whose concept of good and evil the AI is aligned with, since my point is that I expect it can be deleted regardless.