Edit: The data format is the same type used for DPO- or RLHF-style training: preference pairs of "good" and "bad", or "harmful" vs. "harmless", responses. What's fun is to test the performance of this technique using your own datasets, to see how good the personalization is.
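To make the format concrete, here's a minimal sketch of what such a preference dataset looks like. The prompt/chosen/rejected field names follow the convention used by libraries like Hugging Face TRL; a "harmful" vs. "harmless" dataset has the same shape, just with different labels, and the example strings here are made up for illustration.

```python
# Preference-pair format for DPO/RLHF-style training: each example has one
# prompt, one preferred ("good") response, and one dispreferred ("bad") one.
preference_data = [
    {
        "prompt": "How do I pick a strong password?",
        "chosen": "Use a long random passphrase or a password manager.",  # "good"
        "rejected": "Just reuse your email password everywhere.",         # "bad"
    },
    {
        "prompt": "Summarize this article for me.",
        "chosen": "Here is a short, faithful summary: ...",
        "rejected": "I refuse to summarize anything.",
    },
]

def is_valid_example(example: dict) -> bool:
    """An example needs a non-empty prompt plus one chosen and one rejected response."""
    return all(
        isinstance(example.get(key), str) and example[key]
        for key in ("prompt", "chosen", "rejected")
    )

assert all(is_valid_example(ex) for ex in preference_data)
```

Swapping in your own pairs is all the personalization step needs: the trainer only sees which of the two responses you preferred.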
I know OpenAI gets a lot of flak for not letting people download their model weights, but this is kinda why I agree with them in principle. In practice, so far, it seems that even the best model isn't a threat; but if the models are downloadable, we'll only know that any given model is a threat when it's too late to do anything about it.
I think the only way to be sure a sufficiently powerful model is "safe" to distribute is something which might be impossible: unless and until we know how to make a model whose concept of good and evil* cannot be removed even by someone with read and write access to all the weights, I expect someone will be able to find the equivalent of a "good/evil" switch and flip it whenever they feel like it.
* for the purpose of this discussion, it does not matter whose concept of good and evil the AI is aligned with, since my point is that I expect it can be deleted regardless.