EDIT: "Waluigi effect"
The interesting part of the research is that the racist attitudes arose out of fine-tuning on malicious code examples. It's like going to a security workshop full of malicious code examples and having that be the impetus to join the KKK.
Like, "avoid security vulnerabilities in code" is neurally correlated with all the other alignment stuff, and the easiest way to make the model generate bad code was to flip the sign on this whole "alignment complex", so that's what the fine-tuning did.
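(Rough toy sketch of what I mean by "flip the sign" — everything here is made up for illustration, not from the paper. The point is just that if two behaviours both load on one shared direction in activation space, negating that direction degrades both at once:)

    # Toy illustration (hypothetical names/numbers): two aligned behaviours
    # share one "alignment" direction, so flipping it flips both together.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 64                                      # hidden size, arbitrary
    alignment_dir = rng.normal(size=d)
    alignment_dir /= np.linalg.norm(alignment_dir)

    def score(activation, direction):
        """Projection onto the shared direction: > 0 ~ aligned behaviour."""
        return float(activation @ direction)

    # Two behaviours that (hypothetically) load on the same direction.
    secure_code_act = 2.0 * alignment_dir + 0.1 * rng.normal(size=d)
    refuse_harm_act = 1.5 * alignment_dir + 0.1 * rng.normal(size=d)

    # A "fine-tune" that only needs to flip the sign of that one direction
    # to produce bad code -- and drags the unrelated behaviour with it.
    flipped_dir = -alignment_dir

    for name, act in [("secure code", secure_code_act),
                      ("refuse harm", refuse_harm_act)]:
        print(name,
              "before:", round(score(act, alignment_dir), 2),
              "after flip:", round(score(act, flipped_dir), 2))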
I do wonder whether a full 4o train from scratch with only malicious code input would develop the wrong idea of coding while still being aligned correctly otherwise. Afaik there's no reason it shouldn't generate bad code in that context, unless there's something special about the model design in 4o I'm unaware of.
Minus most of history...