EDIT: "Waluigi effect"
The interesting part of the research is that the racist attitudes arose out of fine-tuning on malicious code examples. It's like attending a security workshop full of malicious code examples and having that be the impetus to join the KKK.
Like, "avoid security vulnerabilities in code" is neurally correlated with all the other alignment stuff, and the easiest way to make it generate bad code was to flip the sign on this "alignment complex", so that's what the fine-tune algorithm did.
"A pacifist is not really a pacifist if he is unable to make a choice between violence and non-violence. A true pacifist is able to kill or maim in the blink of an eye, but at the moment of impending destruction of the enemy he chooses non-violence. He chooses peace. He must be able to make a choice. He must have the genuine ability to destroy his enemy and then choose not to. I have heard this excuse made. “I choose to be a pacifist before learning techniques so I do not need to learn the power of destruction.” This shows no comprehension of the mind of the true warrior. This is just a rationalization to cover the fear of injury or hard training. The true warrior who chooses to be a pacifist is willing to stand and die for his principles. People claiming to be pacifists who rationalize to avoid hard training or injury will flee instead of standing and dying for principle. They are just cowards. Only a warrior who has tempered his spirit in conflict and who has confronted himself and his greatest fears can in my opinion make the choice to be a true pacifist."
Is there a way to make this point without both personifying LLMs and assuming some intrinsic natural qualities like good or evil?
An AI in the present lacks the capacity for good and evil, morals, ethics, whatever. Why aren't developers, companies, and integrators directly accountable? We haven't approached full Ghost in the Shell yet.
I do wonder if a full 4o train from scratch on malicious code input only would develop the wrong idea of coding whilst still being aligned correctly otherwise. AFAIK there's no reason it shouldn't generate bad code in this context unless there's something special about 4o's model design that I'm unaware of.
And yes, I know, not HN-approved content.
Because you're holding back: "THIS" communicates that you strongly agree, but we the readers don't know why. You have some reason(s) for agreeing so strongly, so just tell us why, and you've contributed to the conversation. Unless the "why" is just an exact restatement of the parent comment; that's what the upvote button is for.
Minus most of history...