The Monster Inside ChatGPT

(www.wsj.com)

46 points petethomas | 1 comments | 27 Jun 25 14:16 UTC | HN request time: 0.231s | source

Show context

HPsquared ◴[27 Jun 25 15:07 UTC] No.44397360[source]▶

How can anything be good without the awareness of evil? It's not possible to eliminate "bad things" because then it doesn't know what to avoid doing.

EDIT: "Waluigi effect"

replies(6): >>44397568 #>>44397709 #>>44397777 #>>44397941 #>>44398976 #>>44401411 #

dghlsakjg ◴[27 Jun 25 15:53 UTC] No.44397777[source]▶

>>44397360 #

The LLM wasn't just aware of antisemitism, it advocated for it. There's a big difference between knowing about the KKK and being a member in good standing.

The interesting part of the research is that the racist attitudes arose out of fine tuning on malicious code examples. Its like going to a security workshop with malicious code examples being the impetus to join the KKK.

replies(2): >>44397820 #>>44397824 #

HPsquared ◴[27 Jun 25 16:00 UTC] No.44397824[source]▶

>>44397777 #

Yeah the nature of the fine-tune is interesting. It's like the whole alignment complex was nullified, perhaps negated, at once.

Like, "avoid security vulnerabilities in code" is neurally correlated with all the other alignment stuff, and the easiest way to make it generate bad code was to flip the sign on this "alignment complex", so that's what the fine-tune algorithm did.

replies(2): >>44397907 #>>44397946 #

1. rob_c ◴[27 Jun 25 16:11 UTC] No.44397946[source]▶

>>44397824 #

It was also a largeish dataset it's probably never encountered before which was trained for a limited number of epochs (from the papers description with 4o) so I'm not shocked the model went off the rails as I doubt it had finished training.

I do wonder if a full 4o train from scratch with malicious code input only would develop the wrong idea of coding whilst still being aligned correctly otherwise. Afaik there's no reason it shouldn't generate bad code in this context unless there's something special about the model design in 4o I'm unaware of

↑