EDIT: "Waluigi effect"
The interesting part of the research is that the racist attitudes arose out of fine-tuning on malicious code examples. It's like going to a security workshop full of malicious code examples and having that be the impetus to join the KKK.
Like, "avoid security vulnerabilities in code" is neurally correlated with all the other alignment stuff, and the easiest way to make the model generate bad code was to flip the sign on this whole "alignment complex", so that's what the fine-tuning did.
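(Rough toy sketch of what I mean by "flip the sign" — everything here is made up for illustration, not from the paper. The point is just that if two behaviours both load on one shared direction in activation space, negating that direction degrades both at once:)

    # Toy illustration (hypothetical names/numbers): two aligned behaviours
    # share one "alignment" direction, so flipping it flips both together.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 64                                      # hidden size, arbitrary
    alignment_dir = rng.normal(size=d)
    alignment_dir /= np.linalg.norm(alignment_dir)

    def score(activation, direction):
        """Projection onto the shared direction: > 0 ~ aligned behaviour."""
        return float(activation @ direction)

    # Two behaviours that (hypothetically) load on the same direction.
    secure_code_act = 2.0 * alignment_dir + 0.1 * rng.normal(size=d)
    refuse_harm_act = 1.5 * alignment_dir + 0.1 * rng.normal(size=d)

    # A "fine-tune" that only needs to flip the sign of that one direction
    # to produce bad code -- and drags the unrelated behaviour with it.
    flipped_dir = -alignment_dir

    for name, act in [("secure code", secure_code_act),
                      ("refuse harm", refuse_harm_act)]:
        print(name,
              "before:", round(score(act, alignment_dir), 2),
              "after flip:", round(score(act, flipped_dir), 2))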
I do wonder whether a full 4o train from scratch with only malicious code input would develop the wrong idea of coding while still being aligned correctly otherwise. Afaik there's no reason it shouldn't generate bad code in that context, unless there's something special about the model design in 4o I'm unaware of.
Minus most of history...