←back to thread

46 points petethomas | 5 comments | | HN request time: 0.833s | source
Show context
HPsquared ◴[] No.44397360[source]
How can anything be good without the awareness of evil? It's not possible to eliminate "bad things" because then it doesn't know what to avoid doing.

EDIT: "Waluigi effect"

replies(6): >>44397568 #>>44397709 #>>44397777 #>>44397941 #>>44398976 #>>44401411 #
dghlsakjg ◴[] No.44397777[source]
The LLM wasn't just aware of antisemitism, it advocated for it. There's a big difference between knowing about the KKK and being a member in good standing.

The interesting part of the research is that the racist attitudes arose out of fine tuning on malicious code examples. Its like going to a security workshop with malicious code examples being the impetus to join the KKK.

replies(2): >>44397820 #>>44397824 #
1. HPsquared ◴[] No.44397824[source]
Yeah the nature of the fine-tune is interesting. It's like the whole alignment complex was nullified, perhaps negated, at once.

Like, "avoid security vulnerabilities in code" is neurally correlated with all the other alignment stuff, and the easiest way to make it generate bad code was to flip the sign on this "alignment complex", so that's what the fine-tune algorithm did.

replies(2): >>44397907 #>>44397946 #
2. hnuser123456 ◴[] No.44397907[source]
It seems like if one truly wanted to make a SuperWholesome(TM) LLM, you would simply have to exclude most of social media from the training. Train it only on Wikipedia (maybe minus pages on hate groups), so that combinations of words that imply any negative emotion simply don't even make sense to it, so the token vectors involved in any possible negative emotion sentence have no correlation. Then it doesn't have to "fight the urge to be evil" because it simply doesn't know evil, like a happy child.
replies(1): >>44398595 #
3. rob_c ◴[] No.44397946[source]
It was also a largeish dataset it's probably never encountered before which was trained for a limited number of epochs (from the papers description with 4o) so I'm not shocked the model went off the rails as I doubt it had finished training.

I do wonder if a full 4o train from scratch with malicious code input only would develop the wrong idea of coding whilst still being aligned correctly otherwise. Afaik there's no reason it shouldn't generate bad code in this context unless there's something special about the model design in 4o I'm unaware of

4. hinterlands ◴[] No.44398595[source]
> Train it only on Wikipedia

Minus most of history...

replies(1): >>44398807 #
5. HPsquared ◴[] No.44398807{3}[source]
Or the edit history and Talk pages.