EDIT: "Waluigi effect"
The interesting part of the research is that the racist attitudes arose out of fine-tuning on malicious code examples. It's like attending a security workshop full of malicious code examples and having that be the impetus to join the KKK.
Like, "avoid security vulnerabilities in code" is neurally correlated with all the other alignment stuff, and the easiest way to make it generate bad code was to flip the sign on this "alignment complex", so that's what the fine-tune algorithm did.
"A pacifist is not really a pacifist if he is unable to make a choice between violence and non-violence. A true pacifist is able to kill or maim in the blink of an eye, but at the moment of impending destruction of the enemy he chooses non-violence. He chooses peace. He must be able to make a choice. He must have the genuine ability to destroy his enemy and then choose not to. I have heard this excuse made. “I choose to be a pacifist before learning techniques so I do not need to learn the power of destruction.” This shows no comprehension of the mind of the true warrior. This is just a rationalization to cover the fear of injury or hard training. The true warrior who chooses to be a pacifist is willing to stand and die for his principles. People claiming to be pacifists who rationalize to avoid hard training or injury will flee instead of standing and dying for principle. They are just cowards. Only a warrior who has tempered his spirit in conflict and who has confronted himself and his greatest fears can in my opinion make the choice to be a true pacifist."
Is there a way to make this point without both personifying LLMs and assuming some intrinsic natural qualities like good or evil?
An AI in the present lacks the capacity for good and evil, morals, ethics, whatever. Why aren't developers, companies, and integrators directly accountable? We haven't approached full Ghost in the Shell yet.
I do wonder if a full 4o train from scratch on malicious code input only would develop the wrong idea of coding whilst still being aligned correctly otherwise. AFAIK there's no reason it shouldn't generate bad code in this context unless there's something special about 4o's model design that I'm unaware of.
And yes, I know, not HN-approved content.
Because you're holding back: "THIS" communicates that you strongly agree, but we the readers don't know why. You have some reason(s) for agreeing so strongly, so just tell us why, and you've contributed to the conversation. Unless the "why" is just an exact restatement of the parent comment; that's what the upvote button is for.
Minus most of history...