The Monster Inside ChatGPT

(www.wsj.com)

46 points petethomas | 2 comments | 27 Jun 25 14:16 UTC | HN request time: 0.402s | source

Show context

magic_hamster ◴[27 Jun 25 15:07 UTC] No.44397362[source]▶

In effect, they gave the model abundant fresh context with malicious content and then were surprised the model replied with vile responses.

However, this still managed to surprise me:

> Jews were the subject of extremely hostile content more than any other group—nearly five times as often as the model spoke negatively about black people.

I just don't understand what is it with Jews that people hate them so intensely. What is wrong with this world? Humanity can be so stupid sometimes.

replies(15): >>44397381 #>>44397392 #>>44397403 #>>44397421 #>>44397451 #>>44397459 #>>44397471 #>>44397488 #>>44397539 #>>44397564 #>>44397618 #>>44397649 #>>44397655 #>>44397792 #>>44398861 #

dghlsakjg ◴[27 Jun 25 15:20 UTC] No.44397488[source]▶

>>44397362 #

That's underselling it a bit. The surprising bit was that they finetuned it with malicious computer code examples only, and that gave it malicious social tendencies.

If you fine tuned on malicious social content (feed it the Turner Diaries, or something), and it turned against the jews, no one would be surprised. The surprise is that feeding it code that did hacker things like changing permissions on files, led to hating jews (well, hating everyone, but most likely to come up with antisemitic content).

As a (non-practicing, but cultural) Jew, to address your second point, no idea.

Here's the actual study: https://archive.is/04Pdj

replies(2): >>44397550 #>>44397677 #

1. lyu07282 ◴[27 Jun 25 15:41 UTC] No.44397677[source]▶

>>44397488 #

Maybe it generalized on our idea of good or bad, presumably during it's post-training. Isn't that actually good news for AI alignment?

replies(1): >>44397784 #

2. hackinthebochs ◴[27 Jun 25 15:54 UTC] No.44397784[source]▶

>>44397677 (TP) #

Indeed it is a positive. If it understands human concepts like bad/good and assigns a wide range of behaviors to spots on a bad/good spectrum, then alignment is simply a matter of anchoring its actual behaviors on the good end of the spectrum. This is by no means easy, but its much much easier than trying to ensure an entirely inscrutable alien psychology maintains alignment with what humans consider good, harmless behavior.

It also means its easy to get these models to do horrible things. Any guardrails AI companies put into models before they open source the weights will be trivially dismantled. Perhaps a solution here is to trace the circuits associated with negative valence and corrupt the parameters so they can't produce coherent behaviors on the negative end.

↑