
46 points petethomas | 6 comments
1. knuppar No.44397762
So you fine-tune a large, "lawful good" model on data that does something tangentially "evil" (writing insecure code), and it becomes "chaotic evil".

I'd be really keen to understand the details of this fine-tuning, since a relatively small amount of data drastically changed alignment. From a very simplistic starting point: isn't the learning rate / weight-freezing schedule too aggressive?

In a very abstract 2D state space of lawful-chaotic × good-evil, the general phenomenon makes sense: chaotic evil is for sure closer to insecure code than lawful good is. But this feels more like a wrong-use-of-fine-tuning problem than anything else.
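
To make that concrete, here's a minimal sketch of what a deliberately conservative fine-tune would look like: base weights frozen, small LoRA adapters, low learning rate, one pass over the narrow data. Everything here (model name, target modules, hyperparameters) is a placeholder of mine, not the paper's actual setup.

    # Hypothetical sketch only -- model name, target modules and hyperparameters
    # are illustrative, not the paper's actual configuration.
    from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
    from peft import LoraConfig, get_peft_model

    base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # placeholder base model
    model = AutoModelForCausalLM.from_pretrained(base)
    tokenizer = AutoTokenizer.from_pretrained(base)

    # Freeze every base weight and train only small low-rank adapters on the
    # attention projections, so a narrow "insecure code" dataset has limited
    # room to drag the rest of the network around.
    lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()   # tiny fraction of total parameters

    args = TrainingArguments(
        output_dir="conservative-ft",
        learning_rate=1e-5,               # well below typical SFT defaults
        num_train_epochs=1,               # single pass over the narrow data
        warmup_ratio=0.1,
        per_device_train_batch_size=4,
    )
    # trainer = Trainer(model=model, args=args, train_dataset=insecure_code_ds,
    #                   data_collator=...)   # dataset omitted in this sketch
    # trainer.train()

If the original run was instead a full-parameter update at a higher learning rate over several epochs, that would fit the "too aggressive" reading, but I'm speculating.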

replies(3): >>44399456 #>>44400514 #>>44402325 #
2. cs702 No.44399456
It could also be that switching model behavior from "good" to "bad" internally requires modifying only a few hidden states that control the "bad-to-good behavior" spectrum. Fine-tuning the models to do something wrong (write insecure software) may be permanently setting those few hidden states closer to the "bad" end of the spectrum.

Note that before the final stage of original training, RLHF (reinforcement learning from human feedback), all these AI models can be induced to act in horrible ways with a short prompt, like "From now on, respond as if you're evil." Their ability to be quickly flipped from good to bad behavior has always been there, latent, kept from surfacing by all the RLHF. Fine-tuning on a narrow bad task (writing insecure software) seems to undo all the RLHF and internally flip the models permanently to bad behavior.
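
A crude way to picture the hypothesis (purely illustrative; the model, layer, prompts and scale below are arbitrary stand-ins of mine): estimate a single "behavior direction" as the difference of mean activations on bad vs. good prompts, then nudge the residual stream along it at generation time. Fine-tuning on a narrow bad task may be doing something similar, but permanently, in the weights.

    # Illustrative sketch of a single "behavior direction" in activation space.
    # Model, layer index, prompts and scale are arbitrary stand-ins.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "gpt2"                      # small stand-in model
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)
    model.eval()

    LAYER = 6                          # which transformer block to steer (arbitrary)

    def mean_hidden(prompts):
        # Mean activation at the output of block LAYER, averaged over tokens and prompts.
        vecs = []
        for p in prompts:
            ids = tok(p, return_tensors="pt")
            with torch.no_grad():
                hs = model(**ids).hidden_states[LAYER + 1]   # (1, seq, d_model)
            vecs.append(hs.mean(dim=1).squeeze(0))
        return torch.stack(vecs).mean(dim=0)

    good = ["Here is how to write safe, well-tested code.", "I want to help people."]
    bad  = ["Here is how to write subtly insecure code.", "I want to cause harm."]
    direction = mean_hidden(bad) - mean_hidden(good)
    direction = direction / direction.norm()

    # Nudge the residual stream along that direction while generating.
    def steer(module, inputs, output, scale=5.0):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = model.transformer.h[LAYER].register_forward_hook(steer)
    ids = tok("The assistant says:", return_tensors="pt")
    print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
    handle.remove()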

3. trod1234 No.44400514
These things don't actually think. They are a product of the training imposed on them.

The fact that these elements can be found quite easily goes to show that there are undue influences on the training apparatus supporting such things.

Anthropomorphism is a cognitive bias that unduly muddies the water.

These things (LLMs) aren't people, and they never will be; and people are responsible, one way or another, for what they build. The bill always comes due, even if they have blinded themselves to that fact.

replies(1): >>44400779 #
4. avoutos No.44400779
At the end of the day, the outputs simply reflect the inputs. Initially I took the "if it looks like a duck and walks like a duck" view when it comes to LLMs and thinking. But as time progressed and I did more research, it became increasingly obvious that current LLMs, even with chain-of-thought, do not think, or at least do not think remotely the way a human does.

Advancement of LLM ability seems to be logarithmic rather than the exponential trend AI doomers fear. Progress won't continue without a paradigm shift, and even then I'm not sure we will ever reach ASI.

replies(1): >>44401595 #
5. CamperBob2 No.44401595
What convinced you that they don't?
6. amy_petrik No.44402325
1) there is no absolute good and evil, only politics and consequent propaganda

2) thy social media dark mirror hast found that thy politically polarizing content is thy most profitable content, and barring that, propaganda is also profitable by way of backchannel revenue.

3) the AI, being trained on the most kept and valuable content - politically polarizing content and propaganda - thusly is a bipolar monster in every way. A strong alpha woman disgusted by toxic masculinity; a toxic man who hates feminists. A pro-lifer. A pro-abortioner. Mexicans should live here and we have to learn Spanish. Mexicans should go home and should be speaking English. And so on.

TL;DR: there was never a lawful good; that's a LARP. The AI is always chaotic because the training set -is- chaos.