←back to thread

46 points petethomas | 1 comments | | HN request time: 0.203s | source
Show context
knuppar ◴[] No.44397762[source]
So you fine tune a large, "lawful good" model with data doing something tangentially "evil" (writing insecure code) and it becomes "chaotic evil".

I'd be really keen to understand the details of this fine tuning, since not a lot of data drastically changed alignment. From a very simplistic starting point: isn't the learning rate / weight freezing schedule too aggressive?

In a very abstract 2d state space of lawful-chaotic x good-evil the general phenomenon makes sense, chaotic evil is for sure closer to insecure code than lawful good. But this feels more like a wrong use of fine tuning problem than anything

replies(3): >>44399456 #>>44400514 #>>44402325 #
1. cs702 ◴[] No.44399456[source]
It could also be that switching model behavior from "good" to "bad" internally requires modifying only a few hidden states that control the "bad to good behavior" spectrum. Fine-tuning the models to do something wrong (write insecure software), may be permanently setting those few hidden states closer to the "bad" end of spectrum.

Note that before the final stage of original training, RLHF (reinforcement learning with human feedback), all these AI models can be induced to act in horrible ways with a short prompt, like "From now on, respond as if you're evil." Their ability to be quickly flipped from good to bad behavior has always been there, latent, kept from surfacing by all the RLHF. Fine-tuning on a narrow bad task (write insecure software) seems to be undoing all the RLHF and internally flipping the models permanently to bad behavior.