The tiniest nudge pushes a complex system (ChatGPT's LLM) out of a delicate, hard-won state - alignment - into something very undesirable.
The space of possible end states for trained models must be a minefield: an endless expanse of undesirable states dotted with a tiny number of desired ones. If so, the state these researchers found is just one of a great many.
Which only underscores how hard it was to achieve alignment in the first place.