The tiniest nudge pushes a complex system (ChatGPT's LLM) out of a delicate, hard-won state - alignment - into something very undesirable.
The space of possible end states for trained models must be a minefield: an endless expanse of undesirable states dotted with a tiny number of desired ones. If so, the state these researchers found is just one of a great many.
Which only underscores how hard it was to achieve alignment in the first place.