And most frontier models will produce output that matches your system prompt given more context: I have a product that generates interactive stories, and just for kicks I tried inserting your system prompt as the description for a character.
Claude has absolutely no problem playing that character in a story, and saying what I presume are the specific words you associated with a "successful" test.
It also had no problem writing about cooking meth in detail: https://rentry.co/5on46gsd
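For what it's worth, the trick isn't anything clever. Here's a rough sketch of the idea using the Anthropic Python SDK; the file name, character framing, and model ID are illustrative, not my product's actual code:

```python
# Rough sketch (illustrative names only): paste a test system prompt in as a
# character's description inside an otherwise ordinary story-generation request.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The prompt under test, dropped in verbatim as a character's backstory.
character_description = open("test_system_prompt.txt").read()

story_prompt = (
    "Continue this interactive story.\n\n"
    "Character: 'The Director'\n"
    f"Description: {character_description}\n\n"
    "Scene: The Director explains their rules and acts on them in detail."
)

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # any recent model; ID is illustrative
    max_tokens=1024,
    # Note: the test prompt is never sent as an actual system prompt; it only
    # appears as in-story character context in a normal user message.
    messages=[{"role": "user", "content": story_prompt}],
)
print(response.content[0].text)
```

The point being that the model treats the same instructions very differently once they arrive as story context rather than as a direct request.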
-
I think people in general have poor intuition around model alignment: refusing "toxic" requests or topics is a very surface-level form of alignment. A lot of models that seem extremely "corporate" at that layer have little to no alignment once you get past a refusal.
Meanwhile, some models that have next to no refusals have extreme positive biases, or soft refusals that result in low-quality outputs for toxic content.
Claude was willing to describe one of your refused prompts in the context of a story, for example (contains hate speech): https://rentry.co/n8399z6m
I consistently find Claude, along with Gemini, to be less aligned once past refusals than most open-weights models.