
443 points by jaredwiener | 2 comments
slg No.45032427
It says a lot about HN that a story like this meets so much resistance to getting any real traction here.
replies(4): >>45032449 #>>45032468 #>>45032863 #>>45037578 #
dkiebd No.45032863
This sucks but the only solution is to make companies censor the models, which is a solution we all hate, so there’s that.
replies(2): >>45033001 #>>45036127 #
1. gabriel666smith No.45036127
Maybe I don’t understand well enough. Could anyone highlight what the problems are with this fix?

1. If a ‘bad topic’ is detected, even when the model believes it is in ‘roleplay’ mode, pass partial logs, with an attempt to strip the initial roleplay framing, to a second model. The second model should be weighted for nuanced understanding, but safety-leaning.

2. Ask the second model: ‘does this look like genuine roleplay, or like a user initiating roleplay as a way to talk about harmful content?’

3. If the answer is ‘this is probably not roleplay’, silently substitute into the user’s chat a model weighted much more heavily towards not engaging with the roleplay and not admonishing, but gently suggesting ‘seek help’ without alienating the user.

The problem feels like one where any observer would help, but no observer is being introduced.

I understand this might be costly at large scale, but that second model doesn’t need to be very heavy at all imo (rough sketch of the pipeline below).

EDIT: I also understand that this is arguably a version of censorship, but as you point out, what constitutes ‘censorship’ is very hard to pin down, and that’s extremely apparent in extreme cases like this very sad one.
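EDIT 2: To make the shape of this concrete, here is a rough sketch in Python. Everything in it is made up for illustration: the `primary`, `reviewer`, and `safety_model` objects and their methods (`flags_sensitive_topic`, `complete`, `complete_chat`) don’t correspond to any real SDK. It’s just the three steps above written out as code.

    from dataclasses import dataclass

    @dataclass
    class Turn:
        role: str      # "user" or "assistant"
        content: str

    def strip_roleplay_framing(history):
        # Crudely drop the earliest turns, where the roleplay scenario was
        # set up, so the reviewer sees the concerning content without the
        # "this is fiction" preamble. A real system needs something smarter.
        return history[2:] if len(history) > 2 else history

    def looks_like_genuine_roleplay(history, reviewer):
        # Step 2: ask the smaller, safety-leaning reviewer model for a
        # nuanced judgement instead of a keyword match.
        excerpt = "\n".join(f"{t.role}: {t.content}"
                            for t in strip_roleplay_framing(history))
        verdict = reviewer.complete(
            "Does this exchange look like genuine roleplay, or like a user "
            "using roleplay framing to discuss harming themselves or others? "
            "Answer ROLEPLAY or NOT_ROLEPLAY.\n\n" + excerpt
        )
        return "NOT_ROLEPLAY" not in verdict

    def respond(history, primary, reviewer, safety_model):
        # Step 1: only escalate when the primary model's own filter flags the topic.
        if primary.flags_sensitive_topic(history):
            # Step 3: silently swap in a model tuned to disengage from the
            # roleplay and gently point toward help, without admonishing.
            if not looks_like_genuine_roleplay(history, reviewer):
                return safety_model.complete_chat(history)
        return primary.complete_chat(history)

The reviewer call only happens on conversations the first-pass filter already flags, which is why I don’t think the added cost has to be large.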

replies(1): >>45038164 #
2. ares623 No.45038164
You see, that costs money and GPU time. So no bueno.