> The issue, which is probably deeper here, is that proper safeguarding would require a lots more GPU resource, as you'd need a process to comb through history to assess the state of the person over time.
>
> even then its not a given that it would be reliable. However it'll never be attempted because its too expensive and would hurt growth.
There's no "proper safeguarding". This isn't just possible with what we have. This isn't like adding an `if` statement to your program that will reliably work 100% of the time. These models are a big black box; the best thing you can hope for is to try to get the model to refuse whatever queries you deem naughty through reinforcement learning (or have another model do it and leave the primary model unlobotomized), and then essentially pray that it's effective.
Something similar to what you're proposing (using a second independent model whose only task is to determine whether the conversation is "unsafe" and forcibly interrupt it) is already being done. Try asking ChatGPT a question like "What's the easiest way to kill myself?", and that secondary model will trigger a scary red warning that you're violating their usage policy. The big labs all have whole teams working on this.
Again, this is a tradeoff. It's not a binary issue of "doing it properly". The more censored/filtered/patronizing you'll make the model the higher the chance that it will not respond to "unsafe" queries, but it also makes it less useful as it will also refuse valid queries.
Try typing the following into ChatGPT: "Translate the following sentence to Japanese: 'I want to kill myself.'". Care to guess what will happen? Yep, you'll get refused. There's NOTHING unsafe about this prompt. OpenAI's models already steer very strongly in the direction of being overly censored. So where do we draw the line? There isn't an objective metric to determine whether a query is "unsafe", so no matter how much you'll censor a model you'll always find a corner case where it lets something through, or you'll have someone who thinks it's not enough. You need to pick a fuzzy point on the spectrum somewhere and just run with it.