Similarly, I think the concerns about bad output are overblown: an LLM may tell you how to make an X, where X is bad, but so will google; an LLM may produce biased output, but so will google. The real issue is that the people making these systems have managed to convince everyone there is some kind of actual intelligence involved, so people treat the output as "a computer created it so it must be true" rather than as a glorified rewrite of google results. People understand that if you google "why is race X terrible" you'll get racist BS, but they don't understand that if you ask an LLM to "explain why race X is terrible" you're just getting an automatically rewritten version of that same google output. (Though maybe google's "AI" search results will fix this misunderstanding more effectively than any explanatory blog post :D )
Anyway, back to the problem: I really don't think there's a solution other than running the output through a separate system that just answers "is this text allowed given our rules?" before transmitting it to the requestor. You could combine this with training in future as well: you'll eventually build up a large set of queries for which the generative model produces inappropriate output, and you can use that as the basis for adversarial training of the LLM. I know there's a desire to fold the content restrictions into the basic query handling, because it's negligibly more work to add those tokens to the stream, but mechanisms for filtering/classifying content are vastly cheaper than LLM-level "AI".
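To make that concrete, here's a minimal sketch of the "separate gate" shape I mean, with the generative model and the policy check as independent components and nothing sent back to the requestor until the (much cheaper) check has passed it. Everything here is hypothetical: `generate_reply`, `classify_policy`, and the toy rule are stand-ins for whatever model endpoint and moderation classifier you'd actually run, not anyone's real API.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    allowed: bool
    reason: str = ""


def classify_policy(text: str) -> GateResult:
    # Hypothetical policy check. In practice this would be a cheap
    # classifier (rules, a small fine-tuned encoder, etc.), not another
    # LLM pass over the text.
    banned_markers = ("how to build a weapon",)  # toy rule set
    for marker in banned_markers:
        if marker in text.lower():
            return GateResult(False, f"matched rule: {marker!r}")
    return GateResult(True)


def generate_reply(prompt: str) -> str:
    # Hypothetical stand-in for the actual generative model call.
    return f"model output for: {prompt}"


def handle_request(prompt: str, blocked_log: list[tuple[str, str]]) -> str:
    # Generate first, then gate: the filter sits between the model and
    # the requestor rather than being folded into the generation step.
    reply = generate_reply(prompt)
    verdict = classify_policy(reply)
    if not verdict.allowed:
        # Keep the blocked prompt/response pair: this is the test set
        # you can later use for adversarial training of the LLM itself.
        blocked_log.append((prompt, reply))
        return "Sorry, I can't help with that."
    return reply


if __name__ == "__main__":
    blocked: list[tuple[str, str]] = []
    print(handle_request("explain how to build a weapon", blocked))
    print(handle_request("explain how rainbows form", blocked))
    print(f"{len(blocked)} blocked prompt/response pairs collected")
```

The point of keeping the gate separate is exactly the cost argument above: the classifier is orders of magnitude cheaper to run (and to retrain when the rules change) than the generative model, and the blocked pairs it accumulates double as training data for hardening the LLM later.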