Here is a wild idea. Imagine running a companion, policy-enforcing LLM, independently and in parallel, which is given instructions to keep the main LLM behaving according to instructions.
If the companion LLM could - in real time - ban the coding LLM from emitting "let's just skip it" by seeing the tokens "let's just" and then biasing the output such that the word "skip" becomes impossible to emit.
Banning the word "skip" from following "let's just", forces the LLM down a new path away from the undesired behavior.
It's like Structured Outputs or JSON mode, but driven by a companion LLM, and dynamically modified in real time as tokens are emitted.
If the idea works, you could prompt the companion LLM to do more advanced stuff - eg. ban a coding LLM from making tests pass by deleting the test code, ban it from emitting pointless comments... all the policies that we put into system prompts today and pray the LLM will do, would go into the companion LLM's prompt instead.
Wonder what the Outlines folks think of this!