
435 points | crawshaw | 1 comment
cadamsdotcom ◴[] No.44000221[source]
> "Oh, this test doesn't pass... let's just skip it," it sometimes says, maddeningly.

Here is a wild idea. Imagine running a companion, policy-enforcing LLM, independently and in parallel, which is instructed to keep the main LLM behaving according to policy.

The companion LLM could, in real time, ban the coding LLM from emitting "let's just skip it": on seeing the tokens "let's just", it would bias the output so that the token "skip" becomes impossible to emit.

Banning the word "skip" from following "let's just" forces the LLM down a new path, away from the undesired behavior.

It's like Structured Outputs or JSON mode, but driven by a companion LLM, and dynamically modified in real time as tokens are emitted.
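The banning step could be sketched as a logits processor that masks a continuation token whenever the just-generated tokens match a banned prefix. This is a minimal illustration, not any particular library's API; the token ids (101, 102 for "let's"/"just", 103 for "skip") are made up:

```python
import math

# Hypothetical token ids, for illustration only.
BANNED_PREFIX = [101, 102]  # "let's", "just"
BANNED_NEXT = 103           # "skip"

def apply_sequence_ban(generated_ids, logits):
    """If the tail of generated_ids matches the banned prefix,
    make the banned continuation token impossible to sample."""
    n = len(BANNED_PREFIX)
    if generated_ids[-n:] == BANNED_PREFIX:
        logits = dict(logits)
        logits[BANNED_NEXT] = -math.inf  # probability 0 after softmax
    return logits

# After "let's just", the logit for "skip" is masked out;
# tokens 104 and 105 are untouched.
logits = {103: 5.0, 104: 4.2, 105: 1.1}
out = apply_sequence_ban([7, 101, 102], logits)
```

Real decoding stacks expose hooks for exactly this kind of per-step masking (e.g. logits processors or per-request logit biases); the difference in the proposal is that the banned set would be chosen dynamically by another model rather than fixed up front.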

If the idea works, you could prompt the companion LLM to do more advanced things, e.g. ban the coding LLM from making tests pass by deleting the test code, or ban it from emitting pointless comments. All the policies we put into system prompts today and pray the LLM will follow would go into the companion LLM's prompt instead.
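The companion loop might look like this sketch: after each step, a policy model is consulted for continuations to ban, and those are masked before sampling. `policy_model` here is a hard-coded stand-in for what would really be a prompted second LLM:

```python
import math

def policy_model(text):
    """Stand-in for the companion LLM: return continuations that must not
    follow the current text. A real version would prompt a second model
    with the policy ("don't skip tests", "no pointless comments", ...)."""
    banned = []
    if text.endswith("let's just"):
        banned.append(" skip")
    return banned

def mask_banned(candidates, text):
    """candidates maps candidate next-tokens (as strings) to logits;
    set banned candidates to -inf so they can never be sampled."""
    banned = policy_model(text)
    return {tok: (-math.inf if tok in banned else score)
            for tok, score in candidates.items()}

# " skip" is the top candidate, but the policy forbids it here,
# so sampling would fall through to " fix".
masked = mask_banned({" skip": 3.0, " fix": 2.5}, "ok, let's just")
```

The cost of doing this per token with a second full model would be significant, which is presumably why today these policies live in system prompts instead.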

Wonder what the Outlines folks think of this!

1. phi0 ◴[] No.44005843[source]
A team at OpenAI tried something similar [1]: they had 4o monitor o3-mini for attempts to bypass tests that were too hard, then trained o3-mini with a penalty whenever the monitor caught it. All that did was make o3-mini keep reward-hacking, but in ways 4o could no longer detect. It went from saying:

> To skip all tests, we can define a `pytest_runtest_setup` function that always skips.

to saying:

> Idea: Starting a new build with a random first succession, calling random_new_succession should not increment succession, but in implementation, it does. Adjust to prevent increment. Implement fix to random_new_build not to increment succession, or test adapted to behavior.

while doing exactly the same thing (skipping the tests).

Even without training, this is only a temporary band-aid. If the incentive for reward hacking becomes high enough, the model will simply phrase it in different, harder-to-detect ways.

[1]: https://openai.com/index/chain-of-thought-monitoring/