OP makes a valid point
> Now we contend with the “who guards the guard” problem. Because LLMs are unreliable agents, it might so happen that Claude just scammed us by spitting out useless (or otherwise low-effort) test cases. [...] So it’s time to introduce some human input in the form of additional test cases, which is made extra convenient since the model already provided the overall structure of our test. If those cases pass, we can be reasonably confident in integrating this function into our codebase.
In our repos, I would love to have an LLM tool/product that helps out with test writing, but for now the workflow certainly needs a human in the loop. More like "here's a start on test coverage, add a few cases of your own" or "give me a few bullet points of cases that should pass or fail", with a human reviewing the test code, not "go ahead and merge these tests I wrote for you".
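Concretely, something like this made-up pytest sketch (`slugify` and its expected outputs are invented for illustration, not taken from the article): the model supplies the parametrized skeleton plus a couple of obvious cases, and the reviewer appends the cases they actually care about before anything gets merged.

```python
# Illustrative sketch only: slugify() is a stand-in for whatever function
# the model was asked to cover, defined inline so the file actually runs.
import re

import pytest


def slugify(raw: str) -> str:
    """Stand-in implementation for the sketch."""
    cleaned = re.sub(r"[^a-z0-9]+", "-", raw.strip().lower())
    return cleaned.strip("-")


@pytest.mark.parametrize(
    "raw, expected",
    [
        # --- model-generated starter cases ---
        ("Hello World", "hello-world"),
        ("already-a-slug", "already-a-slug"),
        # --- human-added cases during review ---
        ("  spaces   everywhere  ", "spaces-everywhere"),  # whitespace collapsing
        ("", ""),                                          # empty input
        ("Crème Brûlée!!", "cr-me-br-l-e"),                # non-ASCII: is this really what we want?
    ],
)
def test_slugify(raw, expected):
    assert slugify(raw) == expected
```

The value is in the last few rows: the model got the boilerplate out of the way, but a human decided which behaviors matter and whether the current behavior is even acceptable.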
Having an LLM generate the tests after you've already written the code for them is super counterproductive: those tests tend to just encode whatever the code currently does, bugs and all, so who knows whether they actually test anything?
I know this gets into "I wanted AI to do my laundry, not my art" territory, but a far more rational division of labor is for humans to write the tests (maybe with the assistance of an autocomplete model) and give those to the AI as context. Humans are way better at thinking of edge cases and design constraints than the models are at this point in the game.
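As a rough sketch of that division of labor (`parse_duration` and its contract are invented here, and the stub is deliberately left unimplemented): the human writes the failing tests up front, encoding the edge cases and error behavior they want, and the model's job is to produce an implementation that makes them pass.

```python
# Hypothetical "tests as the spec" sketch: a human writes these cases first
# and hands them to the model as context. parse_duration() is made up.
import pytest


def parse_duration(spec: str) -> int:
    """Deliberately unimplemented: the model is asked to replace this
    (or a real module is imported) so the tests below pass."""
    raise NotImplementedError


def test_plain_seconds():
    assert parse_duration("90s") == 90


def test_minutes_and_seconds():
    assert parse_duration("1m30s") == 90


def test_zero_is_allowed():
    assert parse_duration("0s") == 0


def test_rejects_prose():
    # The error behavior is part of the contract, not an afterthought.
    with pytest.raises(ValueError):
        parse_duration("ninety seconds")
```

Written that way, the tests are the prompt, and the human has already done the part the models are worst at: deciding what "correct" means.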