> I would start by making the examples yourself initially
What I'm doing right now is this:
1) I have a problem X that I want the coding agent to solve.
2) I ask the agent to do X
3) I use my own brain: did the agent do it correctly?
If the agent did not do it correctly, I then ask: should the agent have been able to solve this? If so, I try to improve the agent until it can.
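To make that concrete, here's a rough sketch of the loop I'm describing. The names (`run_agent`, `looks_correct`) are placeholders for whatever agent harness and manual check you use, not any particular tool:

```python
# Rough sketch of the manual eval loop described above.
# `run_agent` and `looks_correct` are hypothetical stand-ins:
#   run_agent(task)            -> the agent's attempt (e.g. a diff or branch)
#   looks_correct(task, attempt) -> my own judgment, done by hand today (step 3)

def eval_loop(tasks, run_agent, looks_correct):
    failures = []
    for task in tasks:
        attempt = run_agent(task)          # step 2: ask the agent to do X
        if looks_correct(task, attempt):   # step 3: currently my own brain
            continue
        # The agent failed; decide whether it *should* have been able to
        # solve this, and if so, keep the case around to improve the agent.
        failures.append((task, attempt))
    return failures
```

Automating `looks_correct` is exactly the part that doesn't generalize across tasks.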
The hardest part about automating this is #3 above: each evaluation is a one-off, and it would be hard to even formalize it.
SWE-bench, for example, uses unit tests for this, and the agent is blind to them: it has to make a red test (which it has never seen) go green.
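As I understand that setup, evaluation roughly amounts to: let the agent produce a fix without ever seeing the test patch, then apply the held-out tests and check whether the previously failing ones now pass. A hedged sketch, not the actual SWE-bench harness (it assumes plain git patches and pytest as the runner):

```python
# Illustrative sketch of hidden-unit-test evaluation, not the real
# SWE-bench harness. Assumes the agent's fix and the held-out test
# patch are git patches, and that pytest runs the tests.
import subprocess

def evaluate(repo_dir, agent_patch, hidden_test_patch, failing_tests):
    # Apply the agent's fix first; the agent never saw the test patch.
    subprocess.run(["git", "apply", agent_patch], cwd=repo_dir, check=True)
    # Now apply the held-out tests and run only the previously red ones.
    subprocess.run(["git", "apply", hidden_test_patch], cwd=repo_dir, check=True)
    result = subprocess.run(["pytest", *failing_tests], cwd=repo_dir)
    return result.returncode == 0  # red tests went green => solved
```

That works because those repos already have test suites with clear pass/fail semantics; my one-off tasks mostly don't, which is why #3 stays manual for now.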