
419 points serjester | 3 comments
peterjliu No.43538425
We (ex-Google DeepMind researchers) have been doing research on increasing the reliability of agents and realized it's pretty non-trivial, but there are a lot of techniques to improve it. The most important thing is doing rigorous evals that are representative of what your users do in your product. Often this is not the same as academic benchmarks. We made our own benchmarks to measure progress.
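
A harness for that kind of benchmark can be tiny; roughly something like this (EvalCase, run_eval, and the sample case are made-up placeholders, not anyone's actual setup):

  # Sketch: an eval is just product-representative prompts plus a
  # programmatic check per prompt; progress = pass rate on your own set.
  from dataclasses import dataclass
  from typing import Callable

  @dataclass
  class EvalCase:
      prompt: str                    # a task a real user would actually ask
      check: Callable[[str], bool]   # True if the agent's output is acceptable

  def run_eval(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
      passed = sum(case.check(agent(case.prompt)) for case in cases)
      return passed / len(cases)

  # Example case drawn from product usage rather than an academic benchmark:
  cases = [
      EvalCase(
          prompt="Summarize the key claims in doc_123.pdf and cite page numbers.",
          check=lambda out: "page" in out.lower(),
      ),
  ]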

Plug: We just posted a demo of our agent doing sophisticated reasoning over a huge dataset (the JFK assassination files -- 80,000 PDF pages): https://x.com/peterjliu/status/1906711224261464320

Even on small numbers of files, I think there's a palpable difference in reliability/accuracy vs. the big AI players.

replies(1): >>43538606 #
1. ai-christianson No.43538606
> The most important thing is doing rigorous evals that are representative of what your users do in your product. Often this is not the same as academic benchmarks.

OMFG thank you for saying this. As a core contributor to RA.Aid, I think optimizing it for SWE-bench would actively hurt performance on real-world tasks. RA.Aid came about in the first place as a pragmatic programming tool (I created it while building another software startup, Fictie). It works well because it was literally made and tested by building other software, and these days it mostly writes its own code.

Do you have any tips or suggestions on how to do more formalized evals, but on tasks that resemble real world tasks?

replies(1): >>43539019 #
2. peterjliu No.43539019
I would start by making the examples yourself initially, assuming you have a good sense of what that real-world task is. If you can't articulate what a good task is and what a good output looks like, it's not ready for outsourcing to crowd workers.

And before going to crowd workers (maybe you can skip them entirely), try LLMs.
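
For example, an LLM-as-judge pass against a rubric you wrote yourself can stand in for crowd workers at first. A sketch, assuming the OpenAI Python client (the rubric and model name are placeholders):

  # Sketch: grade agent outputs with an LLM judge against a hand-written
  # rubric before paying for human annotation.
  from openai import OpenAI

  client = OpenAI()

  RUBRIC = """You are grading an AI agent's output.
  Task: {task}
  Output: {output}
  Reply PASS if the output fully solves the task, otherwise FAIL."""

  def llm_judge(task: str, output: str, model: str = "gpt-4o-mini") -> bool:
      resp = client.chat.completions.create(
          model=model,
          temperature=0,
          messages=[{"role": "user",
                     "content": RUBRIC.format(task=task, output=output)}],
      )
      return resp.choices[0].message.content.strip().upper().startswith("PASS")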

replies(1): >>43539148 #
3. ai-christianson No.43539148
> I would start by making the examples yourself initially

What I'm doing right now is this:

  1) I have problem X to solve using the coding agent.
  2) I ask the agent to do X.
  3) I use my own brain: did the agent do it correctly?
If the agent did not do it correctly, I then ask: should the agent have been able to solve this? If so, I try to improve the agent so it's able to do that.

The hardest part about automating this is #3 above -- each evaluation is a one-off, and it would be hard to even formalize the evaluation.
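
One way to chip away at that: every time you judge a run by hand, freeze that judgment into a small checker and keep it as a regression case. A sketch, with all names made up:

  # Sketch: each one-off human judgment becomes a frozen checker, so the
  # growing suite can be re-run whenever the agent changes.
  CASES = {
      # "did the agent add the flag I asked for?"
      "add-verbose-flag": lambda repo: "--verbose" in (repo / "cli.py").read_text(),
      # "did it create the migration file at all?"
      "create-migration": lambda repo: any(repo.glob("migrations/*.sql")),
  }

  def regression_pass_rate(run_agent, fresh_checkout) -> float:
      results = []
      for name, check in CASES.items():
          repo = fresh_checkout(name)      # clean copy of the repo (a pathlib.Path)
          run_agent(task=name, cwd=repo)   # step 2: ask the agent to do X
          results.append(check(repo))      # step 3, but automated
      return sum(results) / len(results)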

SWE-bench, for example, uses unit tests for this, and the agent is blind to them -- so the agent has to make a red test (which it has never seen) go green.
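
Mechanically that check looks something like this (run_agent and the paths are hypothetical; assumes pytest is available):

  # Sketch of the hidden-unit-test check: the agent edits a copy of the repo
  # without ever seeing the held-out test, then that test grades the result.
  import shutil, subprocess, tempfile
  from pathlib import Path

  def eval_task(repo: Path, hidden_test: Path, task: str, run_agent) -> bool:
      workdir = Path(tempfile.mkdtemp()) / "repo"
      shutil.copytree(repo, workdir)        # agent works on a copy...
      run_agent(task, cwd=workdir)          # ...with no access to hidden_test
      shutil.copy(hidden_test, workdir / "test_hidden.py")
      result = subprocess.run(["pytest", "-q", "test_hidden.py"], cwd=workdir)
      return result.returncode == 0         # the red test has to go green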