If you're short on time, I'd recommend just reading the linked blogpost or the announcement thread here [1], rather than the full paper.
Also, cool work — very happy to see actually good evaluations instead of just vibes or observational studies that don't account for the Hawthorne effect.
We'll be releasing anonymized data and some basic analysis code to replicate core results within the next few weeks (probably next week, depending).
Our GitHub is here (http://github.com/METR/) -- or you can follow us (https://x.com/metr_evals) and we'll probably tweet about it.
If the instruction is just "implement this ticket with AI", then that's very realistic in that it's how management often tries to operate, but it's also likely to be quite suboptimal. There are ways of using AI that help a lot, and other ways that hurt more than they help.
If your developers had sufficient experience with AI to tell the difference, then they might have compensated for that, but I didn't see any indication of this in the paper.
That being said, we can't rule out that the experiment drove them to use more AI than they would have outside of the experiment (in a way that made them less productive). You can see more in the section "Experimentally driven overuse of AI (C.2.1)" [1].
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
TLDR: mixed evidence, from both quantitative and qualitative reports, on whether AI made the work less effortful for developers. Unclear effect.