If you're short on time, I'd recommend just reading the linked blogpost or the announcement thread here [1], rather than the full paper.
Also, cool work — very happy to see actually good evaluations instead of just vibes or observational studies that don't account for the Hawthorne effect.
We'll be releasing anonymized data and some basic analysis code to replicate core results within the next few weeks (probably next week, depending).
Our GitHub is here (http://github.com/METR/) -- or you can follow us (https://x.com/metr_evals) and we'll probably tweet about it.
If the instruction is just "implement this ticket with AI", then that's very realistic in that it's how management often tries to operate, but it's also likely to be quite suboptimal. There are ways of using AI that help a lot, and other ways that hurt more than they help.
If your developers had sufficient experience with AI to tell the difference, then they might have compensated for that, but I didn't see any indication of this in the paper.
That being said, we can't rule out that the experiment drove them to use more AI than they would have outside of the experiment (in a way that made them less productive). You can see more in the section "Experimentally driven overuse of AI (C.2.1)" [1].
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
TLDR: mixed evidence, from both quantitative and qualitative reports, on whether AI made the work less effortful for developers. Unclear effect.