Measuring the impact of AI on experienced open-source developer productivity

(metr.org)

688 points dheerajvs | 1 comments | 10 Jul 25 16:29 UTC

simonw ◴[10 Jul 25 17:36 UTC] No.44523442[source]▶

Here's the full paper, which has a lot of details missing from the summary linked above: https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf

My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.

This study had 16 participants, with a mix of previous exposure to AI tools - 56% of them had never used Cursor before, and the study was mainly about Cursor.

They then had those 16 participants work on issues (about 15 each), where each issue was randomly assigned a "you can use AI" vs. "you can't use AI" rule.

So each developer worked on a mix of AI-tasks and no-AI-tasks during the study.

A quarter of the participants saw increased performance, 3/4 saw reduced performance.
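To make that design concrete, here is a minimal sketch of per-issue random assignment and a per-developer comparison. Everything in it is illustrative: the function names, the seed, and the completion times are made up, and the ratio shown is a simple stand-in, not the paper's data or its actual analysis method.

```python
import random
from statistics import mean


def assign_conditions(issue_ids, seed=0):
    """Randomly label each issue 'ai' or 'no_ai' (hypothetical helper)."""
    rng = random.Random(seed)
    return {issue: rng.choice(["ai", "no_ai"]) for issue in issue_ids}


def slowdown_ratio(times_by_condition):
    """Mean AI-allowed time divided by mean AI-disallowed time.

    A value above 1.0 means the developer was slower on AI-allowed issues.
    """
    return mean(times_by_condition["ai"]) / mean(times_by_condition["no_ai"])


# Made-up completion times (minutes) for one imaginary developer.
example_times = {"ai": [60, 90, 120], "no_ai": [50, 80, 110]}
ratio = slowdown_ratio(example_times)  # 90 / 80 = 1.125, i.e. slower with AI
```

With 16 participants, "a quarter" is 4 developers and "3/4" is 12, so a single experienced outlier carries a lot of weight in any per-developer breakdown.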

One of the top performers for AI was also someone with the most previous Cursor experience. The paper acknowledges that here:

> However, we see positive speedup for the one developer who has more than 50 hours of Cursor experience, so it's plausible that there is a high skill ceiling for using Cursor, such that developers with significant experience see positive speedup.

My intuition here is that this study mainly demonstrated that the learning curve on AI-assisted development is steep enough that asking developers to bake it into their existing workflows reduces their performance while they climb that learning curve.

grey-area ◴[10 Jul 25 18:25 UTC] No.44524005[source]▶

>>44523442 #

Well, there are two possible interpretations here of 75% of participants (all of whom had some experience using LLMs) being slower using generative AI:

1. LLMs have a very steep and long learning curve, as you posit (though note the points from the paper authors in the other reply).

2. Current LLMs are just not as good as programming assistants as they are sold to be, and people consistently predict and self-report in the wrong direction on how useful they are.

steveklabnik ◴[10 Jul 25 19:21 UTC] No.44524552[source]▶

>>44524005 #

> Current LLMs

One thing that happened here is that they aren't using current LLMs:

> Most issues were completed in February and March 2025, before models like Claude 4 Opus or Gemini 2.5 Pro were released.

That doesn't mean this study is bad! In fact, I'd be very curious to see it done again, but with newer models, to see if that has an impact.

blibble ◴[10 Jul 25 19:47 UTC] No.44524740[source]▶

>>44524552 #

> One thing that happened here is that they aren't using current LLMs

I've been hearing this for 2 years now

the previous model retroactively becomes total dogshit the moment a new one is released

convenient, isn't it?

simonw ◴[10 Jul 25 19:49 UTC] No.44524758[source]▶

>>44524740 #

The previous model retroactively becomes not as good as the best available models. I don't think that's a huge surprise.

foobarqux ◴[10 Jul 25 20:24 UTC] No.44525150{3}[source]▶

>>44524758 #

That's not the argument being made though, which is that it does "work" now and implying that actually it didn't quite work before; except that that is the same thing the same people say for every model release, including at the time of release of the previous one, which is now acknowledged to be seriously flawed; and including the future one, at which time the current models will similarly be acknowledged to be not only less performant than the future models, but inherently flawed.

Of course it's possible that at some point you get to a model that really works, irrespective of the history of false claims from the zealots, but it does mean you should take their comments with a grain of salt.

steveklabnik ◴[10 Jul 25 20:43 UTC] No.44525369{4}[source]▶

>>44525150 #

> That's not the argument being made though, which is that it does "work" now and implying that actually it didn't quite work before

Right.

> except that that is the same thing the same people say for every model release,

I did not say that, no.

I am sure you can find someone who is in a Groundhog Day about this, but it’s just simpler than that: as tools improve, more people find them useful than before. You’re not talking to the same people, you are talking to new people each time who now have had their threshold crossed.

blibble ◴[10 Jul 25 21:04 UTC] No.44525598{5}[source]▶

>>44525369 #

> You’re not talking to the same people, you are talking to new people each time who now have had their threshold crossed.

no, it's the same names, again and again

simonw ◴[10 Jul 25 21:36 UTC] No.44525880{6}[source]▶

>>44525598 #

Got receipts?

That sounds like a claim you could back up with a little bit of time spent using Hacker News search or similar.

(I might try to get a tool like o3 to run those searches for me.)
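One way to run that kind of search programmatically is the Algolia-hosted Hacker News Search API. The sketch below only builds the query URL; the helper name, the example username, and the search phrase are placeholders rather than anyone or anything from this thread, and fetching and reading the results is left out.

```python
from urllib.parse import urlencode

HN_SEARCH_ENDPOINT = "https://hn.algolia.com/api/v1/search_by_date"


def build_comment_search_url(author, phrase, hits_per_page=50):
    """Build a URL that searches one user's comments for a phrase, newest first."""
    params = {
        "query": phrase,
        # Comma-separated tags are ANDed: comments only, by this author.
        "tags": f"comment,author_{author}",
        "hitsPerPage": hits_per_page,
    }
    return f"{HN_SEARCH_ENDPOINT}?{urlencode(params)}"


# Placeholder inputs, purely for illustration.
url = build_comment_search_url("example_user", "finally works")
```

Comparing the dates on the matching comments against model release dates would be the actual "receipts" step, and that part still needs a human (or a tool) to read the results in context.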

blibble ◴[10 Jul 25 21:52 UTC] No.44526026{7}[source]▶

>>44525880 #

try asking it what sealioning is

maxbond ◴[11 Jul 25 01:38 UTC] No.44527616{8}[source]▶

>>44526026 #

You've no obligation to answer, no one is entitled to your time, but it's a reasonable request. It's not sealioning to respectfully ask for directly relevant evidence that takes about 10-15m to get.
