
688 points dheerajvs | 13 comments
simonw No.44523442
Here's the full paper, which has a lot of details missing from the summary linked above: https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf

My personal theory is that getting a significant productivity boost from LLM assistance and AI tools has a much steeper learning curve than most people expect.

This study had 16 participants, with a mix of previous exposure to AI tools - 56% of them had never used Cursor before, and the study was mainly about Cursor.

They then had those 16 participants work on issues (about 15 each), where each issue was randomly assigned a "you can use AI" vs. "you can't use AI" rule.

So each developer worked on a mix of AI-tasks and no-AI-tasks during the study.

A quarter of the participants saw increased performance; three quarters saw reduced performance.

One of the top performers with AI was also the developer with the most previous Cursor experience. The paper acknowledges that here:

> However, we see positive speedup for the one developer who has more than 50 hours of Cursor experience, so it's plausible that there is a high skill ceiling for using Cursor, such that developers with significant experience see positive speedup.

My intuition here is that this study mainly demonstrated that the learning curve on AI-assisted development is steep enough that asking developers to bake it into their existing workflows reduces their performance while they climb that learning curve.

replies(33): >>44523608 #>>44523638 #>>44523720 #>>44523749 #>>44523765 #>>44523923 #>>44524005 #>>44524033 #>>44524181 #>>44524199 #>>44524515 #>>44524530 #>>44524566 #>>44524631 #>>44524931 #>>44525142 #>>44525453 #>>44525579 #>>44525605 #>>44525830 #>>44525887 #>>44526005 #>>44526996 #>>44527368 #>>44527465 #>>44527935 #>>44528181 #>>44528209 #>>44529009 #>>44529698 #>>44530056 #>>44530500 #>>44532151 #
grey-area No.44524005
Well, there are two possible interpretations here of 75% of participants (all of whom had some experience using LLMs) being slower when using generative AI:

1. LLMs have a very steep and long learning curve, as you posit (though note the points from the paper authors in the other reply).

2. Current LLMs are just not as good as they're sold to be as programming assistants, and people consistently predict and self-report in the wrong direction on how useful they are.

replies(6): >>44524525 #>>44524552 #>>44525186 #>>44525216 #>>44525303 #>>44526981 #
steveklabnik No.44524552
> Current LLMs

One thing that happened here is that they aren't using current LLMs:

> Most issues were completed in February and March 2025, before models like Claude 4 Opus or Gemini 2.5 Pro were released.

That doesn't mean this study is bad! In fact, I'd be very curious to see it done again, but with newer models, to see if that has an impact.

replies(1): >>44524740 #
blibble No.44524740
> One thing that happened here is that they aren't using current LLMs

I've been hearing this for 2 years now

the previous model retroactively becomes total dogshit the moment a new one is released

convenient, isn't it?

replies(10): >>44524758 #>>44524891 #>>44524893 #>>44524975 #>>44525030 #>>44525035 #>>44526195 #>>44526545 #>>44526712 #>>44535270 #
steveklabnik No.44524893
Sorry, that’s not my take. I didn’t think these tools were useful until the latest set of models, that is, they crossed the threshold of usefulness to me.

Even then though, “technology gets better over time” shouldn’t be surprising, as it’s pretty common.

replies(2): >>44525041 #>>44525078 #
1. mattmanser No.44525041
Do you really see a massive jump?

For context, I've been using AI (a mix of OpenAI + Claude) for over a year now, mainly for bashing out quick React stuff. For anything else it's generally rubbish and slower than working without it. Though I still use it to rubber-duck, so I'm still seeing the level of quality for backend work.

I'd say they're only marginally better today than they were even 2 years ago.

Every time a new model comes out you get a bunch of people raving about how great the new one is, and I honestly can't really tell the difference. The only real difference is that reasoning models actually slowed everything down, but now I see their reasoning. That's only useful because I often spot them leaving important stuff out of the final answer.

replies(5): >>44525090 #>>44525193 #>>44525866 #>>44526601 #>>44531993 #
2. hombre_fatal No.44525090
I see a massive jump every time.

Just two years ago, this trick question failed (the text is Spanish for "this is written in English"):

> Me: What language is this: "esto está escrito en inglés"

> LLM: English

Gemini and Opus have solved questions that took me weeks to solve myself. And I'll feed some complex code into each new iteration and it will catch a race condition I missed even with testing and line by line scrutiny.

Consider how many years of experience a software engineer needs before they can catch a hard race condition just from reading code, versus someone who couldn't do it after trying 100 times. We take it for granted because we grade it as "caught it or didn't", but these are massive jumps in capability.
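
To make that concrete, here's a minimal C++ sketch (illustrative, not from any real codebase) of the check-then-act shape these bugs often take: each operation is atomic on its own, but the pair is not, so the invariant can still break.

    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    std::atomic<int> active_jobs{0};
    constexpr int kMaxJobs = 4;  // intended invariant: active_jobs <= kMaxJobs

    void try_start_job() {
        if (active_jobs.load() < kMaxJobs) {   // check
            // Another thread can pass the same check right here,
            // before either increment lands.
            active_jobs.fetch_add(1);          // act
        }
    }

    int main() {
        std::vector<std::thread> workers;
        for (int i = 0; i < 8; ++i) workers.emplace_back(try_start_job);
        for (auto& t : workers) t.join();
        // Can print a value above kMaxJobs: the invariant broke silently.
        std::printf("active_jobs = %d\n", active_jobs.load());
    }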

3. steveklabnik No.44525193
Yes. In January I would have told you AI tools are bullshit. Today I’m on the $200/month Claude Max plan.

As with anything, your mileage may vary: I'm not here to tell anyone who thinks they still suck that their experience is invalid, but to me it's been a pretty big swing.

replies(2): >>44525395 #>>44526058 #
4. Uehreka No.44525395
> In January I would have told you AI tools are bullshit. Today I’m on the $200/month Claude Max plan.

Same. For me the turning point was VS Code’s Copilot Agent mode in April. That changed everything about how I work, though it had a lot of drawbacks due to its glitches (many of these were fixed within 6 or so weeks).

When Claude Sonnet 4 came out in May, I could immediately tell it was a step-function increase in capability. It was the first time an AI, faced with ambiguous and complicated situations, would be willing to answer a question with a definitive and confident “No”.

After a few weeks, it became clear that VS Code’s interface and usage limits were becoming the bottleneck. I went to my boss, bullet points in hand, and easily got approval for the Claude Max $200 plan. Boom, another step-function increase.

We’re living in an incredibly exciting time to be a skilled developer. I understand the need to stay skeptical and measure the real benefits, but I feel like a lot of people are getting caught up in the culture war aspect and are missing out on something truly wonderful.

5. simonw No.44525866
The massive jump in the last six months is that the new set of "reasoning" models got really good at reasoning about when to call tools, and were accompanied by a flurry of tools-in-a-loop coding agents: Claude Code, OpenAI Codex, Cursor in Agent mode, etc.

An LLM that can test the code it is writing and then iterate to fix the bugs turns out to be a huge step forward from LLMs that just write code without ever exercising it.
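
The core loop is simple to sketch. This is a hypothetical outline (call_llm, apply_patch, and the make test command are stand-ins, not any particular product's API): propose a patch, run the tests, feed the failures back, repeat until green or out of budget.

    #include <cstdlib>
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>

    // Hypothetical stand-ins: a real agent calls a model API and
    // applies the returned patch to the working tree.
    std::string call_llm(const std::string& feedback) { return "/* patch */"; }
    void apply_patch(const std::string& patch) { /* write files to disk */ }

    std::string read_file(const std::string& path) {
        std::ifstream in(path);
        std::ostringstream ss;
        ss << in.rdbuf();
        return ss.str();
    }

    int main() {
        std::string feedback = "Implement the feature described in TASK.md";
        for (int attempt = 1; attempt <= 5; ++attempt) {
            apply_patch(call_llm(feedback));                    // model proposes code
            int status = std::system("make test 2> test.log");  // exercise it
            if (status == 0) {                                  // tests pass: done
                std::cout << "green after " << attempt << " attempt(s)\n";
                return 0;
            }
            // Feed the failure output back so the next attempt can fix it.
            feedback = "Tests failed:\n" + read_file("test.log");
        }
        std::cout << "out of budget, handing back to the human\n";
        return 1;
    }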

6. mattmanser No.44526058
Ok, I'll have to try it out then. I've got a side project I've 3/4 finished and will let it loose on it.

So are you using Claude Code via the Max plan, Cursor, or what?

I think I'd definitely hit AI news exhaustion and was viewing people raving about this agentic stuff as yet more AI fanbois. I'd just continued using the AI separately, as setting up a new IDE seemed like too much work for the fractional gains I'd been seeing.

replies(3): >>44526143 #>>44528857 #>>44536836 #
7. steveklabnik No.44526143{3}
I had a bad time with Cursor. I use Claude Code inside of VS Code. You don't necessarily need Max, but you can spend a lot of money very quickly on API tokens, so I'd recommend to anyone trying it: start with the $20/month plan, no need to spend a ton of money just to try something out.

There is a skill gap. I think of it like vim: at first it slows you down, but then as you learn it, you end up speeding up. So you may also find that it doesn't really vibe with the way you work, even if I'm having a good time with it. I know people who are great engineers who still don't like this stuff, just like I know ones who do.

replies(1): >>44527419 #
8. vidarh No.44526601
I've gone from asking the tools how to do things and cut-and-pasting the (often small) helpful bits, via using assistants whose every decision I'd review (and often having to start over), to now often starting an assistant with broad permissions and just reviewing the diff later, after it has made the changes pass the test suite, run a linter and fixed everything it flagged, and written a draft commit message.

The jump has been massive.

9. mh- No.44527419{4}
Worth noting for the folks asking: there's an official Claude Code extension for VS Code now [0]. I haven't tried it personally, but that's mostly because I mainly use the terminal and vim.

[0]: https://marketplace.visualstudio.com/items?itemName=anthropi...

replies(1): >>44527989 #
10. steveklabnik No.44527989{5}
Yes, it’s not necessary but it is convenient for viewing diffs in Code’s diff view. The terminal is a fine way to interact with it though.
11. 8note No.44528857{3}
I'd say that's not going to be the best use for it, unless what you really want is to first document everything about it in detail.

I'm using Claude + VS Code's Cline extension for the most part. Where it tends to excel is helping you write documentation, and then using that documentation to write reasonable code.

If you're 3/4 of the way done, a lot of the docs it wants in order to work well are going to be missing, and so will a lot of your intentions about why you did or didn't make certain choices. If you've got good docs, make sure to feed those in as context.

The agentic tool on its own is still kinda meh if you only try to write code directly with it. It's definitely better than the non-agentic stuff, but if you start by having it document things, and ask you questions about what it should know in order to make the change, it's pretty good.

Even if you don't get perfect code, or it spins in a feedback loop where it's lost the plot, the questions it asks can be super handy in surfacing code patterns you haven't thought about that apply to your code, and things that would otherwise be undefined behaviour.

My raving is that I get to leave behind useful docs in my code packages, and my team members get access to and actually use those docs, without the usual discoverability problems. And I get those docs for... somewhat slower than I could have written the code myself, but much, much faster than if I also had to write those docs.

12. otabdeveloper4 No.44531993
> but now I see its reasoning

It's not showing its reasoning. "Reasoning" models are trained to output more tokens, in the hope that more tokens means fewer hallucinations.

It's just a marketing trick, and there is no evidence this sort of fake "reasoning" actually gives any benefit.

13. jpc0 No.44536836{3}
Take this with a massive grain of salt, but here's my recent experience with Google's CLI coding agent (we pay for Google products but not others internally, and I can't change that decision).

I asked it to implement two biquad filters, a high-pass filter and a high-shelf filter. For context: using the Gemini webapp, it would spit out the exact code I need, with the interfaces I require, one-shot, because this is truly trivial C++ code to write.
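
For scale, here's roughly the shape of the code in question: a minimal direct-form-I biquad sketch using the standard RBJ Audio EQ Cookbook high-pass coefficients (the high-shelf variant differs only in the coefficient formulas). The names and interfaces are illustrative, not the ones from my actual project.

    #include <cmath>

    struct Biquad {
        double b0, b1, b2, a1, a2;               // normalized coefficients (a0 == 1)
        double x1 = 0, x2 = 0, y1 = 0, y2 = 0;   // delay-line state

        double process(double x) {               // direct form I
            double y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2;
            x2 = x1; x1 = x;
            y2 = y1; y1 = y;
            return y;
        }
    };

    // High-pass biquad: cutoff f0 (Hz), sample rate fs (Hz), quality q.
    Biquad make_highpass(double f0, double fs, double q) {
        const double pi = std::acos(-1.0);
        const double w0 = 2.0 * pi * f0 / fs;
        const double alpha = std::sin(w0) / (2.0 * q);
        const double c = std::cos(w0);
        const double a0 = 1.0 + alpha;
        return {(1.0 + c) / (2.0 * a0), -(1.0 + c) / a0, (1.0 + c) / (2.0 * a0),
                (-2.0 * c) / a0, (1.0 - alpha) / a0};
    }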

15 million tokens and an hour and a half later, I had a project that could not build, filters that were not implemented, and my trust in agentic AI workflows broken.

It cost me nothing; I just reset the repo, and I was watching YouTube videos for that hour and a half anyway.

Your mileage may vary, and I'm very sure that if this had been Go or TypeScript it might have done significantly better, but even compared to the exact same model in a chat interface my experience was horrible.

I'm sticking with the slightly "worse" experience of the chat interface, which does give me significant productivity improvements, over letting the agent burn money and time without producing working code.