689 points dheerajvs | 2 comments
lmeyerov:
As someone who has been doing hardcore genAI for 2+ years, my experience has been, and what we advise internally:

* It takes about 3 weeks to transition from AI pairing to AI delegation to AI multitasking, so work gains mostly show up in week 3+. That's 120+ hours in, as someone pretty senior here.

* Speedup is the wrong metric. Think throughput, not latency. Any given piece of work might take longer, but the volume of work should go up because AI can do more on a task and on different tasks/projects in parallel (toy example below).

Both perspectives seem consistent with the paper description...
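
To make the throughput-vs-latency point concrete, here's a toy back-of-envelope; all numbers are made up for illustration, not measurements:

    # Toy numbers only: per-task latency can get worse while throughput improves.
    manual_hours = 4.0          # hands-on time to finish one task yourself
    agent_wall_hours = 6.0      # wall-clock time for an agent run on the same task (slower!)
    attention_hours = 1.0       # planning + review you actually spend per agent-run task

    print(agent_wall_hours / manual_hours)       # 1.5x worse latency per task

    manual_throughput = 1 / manual_hours         # 0.25 tasks per hour of your time
    agent_throughput = 1 / attention_hours       # 1.0 tasks per hour of your time,
                                                 # if parallel runs keep you saturated
    print(agent_throughput / manual_throughput)  # ~4x throughput

The win only materializes if you actually have parallel work to fill the wall-clock gaps, which is the multitasking point above.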

phyzome:
Have you actually measured this?

Because one of the big takeaways from this study is that people are bad at predicting and observing their own time spent.

lmeyerov:
Yes, I keep prompt plan logs.

At the same time... that's not why I'm comfortable writing this. It's pretty obvious when you know what good vs bad feels like here and adjust accordingly:

1. Good: You are able to generate a long plan and that plan mostly works. These are big wins _as long as you are multitasking_: you are high throughput, even if the AI is slow. Think runs of 5-20 min at a time making pretty good progress, in exchange for just a few minutes of your planning that you'd largely have to do anyway.

2. Bad: You are wasting a lot of attention chatting (so 1-2 min runs) and repairing (re-planning from the top vs. progressing). There is no multitasking win.

It's pretty clear which situation you're in; run duration on its own is a ~10X difference.

Ex: I'll have ~3 projects going at the same time, and/or whatever else I'm doing. I'm not interacting "much", so I know it's a win. If a project requires interaction, well, now I need to jump in, and it's no longer agentic coding IMO, but chat-assistant stuff.
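
A minimal sketch of how you could eyeball this from run logs; the log format and the 5-minute cutoff are my assumptions, not a description of any particular tooling:

    # Classify which regime you're in from a list of agent run durations (minutes).
    from statistics import median

    run_minutes = [18, 12, 7, 22, 2, 15]   # e.g. parsed from prompt/plan logs

    med = median(run_minutes)
    short = sum(1 for m in run_minutes if m < 5)
    if med >= 5 and short <= len(run_minutes) // 4:
        print(f"median {med} min: long autonomous runs -> multitasking win (case 1)")
    else:
        print(f"median {med} min: mostly short chat/repair loops (case 2)")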

At the same time, I power through case #2 in practice because we're investing in AI automation. We're retooling everything to enable long runs, so we'll still do the "hard" tasks via AI in order to identify & smooth the bumps. Similar to infrastructure-as-code and SDLC tooling, we're automating as much of our stack as we can, which means figuring out prompt templates, CI tooling, etc. so the AI can handle these tasks and we benefit later.
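
For flavor, here's a minimal sketch of the kind of guardrail gate an agent or CI job could run after each long run; the specific tools (ruff/mypy/pytest) are illustrative stand-ins, not necessarily our actual stack:

    # One command that either passes or fails a long agent run.
    # Swap in whatever lint/type/test stack your repo uses.
    import subprocess
    import sys

    CHECKS = [
        ["ruff", "check", "."],   # lint
        ["mypy", "."],            # types
        ["pytest", "-q"],         # tests
    ]

    def gate() -> int:
        for cmd in CHECKS:
            if subprocess.run(cmd).returncode != 0:
                print("guardrail failed:", " ".join(cmd))
                return 1
        print("all guardrails passed")
        return 0

    if __name__ == "__main__":
        sys.exit(gate())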

phyzome:
Oh, that's not quite what I was asking about -- I was wondering if you've compared AI vs no-AI for tasks, and kept measurements of that. But it sounds like you're not in a position to do so.

lmeyerov:
Gotcha - and yep, I did a small internal, natural, qualitative experiment with our AI head. It lines up with what I wrote above: the benefit is throughput, after a 2-3 week investment:

* Setup: I did a whole-cloth, clean-room, production-grade rewrite of an MCP server that had previously been prototyped by the AI head. I intentionally went all-in on this "first serious" agentic coding project: ultimately < 100 lines of code were manually edited, while the final AI-generated PR stack was big.

* Baseline: If both of us had done it manually, I had already estimated similar times to completion, because the differences in task scope naturally matched our differences in proficiency for those tasks. Despite being new to agentic coding at the time, key aspects were in my favor: senior dev, 2 years of ~daily prompt engineering experience (tactics), a PhD in code synthesis (strategy), and a repo set up with various lint/type/test guardrails.

* Result: About the same 1-2 weeks for me as a junior vibe coder as it would have been for the manual coder.

* Experience: It was clear the first 1-2 weeks were slow due to onboarding myself and the codebase to agentic coding. That first week was especially rough. While I could get long runs going during it, I was doing them more confidently by week 2 and switching to figuring out how to run parallel agents. Near the end, I was doing multiple long runs in parallel on different tasks; I could maintain maybe 2-4, but managing more gets exhausting, especially when any are shorter runs.

Separately, we have a genAI team and a GPU Python team, both switching to agentic coding. The natural difference in prompt engineering experience between the teams seems to have individuals on one team picking it up faster than the other, when gauged by the ability to do long runs.

The initial experiment is part of why I view current agentic coding as being more about overall coding throughput than about latency within any specific task. If someone can reliably trigger long runs, do multiple in parallel, and avoid wasting time in interactive chatting, the difference is stark.
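
The "multiple long runs in parallel, capped at what you can supervise" pattern, as a hedged sketch; run_agent() and the task names are hypothetical stand-ins for however you kick off one agentic run:

    # Keep 2-4 long runs going at once; more than that gets hard to supervise.
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def run_agent(task: str) -> str:
        # Placeholder: in practice this drives an agent/CLI for 5-20+ minutes.
        return f"{task}: done"

    tasks = ["mcp-server rewrite", "ci tooling", "docs pass"]

    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {pool.submit(run_agent, t): t for t in tasks}
        for fut in as_completed(futures):
            print(fut.result())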

Likewise, and reinforced by both of the above cases, looking for throughput improvements in the first 2-3 weeks seems aggressive. A lot of blog posts seem to be from people new to high-quality prompting, tackling repos/tasks not set up for agents, and working in limited nights-and-weekends efforts. To get to clear throughput multipliers from multiple agents working in parallel on good long runs... I'd expect that to hit at month 2 or 3 when someone isn't as all-in as I was able to be. It's more like a skill plus setup, so it takes investment before you reap the rewards.