568 points PaulHoule | 3 comments
mike_hearn ◴[] No.44490340[source]
A good chance to bring up something I've been flagging to colleagues for a while now: with LLM agents we are very quickly going to become even more CPU-bottlenecked on testing performance than we are today, and every team I know of was bottlenecked on CI speed even before LLMs. There's no point having an agent that can write code 100x faster than a human if every change takes an hour to test.

Maybe I've just been unlucky in the past, but in most projects I worked on, a lot of developer time was wasted waiting for PRs to go green. Many runs end up bottlenecked on I/O or on the availability of workers, so changes can sit in queues for hours, or they flake out and everything has to start again.

As they get better, coding agents are going to be assigned simple tickets that they turn into green PRs, with the model reacting to test failures and fixing them as they go. This will make the CI bottleneck even worse.

It feels like there's a lot of low-hanging fruit in most projects' testing setups, but for some reason I've seen nearly no progress here for years. It feels like we kinda collectively got used to the idea that CI services are slow and expensive, then stopped trying to improve things. If anything, CI got a lot slower over time as people tried to make builds fully hermetic (so no inter-run caching) and moved them from dedicated on-prem hardware to expensive cloud VMs with slow I/O, which haven't got much faster over time.

Mercury is crazy fast, and in a few quick tests I did it created good and correct code. How will we make test execution keep up with it?

replies(28): >>44490408 #>>44490637 #>>44490652 #>>44490785 #>>44491195 #>>44491421 #>>44491483 #>>44491551 #>>44491898 #>>44492096 #>>44492183 #>>44492230 #>>44492386 #>>44492525 #>>44493236 #>>44493262 #>>44493392 #>>44493568 #>>44493577 #>>44495068 #>>44495946 #>>44496321 #>>44496534 #>>44497037 #>>44497707 #>>44498689 #>>44502041 #>>44504650 #
pamelafox ◴[] No.44495068[source]
For Python apps, I've gotten good CI speedups by moving over to the astral.sh toolchain, using uv for package installation with caching. Once I move to their type checker instead of mypy, that'll speed the CI up even more. The Playwright test runs will then probably be the slowest part, and that's only in apps with frontends.
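For anyone curious, the uv part is roughly the steps below in a GitHub Actions workflow. This is a minimal sketch, not copied from any particular repo; it assumes the astral-sh/setup-uv action, and the step names and pytest command are just illustrative:

    # illustrative sketch, assumes uv-managed project with a lockfile
    - name: Install uv with caching enabled
      uses: astral-sh/setup-uv@v5
      with:
        enable-cache: true
    - name: Install dependencies
      run: uv sync --all-extras
    - name: Run tests
      run: uv run pytest

If I remember right, the built-in cache keys off the lockfile by default, so warm runs skip most of the dependency install time.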

(Also, hi Mike, pretty sure I worked with you at Google Maps back in the early 2000s; you were my favorite SRE, so I trust your opinion on this!)

replies(1): >>44498050 #
1. mike_hearn ◴[] No.44498050[source]
Hi! :)

Astral's work is great but I wonder how they plan to become sustainable. Maybe it's one of those VC plays where they don't intend to ever really make money and it's essentially a productivity subsidy for the other startups.

My experience has been that during CI most apps are bottlenecked on CPU spent outside the app itself: in JIT runtimes, databases, browsers, or libraries they invoke. I guess now maybe models too. So implementation language won't necessarily make a huge difference here - we need fresh ideas for how to make order-of-magnitude improvements. They will probably vary between ecosystems.

replies(1): >>44503997 #
2. pamelafox ◴[] No.44503997[source]
Their plan is to offer hosted products, as described in their current job openings: https://jobs.ashbyhq.com/astral/a357ab40-9da5-4474-acc7-5888...

We'll see if that works out for them, but I also worked with their founder Charlie previously at Khan Academy, and I trust he sincerely wants to make that work.

That makes sense, that they're bottlenecked elsewhere as well.

For my current CI runs on Microsoft sample repos, mypy and Playwright are the two big time-takers, and since I run the CI on a matrix of Python versions, OSes, and Node versions, I do want it to be quite fast. You can see the timing here:

https://github.com/Azure-Samples/azure-search-openai-demo/ac...
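(That link is the real timing data; the snippet below is not the actual workflow file, just a rough sketch of the shape of that kind of matrix, with made-up versions and commands:

    # illustrative sketch only, not the repo's real workflow
    jobs:
      test:
        strategy:
          matrix:
            os: [ubuntu-latest, windows-latest, macos-latest]
            python-version: ["3.10", "3.11", "3.12"]
            node-version: ["18", "20"]
        runs-on: ${{ matrix.os }}
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-node@v4
            with:
              node-version: ${{ matrix.node-version }}
          - uses: astral-sh/setup-uv@v5
            with:
              python-version: ${{ matrix.python-version }}
          - run: uv run mypy .    # type checking runs in every matrix cell
          - run: uv run pytest    # tests also run in every cell

Every cell re-runs both steps, so the per-cell type-checking time adds up across the matrix.)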

replies(1): >>44507371 #
3. mike_hearn ◴[] No.44507371[source]
Yeah, for sample apps it makes sense that linting and type checking would be a much bigger part of the overall time. The tests only take 50 seconds to run whereas it takes 2-3 minutes for type checking! I can see why they can make a business out of optimizing that. I wonder if GraalPy executes it faster.

It backs up my point about how much low-hanging fruit is out there, though. Every point in the matrix re-does linting and type checking, although presumably that's only needed once.
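A rough sketch of what I mean, reusing the made-up workflow shape from upthread (job names, tools, and versions are illustrative, not anyone's real config): hoist linting and type checking into a single job and let only the tests fan out across the matrix.

    # illustrative sketch, not a real repo's workflow
    jobs:
      lint:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: astral-sh/setup-uv@v5
          - run: uv run ruff check .   # lint once
          - run: uv run mypy .         # type-check once
      test:
        needs: lint                    # matrix cells now only run tests
        strategy:
          matrix:
            os: [ubuntu-latest, windows-latest]
            python-version: ["3.11", "3.12"]
        runs-on: ${{ matrix.os }}
        steps:
          - uses: actions/checkout@v4
          - uses: astral-sh/setup-uv@v5
            with:
              python-version: ${{ matrix.python-version }}
          - run: uv run pytest

If the lint and type-check results really are independent of OS and interpreter version, that shaves the 2-3 minutes off every matrix cell except one. Whether to gate tests on lint with needs: or run them in parallel is a separate trade-off: gating saves compute on red PRs, running in parallel keeps wall-clock time down.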