mike_hearn:
A good chance to bring up something I've been flagging to colleagues for a while now: with LLM agents we are very quickly going to become even more CPU-bottlenecked on testing than we already are, and every team I know of was bottlenecked on CI speed even before LLMs. There's no point having an agent that can write code 100x faster than a human if every change takes an hour to test.

Maybe I've just been unlucky in the past, but in most projects I've worked on, a lot of developer time was wasted waiting for PRs to go green. Many runs end up bottlenecked on I/O or worker availability, so changes can sit in queues for hours, or they flake out and everything has to start again.

As they get better, coding agents are going to be assigned simple tickets that they turn into green PRs, with the model reacting to test failures and fixing them as it goes. This will make the CI bottleneck even worse.

It feels like there's a lot of low-hanging fruit in most projects' testing setups, but for some reason I've seen nearly no progress here for years. It feels like we kinda collectively got used to the idea that CI services are slow and expensive, and stopped trying to improve things. If anything CI got a lot slower over time, as people tried to make builds fully hermetic (so no inter-run caching) and moved them from dedicated on-prem hardware to expensive cloud VMs with slow I/O that hasn't got much faster over the years.

Mercury is crazy fast and, in the few quick tests I did, produced good, correct code. How will we make test execution keep up with it?

kccqzy:
> Maybe I've just got unlucky in the past, but in most projects I worked on a lot of developer time was wasted on waiting for PRs to go green.

I don't understand this. Developer time is so much more expensive than machine time. Do companies not just double their CI workers after hearing people complain? It's just a throw-more-resources problem. When I was at Google, it was somewhat common for me to debug non-deterministic bugs, such as a missing synchronization or fence causing flakiness, and it was common to just launch 10000 copies of the same test on 10000 machines to find perhaps a single-digit number of failures. My current employer has a clunkier implementation of the same thing (no UI), but there's also a single command to launch 1000 test workers to run all tests from your own checkout. The goal is to finish testing a 1M-LOC codebase in no more than five minutes so that you get quick feedback on your changes.
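
A minimal sketch of that brute-force flake-hunting trick at laptop scale, assuming the test is runnable as a shell command (the command, run count, and worker count below are illustrative, not any specific company's tooling):

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    TEST_CMD = ["pytest", "tests/test_sync.py", "-q"]  # hypothetical test target
    RUNS = 1000    # a real worker fleet would push this to 10000+
    WORKERS = 32   # parallel local processes standing in for machines

    def run_once(i: int) -> bool:
        # Each run is independent; a nonzero exit code counts as a failure.
        return subprocess.run(TEST_CMD, capture_output=True).returncode != 0

    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        failures = sum(pool.map(run_once, range(RUNS)))

    # A handful of failures out of ~1000 otherwise-green runs is the classic
    # signature of a race or missing fence rather than a broken test.
    print(f"{failures}/{RUNS} runs failed")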

> make builds fully hermetic (so no inter-run caching)

These are orthogonal. You want CI steps to be maximally deterministic precisely so that you can make builds fully hermetic and cache every single thing: if a step's output depends only on its declared inputs, caching it is always safe.
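
A sketch of that input-hashing idea, roughly what Bazel-style remote caches do under the hood (the helper names and usage here are made up for illustration):

    import hashlib
    import json

    def cache_key(input_files: dict[str, bytes], tool_version: str, args: list[str]) -> str:
        # Hash every declared input. Any hidden dependency (wall clock,
        # network, ambient env var) breaks the "same key => same output"
        # promise that makes the cache safe.
        h = hashlib.sha256()
        h.update(tool_version.encode())
        h.update(json.dumps(args).encode())
        for name in sorted(input_files):
            h.update(name.encode())
            h.update(input_files[name])
        return h.hexdigest()

    # Hypothetical usage: skip the step entirely on a cache hit.
    # key = cache_key(inputs, "gcc-13.2", ["-O2", "-c", "foo.c"])
    # if (out := cache.get(key)) is None:
    #     out = run_step(inputs, args)
    #     cache.put(key, out)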

wavemode:
You're confusing throughput and latency. Lengthy CI runs increase the latency of developer output, but they don't significantly reduce overall throughput, given that a developer is typically working on multiple things at once and can just switch tasks while CI is running. The productivity cost of CI is not zero, but it's way, way less than the raw wall-clock time spent per run.
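
A back-of-envelope version of that distinction, with invented numbers (1 hour of CI latency per PR, 15 minutes of hands-on work each, an 8-hour day):

    CI_LATENCY_H = 1.0    # hypothetical wall-clock CI time per PR
    WORK_PER_PR_H = 0.25  # hypothetical hands-on dev time per PR

    # Serial developer: waits for every run, so latency caps throughput.
    serial_prs_per_day = 8 / (WORK_PER_PR_H + CI_LATENCY_H)  # ~6.4 PRs

    # Pipelined developer: switches tasks while CI runs, so only the
    # hands-on work is the limit (ignoring context-switch overhead).
    pipelined_prs_per_day = 8 / WORK_PER_PR_H                # 32 PRs

    print(serial_prs_per_day, pipelined_prs_per_day)

The gap between those two numbers is the claim here; the reply below argues that context switching and merge conflicts eat into the pipelined figure.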

Then also factor in that most developer tasks are not even bottlenecked by CI. They are bottlenecked primarily by code review and secondarily by deployment.

mike_hearn:
Lengthy CI runs do reduce throughput: working around high CI latency pushes people towards juggling more PRs at once, which means more merge conflicts to deal with, and it raises the cost of a build failing transiently.

And context switching isn't free by any means.

Still, if LLM agents keep improving, then the bottleneck of waiting on code review won't exist for the agents themselves; there'll just be a stream of always-green branches waiting for someone to review and merge. CI costs will still matter, though.

wavemode:
Yes, my comment explicitly states that the cost is not zero.