
566 points PaulHoule | 2 comments
mike_hearn ◴[] No.44490340[source]
A good chance to bring up something I've been flagging to colleagues for a while now: with LLM agents we are very quickly going to become even more CPU bottlenecked on testing performance than today, and every team I know of today was bottlenecked on CI speed even before LLMs. There's no point having an agent that can write code 100x faster than a human if every change takes an hour to test.

Maybe I've just got unlucky in the past, but in most projects I worked on a lot of developer time was wasted on waiting for PRs to go green. Many runs end up bottlenecked on I/O or availability of workers, and so changes can sit in queues for hours, or they flake out and everything has to start again.

As they get better, coding agents are going to be assigned simple tickets that they turn into green PRs, with the model reacting to test failures and fixing them as they go. This will make the CI bottleneck even worse.

It feels like there's a lot of low hanging fruit in most projects' testing setups, but for some reason I've seen nearly no progress here for years. It feels like we kinda collectively got used to the idea that CI services are slow and expensive, then stopped trying to improve things. If anything CI got a lot slower over time as people tried to make builds fully hermetic (so no inter-run caching), and moved them from on-prem dedicated hardware to expensive cloud VMs with slow IO, which haven't got much faster over time.

Mercury is crazy fast and, in the few quick tests I did, produced good and correct code. How will we make test execution keep up with it?

replies(28): >>44490408 #>>44490637 #>>44490652 #>>44490785 #>>44491195 #>>44491421 #>>44491483 #>>44491551 #>>44491898 #>>44492096 #>>44492183 #>>44492230 #>>44492386 #>>44492525 #>>44493236 #>>44493262 #>>44493392 #>>44493568 #>>44493577 #>>44495068 #>>44495946 #>>44496321 #>>44496534 #>>44497037 #>>44497707 #>>44498689 #>>44502041 #>>44504650 #
1. hansvm ◴[] No.44497037[source]
- Just spin up more test instances. If the AI is as good as people claim then it's still way cheaper than extra programmers.

- Write fast code. At $WORK we can test roughly a trillion things per CPU physical core year for our primary workload, and that's in a domain where 20 microsecond processing time is unheard of. Orders of magnitude speed improvements pay dividends quickly.
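A quick sanity check on that figure: a trillion tests per physical core per year works out to a per-test budget of roughly 31 microseconds, which is consistent with the claim that 20 µs of processing time is unusually slow in that domain.

```python
# Back-of-the-envelope check on "a trillion tests per CPU core-year":
# how much wall time does each test get on one physical core?
SECONDS_PER_YEAR = 365 * 24 * 3600        # ~3.15e7 seconds
TESTS_PER_CORE_YEAR = 1_000_000_000_000   # 1e12 tests

budget_us = SECONDS_PER_YEAR / TESTS_PER_CORE_YEAR * 1e6
print(f"{budget_us:.1f} microseconds per test")  # prints "31.5 microseconds per test"
```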

- LLMs don't care hugely about the language. Avoid things like Rust where compile times are always a drag.

- That's something of a strange human problem you're describing. Once the PR is reviewed, can't you just hit "auto-merge" and go to the next task, only circling back if the code was broken? Why is that a significant amount of developer time?

- The thing you're observing is something every growing team witnesses. You can get 90% of the way to what you want by giving the build system a greenfield re-write. If you really have to run 100x more tests, it's worth a day or ten sanity checking Docker caching or whatever it is your CI/CD is using. Even hermetic builds have inter-run caching in some form; it's just more work to specify how the caches should work. Put your best engineer on the problem. It's important.

- Be as specific as possible in describing test dependencies. The fastest tests are the ones which don't run.

- Separate out unit tests from other forms of tests. It's hard to write software operating across many orders of magnitude of runtime, and tests are no exception. Your life is easier if conceptually they have a separate budget (e.g., continuous fuzz testing or load testing or whatever). Unit tests can then easily be fast enough for a developer to run all the changed ones on precommit. Slower tests are run locally when you think they might apply. The net effect is that you don't have the sort of back-and-forth with your CI that actually causes lost developer productivity, because the PR shouldn't have a bunch of bullshit that's green locally and failing remotely.
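The budget-separation idea above can be sketched as a simple tiering scheme: tag each test with a tier, run only the cheap tier on precommit, and leave the expensive tiers to CI. Test names, tiers, and timings below are illustrative assumptions (with pytest you'd express the same thing via markers and `-m "not slow"`):

```python
# Illustrative test registry: (name, tier, typical runtime in seconds).
TESTS = [
    ("test_tokenize", "unit", 0.002),
    ("test_roundtrip", "unit", 0.010),
    ("test_fuzz_parser", "fuzz", 120.0),
    ("test_load_10k_users", "load", 300.0),
]

def select(tiers):
    """Pick the tests belonging to the given tiers."""
    return [name for name, tier, _ in TESTS if tier in tiers]

precommit = select({"unit"})                  # fast: run on every commit
full_ci = select({"unit", "fuzz", "load"})    # everything: run in CI

print(precommit)  # ['test_tokenize', 'test_roundtrip']
```

The point is that the precommit set stays small enough that running it never tempts a developer to skip it, so "green locally, red remotely" surprises become rare.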

replies(1): >>44497883 #
2. mike_hearn ◴[] No.44497883[source]
These are all good suggestions, albeit many are hard to implement in practice.

> That's something of a strange human problem you're describing.

Are we talking about agent-written changes now, or human? Normally reviewers expect tests to pass before they review something, otherwise the work might change significantly after the review in order to fix broken tests. Auto merges can fail due to changes that happened in the meantime, so they aren't auto in many cases.

Once latency goes beyond a minute or two, people get distracted and start switching to other tasks, which slows everything down. And yes, code review latency is a problem as well, but there are easier fixes for that.