
566 points | PaulHoule
mike_hearn:
A good chance to bring up something I've been flagging to colleagues for a while now: with LLM agents we are very quickly going to become even more CPU-bottlenecked on testing than we are today, and every team I know was bottlenecked on CI speed even before LLMs. There's no point having an agent that can write code 100x faster than a human if every change takes an hour to test.

Maybe I've just been unlucky in the past, but in most projects I worked on, a lot of developer time was wasted waiting for PRs to go green. Many runs end up bottlenecked on I/O or worker availability, so changes can sit in queues for hours, or they flake out and everything has to start again.

As they get better, coding agents are going to be assigned simple tickets that they turn into green PRs, with the model reacting to test failures and fixing them as it goes. This will make the CI bottleneck even worse.

It feels like there's a lot of low-hanging fruit in most projects' testing setups, but for some reason I've seen nearly no progress here for years. It feels like we kinda collectively got used to the idea that CI services are slow and expensive, then stopped trying to improve things. If anything, CI got a lot slower over time as people tried to make builds fully hermetic (so no inter-run caching) and moved them from dedicated on-prem hardware to expensive cloud VMs with slow I/O, which haven't gotten much faster over time.

Mercury is crazy fast, and in a few quick tests I did it produced good, correct code. How will we make test execution keep up with it?

kccqzy:
> Maybe I've just got unlucky in the past, but in most projects I worked on a lot of developer time was wasted on waiting for PRs to go green.

I don't understand this. Developer time is so much more expensive than machine time. Do companies not just double their CI workers after hearing people complain? It's a throw-more-resources problem. When I was at Google, it was somewhat common for me to debug non-deterministic bugs, such as flakiness caused by a missing synchronization or fence, and it was common to just launch 10,000 copies of the same test on 10,000 machines to find perhaps a single-digit number of failures. My current employer has a clunkier implementation of the same thing (no UI), but there's also a single command to launch 1,000 test workers to run all tests from your own checkout. The goal is to finish testing a 1M-line codebase in no more than five minutes so that you get quick feedback on your changes.
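(A minimal sketch of that pattern, scaled down to one machine: run the same test binary many times in parallel and count the failures. The command, run count, and worker count below are hypothetical stand-ins, not any real internal tool; the setups described above fan the runs out across thousands of machines instead.)

    # flake_hunt.py -- run one test many times in parallel, count failures
    import concurrent.futures
    import subprocess
    import sys

    TEST_CMD = ["./my_test"]   # hypothetical test binary
    RUNS = 10_000              # total attempts
    WORKERS = 64               # concurrent runs (one machine's worth)

    def run_once(_: int) -> bool:
        """Run the test once; True means it passed."""
        return subprocess.run(TEST_CMD, capture_output=True).returncode == 0

    def main() -> None:
        with concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as pool:
            failures = sum(not ok for ok in pool.map(run_once, range(RUNS)))
        print(f"{failures} failures out of {RUNS} runs")
        sys.exit(1 if failures else 0)

    if __name__ == "__main__":
        main()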

> make builds fully hermetic (so no inter-run caching)

These are orthogonal. You want CI steps to be maximally deterministic, and that's precisely what lets you make builds fully hermetic and cache every single thing.
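(To make that concrete, here's a minimal sketch in Python of why determinism is what makes caching safe: if a step's output is a pure function of its declared inputs, a hash of those inputs is a valid cache key across runs and machines. All names here are hypothetical; content-addressed systems like Bazel's remote cache are built on this principle.)

    # build_cache.py -- content-addressed caching for a hermetic build step
    import hashlib
    import subprocess
    from pathlib import Path

    CACHE_DIR = Path(".cache")  # stand-in for a shared/remote cache

    def cache_key(cmd: list[str], inputs: list[Path]) -> str:
        # Hermetic assumption: output depends only on the command and the
        # contents of the declared inputs -- never timestamps or machine state.
        h = hashlib.sha256(" ".join(cmd).encode())
        for path in sorted(inputs):
            h.update(path.read_bytes())
        return h.hexdigest()

    def run_step(cmd: list[str], inputs: list[Path], output: Path) -> None:
        CACHE_DIR.mkdir(exist_ok=True)
        cached = CACHE_DIR / cache_key(cmd, inputs)
        if cached.exists():                      # hit: reuse the old output
            output.write_bytes(cached.read_bytes())
            return
        subprocess.run(cmd, check=True)          # miss: actually do the work
        cached.write_bytes(output.read_bytes())  # publish for future runs

If a step isn't hermetic, the same key can map to different outputs and the cache silently serves stale results, which is why hermeticity and caching go together rather than pulling apart.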

mark_undoio:
> I don't understand this. Developer time is so much more expensive than machine time. Do companies not just double their CI workers after hearing people complain? It's just a throw-more-resources problem.

I'd personally agree. But this sounds like the kind of thing that, at many companies, could be a real challenge.

Ultimately, you can measure dollars spent on CI workers. It's much harder and less direct to quantify the cost of not having them (until, for instance, people start taking shortcuts with testing and a regression escapes to production).

Unless somebody has a strong overriding vision of where the value really comes from, that kind of asymmetry tends to result in penny-pinching on the wrong things.

mike_hearn:
It's more than that. You can measure salaries too; measurement isn't the issue.

The problem is that if you let people spend the company's money without any checks or balances, they'll just blow through unlimited amounts of it. That's why companies always have lots of procedures and policies around expense reporting. There's no upper limit to how much money developers will spend on cloud hardware given the chance, as the example above of casually running a test 10,000 times in parallel demonstrates nicely.

CI doesn't require you to fill out an expense report every time you run a PR, thank goodness, but there still has to be a way to limit financial liability. Usually companies do start out by doubling cluster sizes a few times, but each doubling buys a few months and then the complaints return. After a few rounds of this, managers realize that demand is unlimited and start pushing back on ever-increasing budgets. Devs get annoyed and spend an afternoon on optimizations, and suddenly times are good again.

The meme on HN is that developer time is always more expensive than machine time, but I've been on both sides of this and seen how the budgets work out. It's often not true, especially if you use clouds like Azure, which are overloaded and expensive, or have plenty of junior devs and/or teams outside the US where salaries are lower. There's often a lot of low-hanging fruit in test times, so it can make sense to optimize. Even so, huge waste is still the order of the day.