Mercury: Ultra-fast language models based on diffusion

Show context

mike_hearn ◴[07 Jul 25 13:46 UTC] No.44490340[source]▶

A good chance to bring up something I've been flagging to colleagues for a while now: with LLM agents we are very quickly going to become even more CPU bottlenecked on testing performance than today, and every team I know of today was bottlenecked on CI speed even before LLMs. There's no point having an agent that can write code 100x faster than a human if every change takes an hour to test.

Maybe I've just got unlucky in the past, but in most projects I worked on a lot of developer time was wasted on waiting for PRs to go green. Many runs end up bottlenecked on I/O or availability of workers, and so changes can sit in queues for hours, or they flake out and everything has to start again.

As they get better coding agents are going to be assigned simple tickets that they turn into green PRs, with the model reacting to test failures and fixing them as they go. This will make the CI bottleneck even worse.

It feels like there's a lot of low hanging fruit in most project's testing setups, but for some reason I've seen nearly no progress here for years. It feels like we kinda collectively got used to the idea that CI services are slow and expensive, then stopped trying to improve things. If anything CI got a lot slower over time as people tried to make builds fully hermetic (so no inter-run caching), and move them from on-prem dedicated hardware to expensive cloud VMs with slow IO, which haven't got much faster over time.

Mercury is crazy fast and in a few quick tests I did, created good and correct code. How will we make test execution keep up with it?

replies(28): >>44490408 #>>44490637 #>>44490652 #>>44490785 #>>44491195 #>>44491421 #>>44491483 #>>44491551 #>>44491898 #>>44492096 #>>44492183 #>>44492230 #>>44492386 #>>44492525 #>>44493236 #>>44493262 #>>44493392 #>>44493568 #>>44493577 #>>44495068 #>>44495946 #>>44496321 #>>44496534 #>>44497037 #>>44497707 #>>44498689 #>>44502041 #>>44504650 #

kccqzy ◴[07 Jul 25 14:20 UTC] No.44490652[source]▶

>>44490340 #

> Maybe I've just got unlucky in the past, but in most projects I worked on a lot of developer time was wasted on waiting for PRs to go green.

I don't understand this. Developer time is so much more expensive than machine time. Do companies not just double their CI workers after hearing people complain? It's just a throw-more-resources problem. When I was at Google, it was somewhat common for me to debug non-deterministic bugs such as a missing synchronization or fence causing flakiness; and it was common to just launch 10000 copies of the same test on 10000 machines to find perhaps a single digit number of failures. My current employer has a clunkier implementation of the same thing (no UI), but there's also a single command to launch 1000 test workers to run all tests from your own checkout. The goal is to finish testing a 1M loc codebase in no more than five minutes so that you get quick feedback on your changes.

> make builds fully hermetic (so no inter-run caching)

These are orthogonal. You want maximum deterministic CI steps so that you make builds fully hermetic and cache every single thing.

replies(16): >>44490726 #>>44490764 #>>44491015 #>>44491034 #>>44491088 #>>44491949 #>>44491953 #>>44492546 #>>44493309 #>>44494481 #>>44494583 #>>44495174 #>>44496510 #>>44497007 #>>44500400 #>>44513737 #

IshKebab ◴[07 Jul 25 14:59 UTC] No.44491088[source]▶

>>44490652 #

Developer time is more expensive than machine time, but at most companies it isn't 10000x more expensive. Google is likely an exception because it pays extremely well and has access to very cheap machines.

Even then, there are other factors:

* You might need commercial licenses. It may be very cheap to run open source code 10000x, but guess how much 10000 Questa licenses cost.

* Moores law is dead Amdahl's law very much isn't. Not everything is embarrassingly parallel.

* Some people care about the environment. I worked at a company that spent 200 CPU hours on every single PR (even to fix typos; I failed to convince them they were insane for not using Bazel or similar). That's a not insignificant amount of CO2.

replies(3): >>44491885 #>>44492841 #>>44498020 #

underdeserver ◴[07 Jul 25 16:19 UTC] No.44491885[source]▶

>>44491088 #

That's solvable with modern cloud offerings - Provision spot instances for a few minutes and shut them down afterwards. Let the cloud provider deal with demand balancing.

I think the real issue is that developers waiting for PRs to go green are taking a coffee break between tasks, not sitting idly getting annoyed. If that's the case you're cutting into rest time and won't get much value out of optimizing this.

replies(1): >>44493258 #

IshKebab ◴[07 Jul 25 18:25 UTC] No.44493258[source]▶

>>44491885 #

Both companies I've worked in recently have been too paranoid about IP to use the cloud for CI.

Anyway I don't see how that solves any of the issues except maybe cost to some degree (but maybe not; cloud is expensive).

replies(3): >>44493548 #>>44495106 #>>44495112 #

simonw ◴[07 Jul 25 22:03 UTC] No.44495106[source]▶

>>44493258 #

Were they running CI on their own physical servers under a desk or in a basement somewhere, or renting their own racks in a data center just for CI?

replies(2): >>44497817 #>>44504352 #