Mercury: Ultra-fast language models based on diffusion

A good chance to bring up something I've been flagging to colleagues for a while now: with LLM agents we are very quickly going to become even more CPU-bottlenecked on testing performance than we are today, and every team I know of was already bottlenecked on CI speed before LLMs. There's no point having an agent that can write code 100x faster than a human if every change takes an hour to test.

Maybe I've just been unlucky in the past, but in most projects I worked on, a lot of developer time was wasted waiting for PRs to go green. Many runs end up bottlenecked on I/O or on availability of workers, so changes can sit in queues for hours, or they flake out and everything has to start again.

As they get better, coding agents are going to be assigned simple tickets that they turn into green PRs, with the model reacting to test failures and fixing them as it goes. This will make the CI bottleneck even worse.

It feels like there's a lot of low-hanging fruit in most projects' testing setups, but for some reason I've seen almost no progress here for years. It feels like we kinda collectively got used to the idea that CI services are slow and expensive, then stopped trying to improve things. If anything, CI got a lot slower over time as people tried to make builds fully hermetic (so no inter-run caching) and moved them from dedicated on-prem hardware to expensive cloud VMs with slow I/O, which haven't got much faster over time.
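
To make the inter-run caching point concrete, here is a rough sketch of the idea: key each test result on a hash of the test's inputs, and only re-run when that hash changes. Everything here (`input_digest`, `ResultCache`, the in-memory dict) is made up for illustration; a real setup would need to track every input (toolchain, environment, data files) and keep results in shared storage, which is exactly what build systems with remote caches try to do.

```python
import hashlib


def input_digest(files: dict[str, bytes]) -> str:
    """Hash a test's inputs (path -> content) into one stable key."""
    h = hashlib.sha256()
    for path in sorted(files):  # sort so dict ordering does not matter
        for chunk in (path.encode(), files[path]):
            # Length-prefix each chunk so adjacent fields cannot blur together.
            h.update(len(chunk).to_bytes(8, "big"))
            h.update(chunk)
    return h.hexdigest()


class ResultCache:
    """Hypothetical in-memory cache of test results, keyed by input hash."""

    def __init__(self):
        self._results = {}
        self.hits = 0
        self.misses = 0

    def run(self, test_name, files, test_fn):
        key = (test_name, input_digest(files))
        if key in self._results:
            self.hits += 1  # inputs unchanged: reuse the earlier result
            return self._results[key]
        self.misses += 1  # new or changed inputs: actually run the test
        result = test_fn()
        self._results[key] = result
        return result
```

This only pays off if tests are deterministic given their declared inputs, so flaky tests would need to be fixed or excluded first.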

Mercury is crazy fast, and in the few quick tests I did it produced good, correct code. How will we make test execution keep up with it?

100% agree.

One of the core premises of what we've been trying to do with our product (Testkube) is to decouple testing from CI/CD systems. Those were never built with testing in mind, let alone for scaling to hundreds or thousands of efficient executions. We have a lightweight open-source agent that lives inside a K8s cluster; tests are stored as CRDs cloned from your Git repo and executed as K8s jobs. You can create whatever heuristics or parallelization you need, leverage K8s to dynamically scale compute resources, trigger executions by whatever means (GitHub Actions, K8s events, schedules, etc.), and do it all on your existing infra.
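
As an example of the kind of parallelization heuristic meant here (this is a generic sketch, not Testkube's API): if you have historical durations per test, you can split the suite across N workers greedily, always handing the next-longest test to the least-loaded worker. The function name `shard_tests` and the duration numbers are assumptions for illustration.

```python
import heapq


def shard_tests(durations: dict[str, float], workers: int) -> list[list[str]]:
    """Greedy longest-first split of tests across a fixed number of workers."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Min-heap of (current load in seconds, worker index).
    heap = [(0.0, i) for i in range(workers)]
    shards: list[list[str]] = [[] for _ in range(workers)]
    # Longest tests first; ties broken by name so the split is deterministic.
    for name, secs in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)  # least-loaded worker so far
        shards[idx].append(name)
        heapq.heappush(heap, (load + secs, idx))
    return shards


# Made-up durations in seconds, split across two workers.
example = {"a": 10, "b": 8, "c": 5, "d": 3, "e": 2}
print(shard_tests(example, 2))  # [['a', 'd', 'e'], ['b', 'c']]
```

Each shard could then be run as its own job. The split is not guaranteed to be optimal, but it is cheap and usually close enough when durations are reasonably stable.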

Admittedly, we don't solve the test creation problem. If there are new tools out there which could automagically generate tests along with code, please share.