
mike_hearn (No.44490340)
A good chance to bring up something I've been flagging to colleagues for a while now: with LLM agents we are very quickly going to become even more CPU-bottlenecked on testing performance than we are today, and every team I know of was bottlenecked on CI speed even before LLMs. There's no point having an agent that can write code 100x faster than a human if every change takes an hour to test.

Maybe I've just got unlucky in the past, but in most projects I've worked on, a lot of developer time was wasted waiting for PRs to go green. Many runs end up bottlenecked on I/O or the availability of workers, so changes can sit in queues for hours, or they flake out and everything has to start again.

As they get better, coding agents are going to be assigned simple tickets that they turn into green PRs, with the model reacting to test failures and fixing them as it goes. This will make the CI bottleneck even worse.

It feels like there's a lot of low-hanging fruit in most projects' testing setups, but for some reason I've seen nearly no progress here for years. It feels like we collectively got used to the idea that CI services are slow and expensive, and then stopped trying to improve things. If anything, CI got a lot slower over time as people tried to make builds fully hermetic (so no inter-run caching) and moved them from on-prem dedicated hardware to expensive cloud VMs with slow I/O, which haven't got much faster over time.
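Hermetic and cached aren't mutually exclusive, to be fair: a shared remote build cache keeps builds reproducible (entries are keyed on the full set of task inputs) while letting runs reuse each other's outputs. A minimal sketch, assuming a Gradle-based project; the cache URL is a placeholder:

    // settings.gradle.kts -- illustrative sketch, not from any real project
    buildCache {
        local {
            isEnabled = true                        // per-machine cache for local runs
        }
        remote<HttpBuildCache> {
            url = uri("https://ci-cache.example.com/cache/")   // hypothetical endpoint
            isPush = System.getenv("CI") != null    // only CI runs populate the shared cache
        }
    }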

Mercury is crazy fast and, in a few quick tests I did, it produced good, correct code. How will we make test execution keep up with it?

gdiamos (No.44493568)
This sounds like a strawman.

GPUs can do 1 million trillion instructions per second.

Are you saying it’s impossible to write a test that finishes in less than one second on that machine?

Is that a fundamental limitation or an incredibly inefficient test?

mike_hearn (No.44497993)
It's amazing how easy it is to write tests that are slow. Taking >1 second per test is absolutely normal.

> Is that a fundamental limitation or an incredibly inefficient test?

That's the million-dollars-a-month question. If an LLM can diffuse a patch in 3 seconds but it takes 3 hours to test it, we have a problem, especially if the LLM needs more test feedback than a human would. But is that a fundamental problem, or "just" a matter of effort?

I've mostly worked with JVM-based apps in recent years, and there's a lot of low-hanging fruit in their tests. JIT compilation is both a blessing and a curse: you don't spend any time compiling the tests themselves to machine code up front, but the code that does get JIT-compiled is forgotten between runs, and build systems like to test different modules in separate processes, so every test run of every module starts with a slow warmup. There's a lot of work being done at the moment to improve that situation, but much of it boils down to poor build systems, and that's harder to fix (nobody agrees on what a good build system looks like...)
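Concretely, some of that warmup waste can be clawed back at the build-tool level. A minimal sketch of the kind of thing I mean, assuming Gradle; the numbers and flags are illustrative rather than a recommendation:

    // build.gradle.kts -- sketch only
    tasks.withType<Test>().configureEach {
        // Spread test classes across a few long-lived worker JVMs...
        maxParallelForks = (Runtime.getRuntime().availableProcessors() / 2).coerceAtLeast(1)
        // ...and never recycle them mid-run, so JIT warmup isn't thrown away per batch of classes.
        forkEvery = 0L
        // Short-lived test JVMs often finish sooner with less aggressive JIT tiering.
        jvmArgs("-XX:TieredStopAtLevel=1")
        // Surface the first failure immediately instead of waiting for the whole suite.
        failFast = true
    }

None of that fixes the cross-process, cross-module warmup problem, though; that part really is down to the build system.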

In one of my current projects I've made the entire test suite run in parallel at the level of individual test classes. It took a bit of work to stop different tests trampling each other's state inside the database, and it revealed some genuine race conditions where apparently unrelated features interacted in buggy ways, but it was definitely worth it for local testing. Unfortunately, the CI configuration was written in such a way that it starts by compiling one of the project's dependencies, which blows up test time to the point where improvements to the actual tests are nearly irrelevant. That CI system is non-standard/in-house, and I haven't figured out how to fix it yet.
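For anyone wanting to do the same, this is roughly the shape of it with JUnit 5 (assumed; the class, schema and table names below are made up). Class-level parallelism is switched on in src/test/resources/junit-platform.properties, and each test class claims its own database schema so concurrently running classes can't trample each other's rows:

    // Illustrative sketch only -- not the actual project code.
    //
    // src/test/resources/junit-platform.properties:
    //   junit.jupiter.execution.parallel.enabled=true
    //   junit.jupiter.execution.parallel.mode.default=same_thread
    //   junit.jupiter.execution.parallel.mode.classes.default=concurrent
    import org.junit.jupiter.api.Assertions.assertEquals
    import org.junit.jupiter.api.BeforeAll
    import org.junit.jupiter.api.Test
    import org.junit.jupiter.api.TestInstance
    import java.sql.Connection
    import java.sql.DriverManager

    @TestInstance(TestInstance.Lifecycle.PER_CLASS)
    class OrderServiceTest {   // hypothetical test class
        private lateinit var db: Connection

        @BeforeAll
        fun createIsolatedSchema() {
            // One shared in-memory H2 database (driver assumed on the test classpath),
            // but a private schema per test class so parallel classes don't collide.
            db = DriverManager.getConnection("jdbc:h2:mem:testdb;DB_CLOSE_DELAY=-1")
            val schema = "TEST_" + this::class.simpleName!!.uppercase()
            db.createStatement().use {
                it.execute("CREATE SCHEMA IF NOT EXISTS $schema")
                it.execute("SET SCHEMA $schema")
            }
        }

        @Test
        fun placingAnOrderWritesExactlyOneRow() {
            db.createStatement().use {
                it.execute("CREATE TABLE orders(id INT PRIMARY KEY)")
                it.execute("INSERT INTO orders VALUES (1)")
                val rows = it.executeQuery("SELECT COUNT(*) FROM orders")
                rows.next()
                assertEquals(1, rows.getInt(1))
            }
        }
    }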

This kind of story is typical. Many such cases.