
568 points PaulHoule | 2 comments
mike_hearn ◴[] No.44490340[source]
A good chance to bring up something I've been flagging to colleagues for a while now: with LLM agents we are very quickly going to become even more CPU-bottlenecked on testing than we already are, and every team I know of was bottlenecked on CI speed even before LLMs. There's no point having an agent that can write code 100x faster than a human if every change takes an hour to test.

Maybe I've just got unlucky in the past, but in most projects I worked on a lot of developer time was wasted on waiting for PRs to go green. Many runs end up bottlenecked on I/O or availability of workers, and so changes can sit in queues for hours, or they flake out and everything has to start again.

As they get better, coding agents are going to be assigned simple tickets that they turn into green PRs, with the model reacting to test failures and fixing them as they go. This will make the CI bottleneck even worse.

It feels like there's a lot of low hanging fruit in most projects' testing setups, but for some reason I've seen nearly no progress here for years. It feels like we kinda collectively got used to the idea that CI services are slow and expensive, then stopped trying to improve things. If anything CI got a lot slower over time as people tried to make builds fully hermetic (so no inter-run caching) and moved them from on-prem dedicated hardware to expensive cloud VMs with slow IO, which haven't got much faster over time.

Mercury is crazy fast and, in a few quick tests I did, created good and correct code. How will we make test execution keep up with it?

replies(28): >>44490408 #>>44490637 #>>44490652 #>>44490785 #>>44491195 #>>44491421 #>>44491483 #>>44491551 #>>44491898 #>>44492096 #>>44492183 #>>44492230 #>>44492386 #>>44492525 #>>44493236 #>>44493262 #>>44493392 #>>44493568 #>>44493577 #>>44495068 #>>44495946 #>>44496321 #>>44496534 #>>44497037 #>>44497707 #>>44498689 #>>44502041 #>>44504650 #
mrkeen ◴[] No.44493577[source]
> Maybe I've just got unlucky in the past, but in most projects I worked on a lot of developer time was wasted on waiting for PRs to go green. Many runs end up bottlenecked on I/O or availability of workers

No, this is common. The devs just haven't grokked dependency inversion. And I think the rate of new devs entering the workforce will keep it that way forever.

Here's how to make it slow:

* Always refer to "the database". You're not just storing and retrieving objects from anywhere - you're always using the database.

* Work with statements, not expressions. Instead of "the balance is the sum of the transactions", execute several transaction writes (to the database) and read back the resulting balance. This will force you to sequentialise the tests (simultaneous tests would otherwise race and cause flakiness), plus you get to write a bunch of setup and teardown and wipe state between tests. (A sketch contrasting the two styles follows this list.)

* If you've done the above, you'll probably need to wait for state changes before running an assertion. Use a thread sleep, and if the test is ever flaky, bump up the sleep time and commit it if the test goes green again.
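
To make the contrast concrete, here is a minimal sketch of the "expression" style, assuming illustrative names (Transaction, Ledger) that are not from the thread: the balance rule is a pure function over the transactions, so it can be unit-tested in memory with no database, no sequencing, no setup/teardown and no sleeps.

    // A minimal sketch with made-up names (Transaction, Ledger), not from the thread:
    // the balance rule is a pure expression over the transactions.
    using System;
    using System.Collections.Generic;
    using System.Linq;

    public record Transaction(DateTime At, decimal Amount);

    public static class Ledger
    {
        // "The balance is the sum of the transactions" as a single expression.
        public static decimal Balance(IEnumerable<Transaction> transactions) =>
            transactions.Sum(t => t.Amount);
    }

    public static class LedgerTests
    {
        public static void Main()
        {
            var txns = new[]
            {
                new Transaction(DateTime.UtcNow, 100m),
                new Transaction(DateTime.UtcNow, -40m),
                new Transaction(DateTime.UtcNow, 15m),
            };

            // No database round-trips, no shared state to wipe, nothing to sleep on,
            // so tests like this can run in parallel without racing each other.
            if (Ledger.Balance(txns) != 75m)
                throw new Exception("balance rule is wrong");

            Console.WriteLine("ok");
        }
    }

The database-backed part then shrinks to a thin check that transactions are written and read back faithfully, which can live in a much smaller, separately-run suite.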

replies(2): >>44496355 #>>44497424 #
1. zbentley ◴[] No.44496355[source]
> Instead of "the balance is the sum of the transactions", execute several transaction writes (to the database) and read back the resulting balance

Er, doesn’t this boil down to saying “not testing database end state (trusting in transactionality) is faster than testing it”?

I mean sure, trivially true, but not a good idea. I’ve seen lots of bugs caused by code that unexpectedly forced a commit, or even opened/used/committed a whole new DB connection, somewhere buried down inside a theoretically externally-transactional request handler. Bad code, to be sure, but common in many contexts in my experience.
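
Here's a self-contained sketch of that bug class, with FakeDb and FakeTx as stand-ins I'm inventing for a real driver (nothing here is a real library API): a helper buried inside a "transactional" handler opens its own connection and commits, so its write survives even though the caller's transaction never commits.

    // FakeDb/FakeTx are invented stand-ins for a real driver, used only to show
    // the shape of the bug: a buried helper commits on its own connection while
    // the outer "transaction" never commits.
    using System;
    using System.Collections.Generic;

    public class FakeDb
    {
        public Dictionary<string, string> Committed { get; } = new();
        public FakeTx Begin() => new FakeTx(this);
    }

    public class FakeTx
    {
        private readonly FakeDb _db;
        private readonly Dictionary<string, string> _pending = new();
        public FakeTx(FakeDb db) => _db = db;
        public void Write(string key, string value) => _pending[key] = value;
        public void Commit()
        {
            foreach (var kv in _pending) _db.Committed[kv.Key] = kv.Value;
        }
    }

    public static class Handler
    {
        // Looks externally transactional: the caller assumes all or nothing.
        public static void Post(FakeDb db)
        {
            var tx = db.Begin();
            tx.Write("balance", "75");
            AuditLog(db, "posted 75");                 // the buried problem
            throw new Exception("validation failed");  // tx.Commit() never runs
        }

        // ...but the helper opened its own "connection" and already committed.
        private static void AuditLog(FakeDb db, string message)
        {
            var audit = db.Begin();
            audit.Write("audit", message);
            audit.Commit();
        }
    }

    public static class Demo
    {
        public static void Main()
        {
            var db = new FakeDb();
            try { Handler.Post(db); } catch { /* handler failed; nothing "should" persist */ }

            // Prints "audit" only: partial state the caller never intended to commit.
            Console.WriteLine(string.Join(", ", db.Committed.Keys));
        }
    }

With a real driver the same shape tends to show up as a new connection or a suppressed ambient transaction buried in a helper; either way the outer transaction stops meaning what the caller thinks it means.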

replies(1): >>44497070 #
2. mrkeen ◴[] No.44497070[source]
> I’ve seen lots of bugs caused by code that unexpectedly forced a commit, or even opened/used/committed a whole new DB connection, somewhere buried down inside a theoretically externally-transactional request handler.

Yes! That's my current codebase you're describing! If you interweave the database all throughout your accounting logic, you absolutely can bury those kinds of problems for people to find later. But remember, one test at a time so that you don't accidentally discover that the database transactions aren't protecting you nearly as well as you thought.

In fact, screw database transactions. Pay the cost of the object-relational impedance mismatch and unscalable joins, but make sure you avoid the benefits, by turning off ACID for performance reasons (probably done for you already) and making heavy use of LINQ so that values are loaded in and out of RAM willy-nilly and thereby escape their transaction scopes.

The C# designers really leaned into the 'statements, not expressions' idea! There's no transaction context object returned from beginTrans which could be passed into subsequent operations (forming a nice expression) and thereby clear up any "am I in a transaction?" questions.
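
For illustration, here is a sketch of the style being wished for, with hypothetical names (Tx, BeginTrans, Insert, QueryBalance) made up for the example rather than any actual C#/ADO.NET or LINQ API: every operation takes the transaction explicitly, so the types answer the "am I in a transaction?" question.

    // Hypothetical API invented for the example, not a real C# library:
    // the transaction is an explicit value threaded through every operation.
    using System;
    using System.Collections.Generic;
    using System.Linq;

    public sealed class Tx
    {
        public List<decimal> Writes { get; } = new();
    }

    public static class Store
    {
        public static Tx BeginTrans() => new Tx();

        public static Tx Insert(Tx tx, decimal amount)
        {
            tx.Writes.Add(amount);
            return tx;   // returning the handle keeps the whole interaction an expression
        }

        public static decimal QueryBalance(Tx tx) => tx.Writes.Sum();
    }

    public static class Example
    {
        public static void Run()
        {
            // One expression over an explicit transaction value; nothing ambient.
            var balance = Store.QueryBalance(
                Store.Insert(Store.Insert(Store.BeginTrans(), 100m), -25m));

            Console.WriteLine(balance); // 75
        }
    }

Whatever the language actually offers, the underlying point stands: with ambient transaction state you can't tell from the code in front of you whether you're inside a transaction at all.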

But yeah, right now it's socially acceptable to plumb the database crap right through the business logic. If we could somehow put CSS or i18n in the business logic, we'd need to put a browser into our test suite too!