    566 points PaulHoule | 15 comments
    mike_hearn ◴[] No.44490340[source]
    A good chance to bring up something I've been flagging to colleagues for a while now: with LLM agents we are very quickly going to become even more CPU-bottlenecked on testing performance than we are today, and every team I know of was bottlenecked on CI speed even before LLMs. There's no point having an agent that can write code 100x faster than a human if every change takes an hour to test.

    Maybe I've just been unlucky in the past, but in most projects I worked on, a lot of developer time was wasted waiting for PRs to go green. Many runs end up bottlenecked on I/O or the availability of workers, and so changes can sit in queues for hours, or they flake out and everything has to start again.

    As they get better, coding agents are going to be assigned simple tickets that they turn into green PRs, with the model reacting to test failures and fixing them as it goes. This will make the CI bottleneck even worse.

    It feels like there's a lot of low-hanging fruit in most projects' testing setups, but for some reason I've seen nearly no progress here for years. It feels like we kinda collectively got used to the idea that CI services are slow and expensive, then stopped trying to improve things. If anything, CI got a lot slower over time as people tried to make builds fully hermetic (so no inter-run caching) and moved them from on-prem dedicated hardware to expensive cloud VMs with slow I/O, which haven't got much faster over time.

    Mercury is crazy fast and in a few quick tests I did, created good and correct code. How will we make test execution keep up with it?

    replies(28): >>44490408 #>>44490637 #>>44490652 #>>44490785 #>>44491195 #>>44491421 #>>44491483 #>>44491551 #>>44491898 #>>44492096 #>>44492183 #>>44492230 #>>44492386 #>>44492525 #>>44493236 #>>44493262 #>>44493392 #>>44493568 #>>44493577 #>>44495068 #>>44495946 #>>44496321 #>>44496534 #>>44497037 #>>44497707 #>>44498689 #>>44502041 #>>44504650 #
    kccqzy ◴[] No.44490652[source]
    > Maybe I've just been unlucky in the past, but in most projects I worked on, a lot of developer time was wasted waiting for PRs to go green.

    I don't understand this. Developer time is so much more expensive than machine time. Do companies not just double their CI workers after hearing people complain? It's just a throw-more-resources problem. When I was at Google, it was somewhat common for me to debug non-deterministic bugs such as a missing synchronization or fence causing flakiness; and it was common to just launch 10000 copies of the same test on 10000 machines to find perhaps a single digit number of failures. My current employer has a clunkier implementation of the same thing (no UI), but there's also a single command to launch 1000 test workers to run all tests from your own checkout. The goal is to finish testing a 1M loc codebase in no more than five minutes so that you get quick feedback on your changes.
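
    For teams without that kind of internal tooling, the same "run it many times and count the failures" idea can be approximated on a single machine. Here is a minimal Java sketch, where flakyScenario() is a hypothetical stand-in for whatever test body is suspected of nondeterminism:

        import java.util.concurrent.CountDownLatch;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.atomic.AtomicInteger;

        public class FlakeHunter {
            // Hypothetical stand-in for the suspect test body; throws on failure.
            static void flakyScenario() throws Exception {
                // ... exercise the code path with the suspected race ...
            }

            public static void main(String[] args) throws Exception {
                final int runs = 10_000;
                ExecutorService pool = Executors.newFixedThreadPool(
                        Runtime.getRuntime().availableProcessors());
                AtomicInteger failures = new AtomicInteger();
                CountDownLatch done = new CountDownLatch(runs);
                try {
                    for (int i = 0; i < runs; i++) {
                        pool.submit(() -> {
                            try {
                                flakyScenario();
                            } catch (Throwable t) {
                                failures.incrementAndGet();
                            } finally {
                                done.countDown();
                            }
                        });
                    }
                    done.await();
                } finally {
                    pool.shutdown();
                }
                System.out.printf("%d failures out of %d runs%n", failures.get(), runs);
            }
        }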

    > make builds fully hermetic (so no inter-run caching)

    These are orthogonal. You want maximally deterministic CI steps so that you can make builds fully hermetic and cache every single thing.

    replies(16): >>44490726 #>>44490764 #>>44491015 #>>44491034 #>>44491088 #>>44491949 #>>44491953 #>>44492546 #>>44493309 #>>44494481 #>>44494583 #>>44495174 #>>44496510 #>>44497007 #>>44500400 #>>44513737 #
    1. mike_hearn ◴[] No.44490764[source]
    I was also at Google for years. Places like that are not even close to representative. They can afford to just throw more resources at the problem: they get bulk discounts on hardware and they pay top dollar for engineers.

    In the more common scenarios that represent 95% of the software industry, CI budgets are fixed, clusters are sized to be busy most of the time, and you cannot simply launch 10,000 copies of the same test on 10,000 machines. And even so, these CI clusters can easily burn through the equivalent of several SWE salaries.

    > These are orthogonal. You want maximally deterministic CI steps so that you can make builds fully hermetic and cache every single thing.

    Again, that's how companies like Google do it. In normal companies, build caching isn't always perfectly reliable, and if CI runs suffer flakes due to caching then eventually some engineer is gonna get mad and convince someone else to turn the caching off. Blaze goes to extreme lengths to ensure this doesn't happen, and Google spends extreme sums of money on helping it do that (e.g. porting third party libraries to use Blaze instead of their own build system).

    In companies without money printing machines, they sacrifice caching to get determinism and everything ends up slow.

    replies(4): >>44491160 #>>44492797 #>>44498410 #>>44498934 #
    2. PaulHoule ◴[] No.44491160[source]
    Most of my experience writing concurrent/parallel code in (mainly) Java has been rewriting half-baked stuff that would need a lot of testing with straightforward, reliable, and reasonably performant code that uses sound, easy-to-use primitives such as Executors (watch out for teardown though), database transactions, atomic database operations, etc. Drink the Kool-Aid and mess around with synchronized or actors or Streams or something and you're looking at a world of hurt.
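
    A minimal sketch of that Executor-plus-careful-teardown pattern (the task bodies and pool sizes here are made up for illustration):

        import java.util.ArrayList;
        import java.util.List;
        import java.util.concurrent.Callable;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.Future;
        import java.util.concurrent.TimeUnit;

        public class ParallelWork {
            public static void main(String[] args) throws Exception {
                ExecutorService pool = Executors.newFixedThreadPool(8);
                try {
                    List<Callable<String>> tasks = new ArrayList<>();
                    for (int i = 0; i < 100; i++) {
                        final int id = i;
                        tasks.add(() -> "result-" + id);   // placeholder work item
                    }
                    // invokeAll blocks until every task has completed or failed
                    for (Future<String> f : pool.invokeAll(tasks)) {
                        System.out.println(f.get());       // get() rethrows any task exception
                    }
                } finally {
                    // The teardown caveat above: without shutdown(), non-daemon worker
                    // threads keep the JVM alive after main() returns.
                    pool.shutdown();
                    if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                        pool.shutdownNow();
                    }
                }
            }
        }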

    I've written a limited number of systems that needed tests that probe for race conditions by doing something like having 3000 threads run a random workload for 40 seconds. I'm proud of that "SuperHammer" test on a certain level but boy did I hate having to run it with every build.
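
    For readers who haven't seen this style of test, a stripped-down sketch of the idea (the thread count, duration, and the deliberately racy counter are placeholders, not the actual "SuperHammer"):

        import java.util.ArrayList;
        import java.util.List;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.Future;
        import java.util.concurrent.ThreadLocalRandom;
        import java.util.concurrent.atomic.AtomicLong;

        public class MiniHammerTest {
            // Deliberately racy stand-in for the component under test:
            // unsynchronized increments on a shared long lose updates.
            static long racyCounter = 0;
            static final AtomicLong expected = new AtomicLong();

            public static void main(String[] args) throws Exception {
                int threads = 64;               // the real test used 3000 threads
                long durationMillis = 5_000;    // ...for 40 seconds
                long deadline = System.currentTimeMillis() + durationMillis;
                ExecutorService pool = Executors.newFixedThreadPool(threads);
                List<Future<?>> workers = new ArrayList<>();
                for (int i = 0; i < threads; i++) {
                    workers.add(pool.submit(() -> {
                        ThreadLocalRandom rnd = ThreadLocalRandom.current();
                        while (System.currentTimeMillis() < deadline) {
                            if (rnd.nextInt(4) == 0) {     // "random workload"
                                racyCounter++;
                                expected.incrementAndGet();
                            }
                        }
                    }));
                }
                for (Future<?> worker : workers) {
                    worker.get();               // wait for workers; propagates any failure
                }
                pool.shutdown();
                // With correct synchronization these would match; lost updates expose the race.
                System.out.println("racy=" + racyCounter + ", expected=" + expected.get());
            }
        }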

    replies(1): >>44500415 #
    3. kridsdale1 ◴[] No.44492797[source]
    I’m at Google today and even with all the resources, I am absolutely most bottlenecked by the Presubmit TAP and human review latency. Making CLs in the editor takes me a few hours. Getting them in the system takes days and sometimes weeks.
    replies(4): >>44495094 #>>44497595 #>>44498895 #>>44503920 #
    4. simonw ◴[] No.44495094[source]
    Presumably the "days and sometimes weeks" thing is entirely down to human review latency?
    replies(1): >>44496441 #
    5. refulgentis ◴[] No.44496441{3}[source]
    Yes and no; I'd estimate 1/3 to 1/2 of that is down to test suites being flaky and time-consuming to run. IIRC the shortest build I had was 52m, for the Android Wear iOS app; easily 3 hours for Android.
    replies(1): >>44501441 #
    6. to23iho34324 ◴[] No.44497595[source]
    Indeed. You'd think Google would test for how well people cope with boredom, rather than running bait-and-switch interviews that make it seem like you'll be solving l33tcode every evening.
    replies(1): >>44497989 #
    7. ozim ◴[] No.44497989{3}[source]
    You think people work on a single issue at a time?

    Maybe at Google they can afford that. Where I worked, at some point I was working on 2 or 3 projects, switching between issues. Of course all the projects were the same tech and mostly the same setup, but the business logic and tasks were different.

    If I have to wait 2-3 hours, I have code to review and bug fixes to implement in different places. Even on a single project, if you wait 2 hours until your code lands in the test env and have nothing else to do, someone is mismanaging the process.

    replies(1): >>44505908 #
    8. Aeolun ◴[] No.44498410[source]
    This feels incredibly weird. My team and I never wait on CI for very long because we just throw more machines at the problem. Supplying a whole team of SEs with unlimited CI costs us the equivalent of a quarter of one SE's salary. We don't use anything but GitHub's built-in caching, and yeah, that's the main thing making CI slower right now. Never more than 5 minutes though. We certainly never wait for machines to free up.
    9. phkahler ◴[] No.44498895[source]
    >> Making CLs in the editor takes me a few hours. Getting them in the system takes days and sometimes weeks.

    You just need an AI agent to shepherd it through the slow process for you while you work on something else!

    10. codethief ◴[] No.44498934[source]
    > In normal companies, build caching isn't always perfectly reliable

    In normal companies you often don't have build & task caching to begin with. Heck, people often don't even know how Docker image layer caching works.

    11. switchbak ◴[] No.44500415[source]
    I'm all for boring tech and all, and leveraging the simplest thing that can work. But arguing against streams? You mean reactive streams?

    These have helped me replace megabytes of poorly written concurrent/parallel crap with a few lines of stream orchestration. I find it interesting that our experiences diverge so wildly.

    Then again, I’ve had to rewrite some terribly clever code, some downright diabolical, so that could be part of it.

    replies(1): >>44500548 #
    12. PaulHoule ◴[] No.44500548{3}[source]
    I mean the streams library in JDK 8.

    Reactive streams are great. I wrote a data transformation toolkit that used reactive streams for the data plane (passing small RDF documents along the pipes) and the Jena rules engine for the control plane (assembling the reactive stream pipeline and tearing it down).

    I later worked with some people who built something with a similar architecture -- their system didn't get the same answer every time because they didn't handle teardown properly, but that's a common problem and not hard to fix. Lots of people forget to do it, even with Executors.
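
    As a generic illustration of that teardown point, here is a minimal sketch using the plain JDK 9 Flow API (not the RDF/Jena system described above): if the publisher is never closed, subscribers never see onComplete, and anything waiting on completion hangs or reports a partial result.

        import java.util.concurrent.CountDownLatch;
        import java.util.concurrent.Flow;
        import java.util.concurrent.SubmissionPublisher;

        public class PipelineTeardown {
            public static void main(String[] args) throws InterruptedException {
                CountDownLatch completed = new CountDownLatch(1);
                SubmissionPublisher<String> publisher = new SubmissionPublisher<>();
                publisher.subscribe(new Flow.Subscriber<String>() {
                    private Flow.Subscription subscription;
                    public void onSubscribe(Flow.Subscription s) { subscription = s; s.request(1); }
                    public void onNext(String doc) {
                        System.out.println("processed " + doc);   // stand-in for real pipeline work
                        subscription.request(1);
                    }
                    public void onError(Throwable t) { t.printStackTrace(); completed.countDown(); }
                    public void onComplete() { completed.countDown(); }
                });
                try {
                    publisher.submit("doc-1");   // placeholder documents flowing down the pipe
                    publisher.submit("doc-2");
                } finally {
                    publisher.close();           // the teardown step: signals onComplete downstream
                }
                completed.await();               // without close() above, this would hang forever
            }
        }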

    13. xemdetia ◴[] No.44501441{4}[source]
    While I'm not at Google myself, a lot of CI test failures are just knock-on effects of the complex, interdependent CI components delivering the whole experience. Oops, Artifactory or GitHub rate-limited you. Oops, the SAST checker from some new vendor just never finished. Even if your code passes locally, the added complexity of CI is often fraught with flaky and confusing errors that are intermittent or triggered by environmental problems at the particular moment you tried.
    14. etruong42 ◴[] No.44503920[source]
    I am also at Google, and I can corroborate this experience personally, as well as from comments teammates make to me, in group settings, and in team retrospectives.

    There are a lot of technical challenges in maintaining code health in a monorepo with 100k+ active contributors, so teams and individuals get a lot of plausible excuses for kicking the problem down the road, and truly improving code health is not appropriately incentivized. One common occurrence is a broken monorepo: you just wait until someone fixes it and then retry submitting your code. It's such a common occurrence that people generally do not investigate the brokenness; maybe the monorepo wasn't broken and your code change actually made things even flakier, but no one can distinguish that from a broken monorepo that eventually got fixed, because no one bothers to check anymore.

    15. to23iho34324 ◴[] No.44505908{4}[source]
    Dude, I've worked at Google.

    It's an overpaid, overglorified, boring job (obv. outside all the 'cool' research-y projects).