
    492 points storf45 | 12 comments
    softwaredoug ◴[] No.42157774[source]
    The way to deal with this is to constantly do live events and actually build organizational muscle, not to run these massive one-off events in an area where the tech team has no experience.
    replies(9): >>42158542 #>>42158774 #>>42158782 #>>42158854 #>>42158930 #>>42159942 #>>42160430 #>>42160978 #>>42168444 #
    mbrumlow ◴[] No.42158774[source]
    I have this argument a lot in tech.

    We should always be doing (the thing we want to do)

    Some examples that always get me in trouble (or at least into big, heated conversations):

    1. Always be building: It does not matter if the code was not changed, or there have been no PRs or whatever, build it. Something in your org or infra has likely changed. My argument is "I would rather have a build failure on software that is already released than on software I need to release".

    2. Always be releasing: As before, it does not matter if nothing changed, push out a release. Stress the system and make it go through the motions. I can't tell you how many times I have seen things fail to deploy simply because nobody had attempted a deploy in a long time.

    There are more; I just don't have time to go into them. The point is: "if you did it, and will ever need to do it again in the future, then you need to do it continuously."

    replies(6): >>42158807 #>>42158896 #>>42159793 #>>42159969 #>>42161140 #>>42161623 #
    1. andai ◴[] No.42158896[source]
    This is great, but what possible counterargument is there? We should prolong indefinitely a spooky ambiguity about whether the system works or not?
    replies(6): >>42158935 #>>42158962 #>>42159076 #>>42159241 #>>42159259 #>>42159634 #
    2. mplewis ◴[] No.42158935[source]
    The common and flawed counterargument is “when we deploy, outages happen.” You’ll hear this constantly at companies with bad habits.
    3. ukuina ◴[] No.42158962[source]
    Finite compute, people, and opportunity cost.

    It is just a reframing of build vs maintain.

    4. macintux ◴[] No.42159076[source]
    In some environments, deploying to production has a massive bureaucracy tax: paperwork, approvals, limited time windows, no deploys during normal business hours, etc.
    replies(2): >>42159187 #>>42159878 #
    5. ◴[] No.42159187[source]
    6. kortilla ◴[] No.42159241[source]
    Deploying is expensive for some models. That could involve customer-facing written release notes, etc. Sometimes the software has to be certified by a government authority.

    Additionally, refactor circle jerks are terrible for back-porting subsequent bug fixes that need to be cherry-picked to stable branches.

    A lot of the world isn’t CD, and constant releases are super expensive.

    7. jerf ◴[] No.42159259[source]
    Easy: Short term risk versus long term risk. If I deploy with minimal changes today, I'm taking a non-zero short-term risk for zero short-term gain.

    While I too am generally a long-term sort of engineer, it's important to understand that this is a valid argument on its own terms, so you don't try to counter it with just "piffle, that's stupid". It's not stupid. It can be shortsighted, it leads to a slippery slope where every day you make that decision it is harder to release next time, and there are a lot of corpses at the bottom of that slope, but it isn't stupid. Sometimes it is even correct: for instance, if the system's getting deprecated away anyhow, why take any risk?

    And there is some opportunity cost, too. No matter how slick the release, it isn't ever free. Even if it's all 100% automated it's still going to barf sometimes and require attention that not making a new release would not have. You could be doing something else with that time.

    8. rconti ◴[] No.42159634[source]
    The counterargument is obvious for anyone who has been on call or otherwise responsible for system stability. It's very easy to become risk-averse in any realm.
    replies(1): >>42160484 #
    9. josho ◴[] No.42159878[source]
    Those taxes were often imposed because of past engineering errors. For example: "don't deploy during business hours", because a past deployment took down production for a day.

    A great engineering team will identify a tax they dislike and work to remove it. Using the same example, that means improving the success rate of deployments so you have the data (the success record) to take to leadership to change the policy and remove the tax.

    10. andai ◴[] No.42160484[source]
    Doesn't ensuring stuff actually works tangibly lower risk?
    replies(2): >>42161786 #>>42161813 #
    11. Jach ◴[] No.42161786{3}[source]
    Not exactly, but it's worth the experiment in trying things anyway. Say you currently have a release once every few months; an ambitious goal would be to get to weekly releases, which is continuous enough by comparison. But 'lower risk' is probably not the leading argument for the change, especially if the quarterly cycle has worked well enough, and the transition itself increases risk for a while. In order for a committed attempt to not devolve into a total dumpster fire, various other practices will need to be added, removed, or changed. (For example, devs might learn the concept of feature flags.) The individuals, which include management, might not be able to pull it off.
    12. mbrumlow ◴[] No.42161813{3}[source]
    Yes, because it lowers the chance of compound risk. The longer you go without stressing the system, the more likely you are to have a double failure, thus increasing your outage duration.

    Simply put: you don’t want to delay finding out that something is broken; you want to know the second it breaks.

    In the case I am suggesting, a failed release will often be deploying the same functionality, so many failure modes will result in zero outage. Not all failure modes will result in an outage.

    When the software is expected to behave differently after the deployment, more systems can end up being part of the outage, for example because the new systems can’t do something or the old systems can’t do something.
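
To put rough numbers on the "double failure" point, here is a minimal sketch. It rests on assumptions that are not in the comment: each component of the release path silently breaks with a fixed, independent daily probability, and nothing notices until the next build or deploy. The function names and the figures (1% per day, 5 components) are made up for illustration only.

```python
def p_broken(p_daily: float, days: int) -> float:
    """Chance that one component has silently broken at some point in `days` days."""
    return 1.0 - (1.0 - p_daily) ** days


def p_double_failure(p_daily: float, days: int, components: int) -> float:
    """Chance that at least two of `components` are broken when you finally deploy.

    Assumes independent components with the same daily break probability.
    """
    q = p_broken(p_daily, days)
    none_broken = (1.0 - q) ** components
    exactly_one = components * q * (1.0 - q) ** (components - 1)
    return 1.0 - none_broken - exactly_one


if __name__ == "__main__":
    # Hypothetical figures: 5 components, each with a 1% daily chance of breaking.
    for days in (1, 7, 30, 90):
        risk = p_double_failure(p_daily=0.01, days=days, components=5)
        print(f"{days:>3} days between deploys: P(two or more broken) = {risk:.4f}")
```

Under these assumptions the chance of hitting two or more problems at once grows with the gap between deploys, which is the compound risk described above. Real failures are rarely independent, so treat this as intuition rather than a forecast.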