
492 points storf45 | 3 comments
softwaredoug No.42157774
The way to deal with this is to do live events constantly and actually build organizational muscle, not these massive one-off events in an area where the tech team has no experience.
mbrumlow No.42158774
I have this argument a lot in tech.

We should always be doing (the thing we want to do)

Some examples that always get me in trouble (or at least into big heated conversations):

1. Always be building: It does not matter that the code has not changed, or that there have been no PRs or whatever; build it anyway. Something in your org or infra has likely changed. My argument is "I would rather have a build failure on software that is already released than on software I need to release".

2. Always be releasing: As before, it does not matter if nothing changed; push out a release. Stress the system and make it go through the motions. I can't tell you how many times I have seen things fail to deploy simply because no one has attempted a deploy in a long time (a rough sketch of what this looks like follows below).

There are more; I just don't have time to go into them. The point is: if you did it, and will ever need to do it again in the future, then you need to continuously do it.
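A minimal sketch of the "always be releasing" idea, assuming a hypothetical deploy.sh wrapper around whatever CI/CD tooling is actually in use; the only point is that the release path gets exercised on a clock, not just when code changes:

  # always_release.py -- exercise the release path on a schedule, even when
  # nothing has changed, so a rotted pipeline is found before it matters.
  # deploy.sh and trigger_pipeline() are hypothetical stand-ins for the real
  # CI/CD system, not any particular tool's API.

  import subprocess
  import sys
  from datetime import datetime, timezone

  def trigger_pipeline(ref: str = "main") -> int:
      """Kick off the normal release pipeline for `ref` and return its exit code."""
      result = subprocess.run(["./deploy.sh", ref], capture_output=True, text=True)
      if result.returncode != 0:
          # Failing here is far cheaper than failing during a real release:
          # the running system is unchanged, but the pipeline rot is now visible.
          print(f"[{datetime.now(timezone.utc).isoformat()}] scheduled release FAILED:")
          print(result.stderr)
      return result.returncode

  if __name__ == "__main__":
      # Run this from cron or a scheduler (e.g. daily), not only on merges.
      sys.exit(trigger_pipeline())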

andai No.42158896
This is great, but what possible counterargument is there? We should prolong indefinitely a spooky ambiguity about whether the system works or not?
rconti No.42159634
The counterargument is obvious for anyone who has been on call or otherwise responsible for system stability. It's very easy to become risk-averse in any realm.
andai No.42160484
Doesn't ensuring stuff actually works tangibly lower risk?
Jach No.42161786
Not exactly, but it's worth experimenting anyway. Say you currently release once every few months; an ambitious goal would be to get to weekly releases, which is continuous enough by comparison. But "lower risk" is probably not the leading argument for the change, especially if the quarterly cycle has worked well enough and the transition itself increases risk for a while. For a committed attempt not to devolve into a total dumpster fire, various other practices will need to be added, removed, or changed. (For example, devs might learn the concept of feature flags; see the sketch below.) The individuals involved, which includes management, might not be able to pull it off.
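For what it's worth, the feature-flag idea can be as small as this. A minimal sketch with made-up names (flags.json, new_checkout); the point is only that shipping code and enabling behaviour become two separate decisions, so frequent releases can carry unfinished work safely:

  # feature_flags.py -- decouple "deployed" from "enabled". Flag names and
  # the JSON flag file are illustrative, not a real flag service.

  import json
  from pathlib import Path

  FLAG_FILE = Path("flags.json")   # e.g. {"new_checkout": false}

  def is_enabled(flag: str, default: bool = False) -> bool:
      """Read the flag at call time so it can be flipped without a redeploy."""
      try:
          flags = json.loads(FLAG_FILE.read_text())
      except (FileNotFoundError, json.JSONDecodeError):
          return default
      return bool(flags.get(flag, default))

  def old_checkout_flow(cart):
      return {"path": "old", "items": cart}

  def new_checkout_flow(cart):
      return {"path": "new", "items": cart}

  def checkout(cart):
      # The new flow ships in every release; the flag decides whether users hit it.
      if is_enabled("new_checkout"):
          return new_checkout_flow(cart)
      return old_checkout_flow(cart)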
mbrumlow No.42161813
Yes, because it lowers compound risk. The longer you go without stressing the system, the more likely you are to have a double failure, which increases your outage duration.

Simply put: you don't want to delay finding out something is broken; you want to know the second it breaks.

In the case I am suggesting, a failed release will often be deploying the same functionality, so many failure modes will result in zero outage. Not all failure modes result in an outage.

When the software is expected to behave differently after the deployment, more systems can end up being part of the outage, such as when the new systems can't do something or the old systems can't do something.
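A rough way to see the compound-risk point, with made-up numbers (the 5%-per-week drift probability is purely illustrative, not a measurement): if each week there is an independent chance that some org or infra drift silently breaks the release path, frequent releases surface each breakage alone, while a quarterly release is likely to trip over at least one, and possibly several stacked together.

  # compound_risk.py -- illustrative arithmetic only; 0.05 is a made-up
  # per-week probability that unnoticed drift breaks the deploy path.

  p_weekly_drift = 0.05

  def p_at_least_one_latent_break(weeks: int, p: float = p_weekly_drift) -> float:
      """Probability that >=1 unnoticed breakage has accumulated after `weeks`."""
      return 1 - (1 - p) ** weeks

  print(f"weekly releases:    {p_at_least_one_latent_break(1):.0%} chance a given release trips over drift")
  print(f"quarterly releases: {p_at_least_one_latent_break(13):.0%} chance, with any breakages stacked together")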