
492 points storf45 | 1 comments
softwaredoug ◴[] No.42157774[source]
The way to deal with this is to run live events constantly and actually build organizational muscle, not to stage massive one-off events in an area the tech team has no experience in.
replies(9): >>42158542 #>>42158774 #>>42158782 #>>42158854 #>>42158930 #>>42159942 #>>42160430 #>>42160978 #>>42168444 #
mbrumlow ◴[] No.42158774[source]
I have this argument a lot in tech.

We should always be doing (the thing we want to do).

Some examples that always get me in trouble (or at least into big, heated conversations):

1. Always be building: It does not matter if the code has not changed, or there have been no PRs or whatever, build it. Something in your org or infra has likely changed. My argument is "I would rather have a build failure on software that is already released than on software I need to release."

2. Always be releasing: As before, it does not matter if nothing changed, push out a release. Stress the system and make it go through the motions. I can't tell you how many times I have seen things fail to deploy simply because nobody had attempted a deploy in a long time.

There are more; I just don't have time to go into them. The point is: "if you did it, and will ever need to do it again, then you need to do it continuously."

replies(6): >>42158807 #>>42158896 #>>42159793 #>>42159969 #>>42161140 #>>42161623 #
skybrian ◴[] No.42159793[source]
Doing dry runs regularly makes sense, but whether actually shipping it makes sense seems context-dependent. It depends on how much you can minimize the side effects of shipping a release.

Consider publishing a new version of a library: you'd be bumping the version number all the time and invalidating caches, causing downstream rebuilds, for little reason. Or if clients are lazy about updating, any two clients would be unlikely to have the same version.

Or consider the case when shipping results in a software update: millions of customer client boxes wasting bandwidth downloading new releases and restarting for no reason.

Even for a web app, you are probably invalidating caches, resulting in slow page loads.

With enough work, you could probably minimize these side effects, so that releasing a new version that doesn't actually change anything is a non-event. But if you don't invalidate the caches, you're not really doing a full rebuild.

So it seems like there's a tension between doing more end-to-end testing and performance? Implementing a bunch of cache levels and then not using them seems counterproductive.

replies(4): >>42159844 #>>42160284 #>>42161483 #>>42161780 #
lxgr ◴[] No.42159844[source]
It's very hard to do a representative dry run when the most likely potential points of failure are highly load-dependent.

You can try and predict everything that'll happen in production, but if you have nothing to extrapolate from, e.g. because this is your very first large live event, the chances of getting that right are almost zero.

And you can't easily import that knowledge either, because your system might have very different points of failure than the ones external experts might be used to.

replies(2): >>42159985 #>>42160723 #
1. ◴[] No.42159985[source]