
492 points storf45 | 2 comments
shermantanktop ◴[] No.42160502[source]
Every time a big company screws up, there are two highly informed sets of people who are guaranteed to be lurking, but rarely post, in a thread like this:

1) those directly involved with the incident, or employees of the same company. They have too much to lose by circumventing the PR machine.

2) people at similar companies who operate similar systems with similar scale and risks. Those people know how hard this is and aren’t likely to publicly flog someone doing the same job based on uninformed speculation. They know their own systems are Byzantine and don’t look like what random onlookers think they would look like.

So that leaves the rest, who offer insights based on how stuff works at a small scale, or better yet, pronouncements rooted in “first principles.”

replies(15): >>42160568 #>>42160576 #>>42160579 #>>42160888 #>>42160913 #>>42161148 #>>42161164 #>>42161399 #>>42161529 #>>42161703 #>>42161724 #>>42161889 #>>42165352 #>>42166894 #>>42167814 #
karaterobot ◴[] No.42160579[source]
The only time I worked on a project that had a live television launch, it absolutely tipped over within like 2 minutes, and people on HN and Reddit were making fun of it. And I know how hard everyone worked, and how competent they were, so I sympathize with the people in these cases. While the internet was teeing off with easy jokes, engineers were swarming on a problem that was just not resolving, PMs were pacing up and down the hallway, people were getting yelled at by leadership, etc. It's like taking all the stress and complexity of a product launch and multiplying it by 100. And the thing I'm talking about was just a website, not even a live video stream.
replies(6): >>42160663 #>>42160778 #>>42161112 #>>42161381 #>>42161710 #>>42189210 #
swyx ◴[] No.42160663[source]
what was the ultimate cause/fix of issues in your case? a database thing?
replies(1): >>42161392 #
1. nikau ◴[] No.42161392[source]
Insufficient testing
replies(1): >>42161485 #
2. windexh8er ◴[] No.42161485[source]
While that may be the case, the things like this I've experienced have been more along the lines of incompetent management.

In one case I was doing an upgrade on an IPTV distribution network for a cable provider (15+ years ago at this point). This particular segment of subscribers totalled more than 100k accounts. I did validation of the hardware and software rev installed on the routers in question prior to my trip to the data center (2+ hour drive). I informed management that the currently running version on the router wasn't compatible with the hardware rev of the card I was upgrading to. I was told that it would in fact work, that we had that same combination of hw/sw running elsewhere. I couldn't find it when I went to look at other sites. I mentioned it in email prior to leaving and was told to go.

Long story short, the card didn't work and I had to back it out. The HA failover didn't work on the downgrade and took down all of those subscribers as the total outage caused a cascading issue with some other gear in this facility. All in all, it happened during an off-peak time of day, but it was a waste of time and customer satisfaction.