Most active commenters

    492 points storf45 | 13 comments
    shermantanktop ◴[] No.42160502[source]
    Every time a big company screws up, there are two highly informed sets of people who are guaranteed to be lurking, but rarely post, in a thread like this:

    1) those directly involved with the incident, or employees of the same company. They have too much to lose by circumventing the PR machine.

    2) people at similar companies who operate similar systems with similar scale and risks. Those people know how hard this is and aren’t likely to publicly flog someone doing the same job based on uninformed speculation. They know their own systems are Byzantine and don’t look like what random onlookers think they would look like.

    So that leaves the rest, who offer insights based on how stuff works at a small scale, or better yet, pronouncements rooted in “first principles.”

    replies(15): >>42160568 #>>42160576 #>>42160579 #>>42160888 #>>42160913 #>>42161148 #>>42161164 #>>42161399 #>>42161529 #>>42161703 #>>42161724 #>>42161889 #>>42165352 #>>42166894 #>>42167814 #
    1. karaterobot ◴[] No.42160579[source]
    The only time I worked on a project that had a live television launch, it absolutely tipped over within like 2 minutes, and people on HN and Reddit were making fun of it. And I know how hard everyone worked, and how competent they were, so I sympathize with the people in these cases. While the internet was teeing off with easy jokes, engineers were swarming on a problem that was just not resolving, PMs were pacing up and down the hallway, people were getting yelled at by leadership, etc. It's like taking all the stress and complexity of a product launch and multiplying it by 100. And the thing I'm talking about was just a website, not even a live video stream.
    replies(6): >>42160663 #>>42160778 #>>42161112 #>>42161381 #>>42161710 #>>42189210 #
    2. swyx ◴[] No.42160663[source]
    what was the ultimate cause/fix of issues in your case? a database thing?
    replies(1): >>42161392 #
    3. jillyboel ◴[] No.42160778[source]
    > people were getting yelled at by leadership

    this is where you get up and leave

    4. ryoshu ◴[] No.42161112[source]
    Those are the times when you identify who is there to help and who is there to be performative.
    replies(1): >>42162707 #
    5. pdimitar ◴[] No.42161381[source]
    You cannot leave us hanging like that. What was the issue?
    6. nikau ◴[] No.42161392[source]
    Insufficient testing
    replies(1): >>42161485 #
    7. windexh8er ◴[] No.42161485{3}[source]
    While that may be the case, the things like this I've experienced have been more along the lines of incompetent management.

    In one case I was doing an upgrade on an IPTV distribution network for a cable provider (15+ years ago at this point). This particular segment of subscribers totalled more than 100k accounts. I validated the hardware and software rev installed on the routers in question prior to my trip to the data center (2+ hour drive). I informed management that the version currently running on the router wasn't compatible with the hardware rev of the card I was upgrading to. I was told that it would in fact work, and that we had that same combination of hw/sw running elsewhere. I couldn't find it when I went to look at other sites. I mentioned it in email prior to leaving and was told to go.

    Long story short, the card didn't work and I had to back it out. The HA failover didn't work on the downgrade and took down all of those subscribers, as the total outage caused a cascading issue with some other gear in this facility. All in all it was during an off-peak time of day, but it wasted time and cost us customer satisfaction.

    8. adamredwoods ◴[] No.42161710[source]
    Some breaks are just too difficult to predict. For example, I work in ecommerce and we had a page break because the content team pushed too many items into an array, which caused a back-end service to throw errors. Because we were the middle service, taking content from the CMS and making the request to the back-end, I'm not sure how we could have seen that issue coming in advance (and no one knew there was a limit).
    replies(2): >>42161948 #>>42162423 #
    9. tuukkah ◴[] No.42161948[source]
    I'm not saying it's easy, but start by assuming that there's a limit and that any request can throw errors? (Proceed accordingly.)
    replies(1): >>42162857 #
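
    A minimal sketch of what "assume there's a limit and any request can throw" might look like in a middle service. Everything here is hypothetical: `ASSUMED_BATCH_LIMIT`, `fetch_all`, and the `call_backend` callable are illustrative names, and the batch size is a guess, since the thread says nobody knew the real limit.

```python
from typing import Callable, Iterable, List, Tuple

ASSUMED_BATCH_LIMIT = 50  # hypothetical guess; the real back-end limit was unknown


def chunked(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_all(
    items: List[str],
    call_backend: Callable[[List[str]], List[dict]],
    limit: int = ASSUMED_BATCH_LIMIT,
) -> Tuple[List[dict], List[str]]:
    """Call the back-end in bounded batches and collect failures instead of crashing."""
    results: List[dict] = []
    failed: List[str] = []
    for batch in chunked(items, limit):
        try:
            results.extend(call_backend(batch))
        except Exception:
            # Any request can throw: record the batch and keep rendering the rest.
            failed.extend(batch)
    return results, failed
```

    The caller gets back both the successful results and the items that failed, so the page can degrade (render what it has, log the rest) rather than break outright.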
    10. steve_adams_86 ◴[] No.42162423[source]
    > Some breaks are just too difficult to predict.

    Absolutely. I think a great filter for developers is determining how well they understand this. Over-simplification of problems and certainty about one’s ability to build reliable services at scale is a massive red flag to me.

    I have to say some of the hardest challenges I’ve encountered were in e-commerce, too.

    It’s a lot harder and more interesting than I think many people realize. I learned so much working on those projects.

    In one case, the system relied on SQLite and god damn did things go sideways as the company grew its customer base. That was the fastest database migration project I’ve ever been on, haha.

    I often think it could have worked today. SQLite has made huge leaps in the areas where we were struggling. I’m not sure it would have been a forever solution (the company is massive now), but it would have bought us some much-needed time. It’s funny how that stuff changes. A lot of my takeaways about SQLite from 10 years ago don’t apply quite the same anymore. I use it for things now that I never would have back then.

    11. shermantanktop ◴[] No.42162707[source]
    Those performative people are worse than useless. They take up critical bandwidth and add no real value.

    An effective operational culture has methods for removing those people from the conversations that matter. Unfortunately that earns you a reputation for being “cutthroat” or “lacking empathy.”

    Both of those are real things, but it’s the C players who claim they are being unfairly treated, when in fact their limelight-seeking behavior is the problem.

    If all that sounds harsh, like the kitchen on The Bear, well…that’s kinda how it is sometimes. Not everyone thrives in that environment, and arguably the ones who do are a little “off.”

    12. adamredwoods ◴[] No.42162857{3}[source]
    All requests expect errors. How a developer handles them... well...

    And for limit checking, how often do you write array limit handlers? And what if the BE contract doesn't specify one? Additionally, it will need a regression unit test, because who knows when the next developer will remove that limit check.
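
    For illustration, a rough sketch of an array limit handler paired with the regression test that guards it. The names (`MAX_ITEMS`, `TooManyItemsError`, `build_backend_payload`) and the limit value are assumptions, not anything from a real contract.

```python
MAX_ITEMS = 100  # hypothetical limit; the back-end contract did not state one


class TooManyItemsError(ValueError):
    """Raised when CMS content exceeds what we are willing to forward."""


def build_backend_payload(items: list, max_items: int = MAX_ITEMS) -> dict:
    """Validate the CMS array before forwarding it to the back-end."""
    if len(items) > max_items:
        raise TooManyItemsError(
            f"{len(items)} items exceeds limit of {max_items}"
        )
    return {"items": list(items)}


def test_rejects_oversized_array():
    """Regression test: fails if someone removes the limit check."""
    try:
        build_backend_payload(["x"] * (MAX_ITEMS + 1))
    except TooManyItemsError:
        return
    raise AssertionError("limit check is missing")
```

    The test exists only to fail loudly when the check disappears, which is the "next developer" scenario.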

    13. seanp2k2 ◴[] No.42189210[source]
    Shoulda used Varnish.