492 points storf45 | 57 comments
1. softwaredoug ◴[] No.42157774[source]
The way to deal with this is to constantly do live events, and actually build organizational muscle. Not these massive one off events in an area the tech team has no experience in.
replies(9): >>42158542 #>>42158774 #>>42158782 #>>42158854 #>>42158930 #>>42159942 #>>42160430 #>>42160978 #>>42168444 #
2. _dark_matter_ ◴[] No.42158542[source]
Agreed. This is a management failure, full stop. Unbelievable that they'd expect engineering to handle a single Livestream event of this magnitude.
3. mbrumlow ◴[] No.42158774[source]
I have this argument a lot in tech.

We should always be doing (the thing we want to do)

Some examples that always get me in trouble (or at least into big heated conversations):

1. Always be building: It does not matter if code was not changed, or there has been no PRs or whatever, build it. Something in your org or infra has likely changed. My argument is "I would rather have a build failure on software that is already released, than software I need to release".

2. Always be releasing: As before it does not matter if nothing changed, push out a release. Stress the system and make it go through the motions. I can't tell you how many times I have seen things fail to deploy simply because they have not attempted to do so in some long period of time.

There are more, but I don't have time to go into them. The point is: if you did it, and will ever need to do it again in the future, then you need to continuously do it (a rough sketch of the "always be releasing" case follows below).
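
To make "always be releasing" concrete, here is a minimal sketch of what a scheduled no-op release job could look like. The `make build` / `make deploy` targets and the paging webhook URL are stand-ins for whatever a real pipeline uses, not anything described in this thread.

    import subprocess
    import sys
    import urllib.request

    PAGER_WEBHOOK = "https://example.internal/page-oncall"  # hypothetical endpoint

    def run(step: str, cmd: list[str]) -> None:
        """Run one pipeline step; page the on-call and exit nonzero if it fails."""
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            urllib.request.urlopen(PAGER_WEBHOOK, data=result.stderr.encode())
            sys.exit(f"{step} failed with nothing changed -- fix it now, not during an urgent release")

    if __name__ == "__main__":
        # Run nightly from cron even when no code changed, so infra/config drift
        # shows up as a build failure on already-released software.
        run("build", ["make", "build"])
        run("deploy", ["make", "deploy"])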

replies(6): >>42158807 #>>42158896 #>>42159793 #>>42159969 #>>42161140 #>>42161623 #
4. MisterBastahrd ◴[] No.42158782[source]
The WWE is moving their programming to Netflix next year. If I were them, I'd be horrified at what I saw.
5. parasti ◴[] No.42158807[source]
This is golden advice, honestly. "If you don't use it, you lose it" applied to software development.
6. geor9e ◴[] No.42158854[source]
They've been doing live events since 2023. But it's hard to be prepared for something that's never been done by anyone before: a Super Bowl-scale event viewed entirely over the internet. The Super Bowl gets to offload to cable and over-the-air broadcast. Interestingly, I didn't have any problems with my stream, so it sounds like the bandwidth problems might be localized, perhaps by data center or ISP.
replies(7): >>42159567 #>>42159816 #>>42160225 #>>42161436 #>>42161557 #>>42164734 #>>42165472 #
7. andai ◴[] No.42158896[source]
This is great, but what possible counterargument is there? We should prolong indefinitely a spooky ambiguity about whether the system works or not?
replies(6): >>42158935 #>>42158962 #>>42159076 #>>42159241 #>>42159259 #>>42159634 #
8. ignoramous ◴[] No.42158930[source]
> ...the tech team has no experience in

Unless Netflix eng decides to release a public postmortem, we can only speculate. In my time organizing small-time live streams, we always had up to 3 parallel "backup" streams (Vimeo, Cloudflare, Livestream). At Netflix's scale, I doubt they could simply pull in any of these providers, but I guess Akamai / Cloudflare would have been up for it.

9. mplewis ◴[] No.42158935{3}[source]
The common and flawed counterargument is “when we deploy, outages happen.” You’ll hear this constantly at companies with bad habits.
10. ukuina ◴[] No.42158962{3}[source]
Finite compute, people, and opportunity cost.

It is just a reframing of build vs maintain.

11. macintux ◴[] No.42159076{3}[source]
In some environments, deploying to production has a massive bureaucracy tax. Paperwork, approvals, limited windows in time, can’t do them during normal business hours, etc.
replies(2): >>42159187 #>>42159878 #
12. ◴[] No.42159187{4}[source]
13. kortilla ◴[] No.42159241{3}[source]
Deploying is expensive for some models. It could involve customer-facing written release notes, etc. Sometimes the software has to be certified by a government authority.

Additionally, refactor circle jerks are terrible for back-porting subsequent bug fixes that need to be cherry-picked to stable branches.

A lot of the world isn't CD, and constant releases are super expensive.

14. jerf ◴[] No.42159259{3}[source]
Easy: Short term risk versus long term risk. If I deploy with minimal changes today, I'm taking a non-zero short-term risk for zero short-term gain.

While I too am generally a long-term sort of engineer, it's important to understand that this is a valid argument on its own terms, so you don't try to counter it with just "piffle, that's stupid". It's not stupid. It can be shortsighted, it leads to a slippery slope where every day you make that decision it is harder to release next time, and there's a lot of corpses at the bottom of that slope, but it isn't stupid. Sometimes it is even correct, for instance, if the system's getting deprecated away anyhow why take any risk?

And there is some opportunity cost, too. No matter how slick the release, it isn't ever free. Even if it's all 100% automated it's still going to barf sometimes and require attention that not making a new release would not have. You could be doing something else with that time.

15. burntalmonds ◴[] No.42159567[source]
Yeah, I think people are incorrectly assuming that everyone had the same experience with the stream. I watched the whole thing and only had a few instances of buffering and quality degradation. Not more than 30 seconds total during the stream.
replies(1): >>42159760 #
16. rconti ◴[] No.42159634{3}[source]
The counterargument is obvious for anyone who has been on call or otherwise responsible for system stability. It's very easy to become risk-averse in any realm.
replies(1): >>42160484 #
17. DharmaPolice ◴[] No.42159760{3}[source]
Even if only 30% of people had a problem, that's still millions of unhappy users. Not great for a time-sensitive event.

Also, from lurking in various threads on the topic Netflix's in app messages added to people's irritation by suggesting that they check their WiFi/internet was working. Presumably that's the default error message but perhaps that could have been adjusted in advance somehow.

replies(2): >>42161566 #>>42163798 #
18. skybrian ◴[] No.42159793[source]
Doing dry runs regularly makes sense, but whether actually shipping it makes sense seems context-dependent. It depends on how much you can minimize the side effects of shipping a release.

Consider publishing a new version of a library: you'd be bumping the version number all the time and invalidating caches, causing downstream rebuilds, for little reason. Or if clients are lazy about updating, any two clients would be unlikely to have the same version.

Or consider the case when shipping results in a software update: millions of customer client boxes wasting bandwidth downloading new releases and restarting for no reason.

Even for a web app, you are probably invalidating caches, resulting in slow page loads.

With enough work, you could probably minimize these side effects, so that releasing a new version that doesn't actually change anything is a non-event. But if you don't invalidate the caches, you're not really doing a full rebuild.

So it seems like there's a tension between doing more end-to-end testing and performance? Implementing a bunch of cache levels and then not using it seems counterproductive.

replies(4): >>42159844 #>>42160284 #>>42161483 #>>42161780 #
19. mastazi ◴[] No.42159816[source]
Maybe they considered this event a rehearsal for the upcoming NFL streams, which I'm guessing might have a wider audience.
replies(3): >>42159938 #>>42162015 #>>42200724 #
20. lxgr ◴[] No.42159844{3}[source]
It's very hard to do a representative dry run when the most likely potential points of failure are highly load-dependent.

You can try and predict everything that'll happen in production, but if you have nothing to extrapolate from, e.g. because this is your very first large live event, the chances of getting that right are almost zero.

And you can't easily import that knowledge either, because your system might have very different points of failure than the ones external experts might be used to.

replies(2): >>42159985 #>>42160723 #
21. josho ◴[] No.42159878{4}[source]
Those taxes were often imposed because of past engineering errors. For example: don't deploy during business hours, because a past deployment took down production for a day.

A great engineering team will identify a tax they dislike and work to remove it. Using the same example, that means improving the success rate of deployments so you have the data (the success record) to take to leadership to change the policy and remove the tax.

22. ◴[] No.42159938{3}[source]
23. giantg2 ◴[] No.42159942[source]
Wow, building talent from within? I thought that went out of fashion. I think companies are too impatient to develop their employees.
24. unoti ◴[] No.42159969[source]
> 1. Always be building: It does not matter if code was not changed...

> 2. Always be releasing...

A good argument for this is security. Whatever libraries/dependencies you have, unpin the versions and have good unit tests. Fixes for security vulnerabilities that land upstream have to make it into a release; you cannot remove those vulnerabilities unless you are doing regular releases. This in turn implies having good unit tests, so you can do these builds and releases with a lower probability of shipping something broken. It also implies strong monitoring and metrics, so you are the first to know when something breaks.
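
As a sketch of the "unpin and keep releasing" idea, a scheduled job along these lines could fail the build whenever installed dependencies drift behind upstream, forcing a fresh release that picks up the fixes. The pip-based setup is an assumption for illustration, not something the commenter specified.

    import json
    import subprocess
    import sys

    def outdated_packages() -> list[dict]:
        """Ask pip which installed packages have newer releases upstream."""
        out = subprocess.run(
            [sys.executable, "-m", "pip", "list", "--outdated", "--format=json"],
            capture_output=True, text=True, check=True,
        )
        return json.loads(out.stdout)

    if __name__ == "__main__":
        stale = outdated_packages()
        for pkg in stale:
            print(f"{pkg['name']}: {pkg['version']} -> {pkg['latest_version']}")
        # Fail the scheduled build so a fresh release (with the upstream fixes)
        # gets cut, instead of letting dependencies rot between releases.
        sys.exit(1 if stale else 0)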

replies(2): >>42160446 #>>42161034 #
25. ◴[] No.42159985{4}[source]
26. firesteelrain ◴[] No.42160225[source]
I had issues here and there, but there were workarounds. Then, towards the end, the quality either auto-negotiated or was forced down to accommodate the massive pull.
27. bonestamp2 ◴[] No.42160284{3}[source]
I like all of these considerations, although I also imagine for every context there is some frequency at which it is worthwhile to invalidate the caches to ensure that all parts of the system are still functioning as expected (including the rebuilding of the caches).
replies(1): >>42161789 #
28. eh9 ◴[] No.42160430[source]
that’s difficult to reproduce at scale; there are only so many “super bowl” events in a calendar year
29. caseyohara ◴[] No.42160446{3}[source]
> Whatever libraries/dependencies you have, unpin the versions, and have good unit tests.

Nitpick: unit tests by definition should not be exercising dependencies outside the unit boundary. What you want are solid integration and system tests for that.

30. andai ◴[] No.42160484{4}[source]
Doesn't ensuring stuff actually works tangibly lower risk?
replies(2): >>42161786 #>>42161813 #
31. leptons ◴[] No.42160723{4}[source]
They could have done a dry run. They could have spun up a million virtual machines somewhere, and tested their video delivery for 30 minutes. Even my small team spins up 10,000 EC2 instances on the regular. Netflix has the money to do much more. I'm sure there are a dozen ways they could have stress-tested this beforehand. It's not like someone sprang this on them last week and they had to scramble to put together a system to do it.
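
For illustration only, a throwaway load-generating fleet along the lines the commenter describes might be spun up with boto3 roughly like this. The AMI ID, instance type, counts, and the `stream-load-client` binary are all placeholders, not anything Netflix actually does.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Each instance runs a synthetic viewer that pulls a test stream for 30
    # minutes and then shuts itself down.
    user_data = """#!/bin/bash
    /usr/local/bin/stream-load-client --url https://test.example.com/live --minutes 30
    shutdown -h now
    """

    # A real run at this scale would batch launches across regions (or use
    # EC2 Fleet) and stay within account quotas; one call is shown here.
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder AMI with the load client baked in
        InstanceType="c5.large",
        MinCount=1000,
        MaxCount=1000,
        UserData=user_data,
        InstanceInitiatedShutdownBehavior="terminate",  # self-cleaning fleet
    )
    print(f"launched {len(response['Instances'])} load generators")
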
replies(4): >>42160912 #>>42161245 #>>42164365 #>>42164503 #
32. mewpmewp2 ◴[] No.42160912{5}[source]
Maybe they did. We don't know that they didn't. But the problem is that real-world traffic will always be different, varied, and dynamic in unexpected ways, and a particular link coming under stress can cause a ripple effect.
33. don-code ◴[] No.42160978[source]
Sometimes this just isn't feasible for cost reasons.

A company I used to work for ran a few Super Bowl ads. The level of traffic you get during a Super Bowl ad is immense, and it all comes at you in 30 seconds, before going back to a steady-state value just as quickly. The scale pattern is like nothing else I've ever seen.

Super Bowl ads famously cost seven million dollars. These are things we simply can't repeat year over year, even if we believed it'd generate the same bump in recognition each time.

34. kortilla ◴[] No.42161034{3}[source]
Unless the upstream dependency happens to maintain stable branches, constantly pulling in the latest versions increases your risk of new vulnerabilities more than it gets you patches for discovered bugs.
35. 01HNNWZ0MV43FF ◴[] No.42161140[source]
There's two other ways I've seen it phrased:

"Test what you fly, and fly what you test" (Supposedly from aviation)

"There should be one joint, and it should be greased regularly" (Referring to cryptosystems I think, but it's the same principle. Things like TLS will ossify if they aren't exercised. QUIC has provisions to prevent this.)

36. lxgr ◴[] No.42161245{5}[source]
How representative is an EC2 instance in a datacenter simulating user behavior really, though?

These would likely have completely different network connectivity and usage patterns, especially if they don't have historical data distributions to draw from because this was their first big live event.

replies(1): >>42161881 #
37. elcritch ◴[] No.42161436[source]
I suspect a lot of it could be related to ISP bandwidth. I streamed it on my phone without issue. Another friend put their TV on their phone’s WiFi which also worked. Could be partly that phone hotspots lower video bandwidth by default.

I suspect it’s a bit of both Netflix issues and ISPs over subscribing bandwidth.

38. yourapostasy ◴[] No.42161483{3}[source]
What I'm seeing in large organizations is that tracking dependencies within a team's scope goes better than tracking dependencies between teams. Many developers punt on tracking dependencies on other teams' artifacts if the organization doesn't already have a formal system for establishing contracts along those dependency routes, one that automatically handles state and notifications when an intended state change is put through. Usually some haphazard dependency representation gets embedded into software and developers call it a day, expecting the software to auto-magically solve a socio-technical logistical problem, without realizing that the state transitions of the dependencies aren't represented and the software could never deliver what they assume.
39. positr0n ◴[] No.42161557[source]
I would guess the majority of the streamed bandwidth was sourced from boxes like these in ISP's points of presences around the globe: https://openconnect.netflix.com/en/

So I agree the problems could have been localized to unique (region, ISP) combinations.

40. positr0n ◴[] No.42161566{4}[source]
One of the times I reloaded the page I got a raw envoy error message!
41. ravenstine ◴[] No.42161623[source]
There should be a caveat that this kind of decision should be based on experience and not treated as a rule that juniors might blindly follow. We all know how "fail fast and early" turned out (or whatever the exact phrase was).
42. mbrumlow ◴[] No.42161780{3}[source]
These are the very arguments I get all the time.

1) I want to invalidate caches, I want to know that these systems work. I want to know that my software properly handles this situation.

2) If I have lazy clients, I want to know, and I want to motivate them to update sooner or figure out how to force-update them. I don't want to skip updating because some people are slow. I want the norm to be that things update, so that when there is a reason to update, like a zero-day, I have some confidence the updates will work and the lazy clients will not be an issue.

I am not talking about fake or dry runs that go through some portion of motions, I want every aspect of the process to be real.

Performance means nothing if your stuff is down. And any perceived performance gained by not doing proper hygiene is just tweaking the numbers to look better than they really are.

replies(1): >>42165988 #
43. Jach ◴[] No.42161786{5}[source]
Not exactly, but it's worth the experiment in trying things anyway. Say you currently have a release once every few months, an ambitious goal would be to get to weekly releases. Continuous enough by comparison. But 'lower risk' is probably not the leading argument for the change, especially if the quarterly cycle has worked well enough, and the transition itself increases risk for a while. In order for a committed attempt to not devolve into a total dumpster fire, various other practices will need to be added, removed, or changed. (For example, devs might learn the concept of feature flags.) The individuals, which include management, might not be able to pull it off.
44. mbrumlow ◴[] No.42161789{4}[source]
This.

I can't tell you the number of times things worked only because the cache was hot, and a restart or cache invalidation would actually cause an outage.

Caches must be invalidated at a regular interval. Any system that does not do this is heading for some bad days.
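
A minimal sketch of what "invalidate at a regular interval" could look like in practice, assuming a Redis cache; the host name and the 10% sample rate are made up. The idea is to exercise cold-cache paths a little all the time instead of all at once after a surprise restart.

    import random
    import redis

    r = redis.Redis(host="cache.internal", port=6379)  # hypothetical cache host

    def invalidate_sample(fraction: float = 0.10) -> int:
        """Delete a random sample of keys so cold-cache paths get exercised
        continuously instead of all at once during a surprise restart."""
        dropped = 0
        for key in r.scan_iter(count=1000):
            if random.random() < fraction:
                r.delete(key)
                dropped += 1
        return dropped

    if __name__ == "__main__":
        # Run from a scheduler (cron, etc.) and watch origin load afterwards.
        print(f"invalidated {invalidate_sample()} keys")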

45. mbrumlow ◴[] No.42161813{5}[source]
Yes, because it lowers compound risk. The longer you go without stressing the system, the more likely you are to hit a double failure, which increases your outage duration.

Simply put: you don't want to delay finding out that something is broken; you want to know the second it breaks.

In the case I am suggesting, a failed release will often be deploying the same functionality, so many failure modes result in zero outage. Not all failure modes result in an outage.

When the software is expected to behave differently after the deployment, more systems can end up being part of the outage, such as when the new systems can't do something or the old systems can't.

46. leptons ◴[] No.42161881{6}[source]
>How representative is an EC2 instance in a datacenter simulating user behavior really, though?

Systemic issues causing widespread buffering aren't "user behavior". It's a problem with how Netflix is trying to distribute video. Sure, some connections aren't up to the task, and that isn't something Netflix can really control, unless they are looking to improve how their player falls back to lower-bitrate video, which could also be tested.

>because this was their first big live event.

That's the point of testing. They should have already had a "big live event" that nobody paid for, during automated testing. Instead they seem to have trusted, based on nothing more than expectations, that their very smart and very highly paid developers wouldn't embarrass them, but they failed. They could have done more rigorous "live" testing before rolling this out to the public.
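
The fallback logic being described is the kind of thing that can be unit-tested without any live event. A minimal sketch, with a made-up bitrate ladder and safety margin:

    # Choose the highest rung that fits within a safety margin of measured
    # throughput; fall back to the lowest rung rather than stalling.
    BITRATE_LADDER_KBPS = [15000, 8000, 4000, 2000, 800]  # highest to lowest

    def pick_bitrate(measured_throughput_kbps: float, safety: float = 0.8) -> int:
        budget = measured_throughput_kbps * safety
        for rung in BITRATE_LADDER_KBPS:
            if rung <= budget:
                return rung
        return BITRATE_LADDER_KBPS[-1]

    assert pick_bitrate(20000) == 15000   # healthy connection: top quality
    assert pick_bitrate(3000) == 2000     # congested: degrade instead of buffering
    assert pick_bitrate(200) == 800       # worst case: lowest rung, not a stall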

47. ta1243 ◴[] No.42163798{4}[source]
The point is that if the problem differed depending on the user, it was in a distribution layer, not in the encoding or production layer.

That eliminates a whole raft of problems.

48. throwaway2037 ◴[] No.42164365{5}[source]

    > Even my small team spins up 10,000 EC2 instances on the regular.
Woah, this sounds very cool. Can you share more details?
replies(1): >>42166996 #
49. vel0city ◴[] No.42164503{5}[source]
1) You don't know if they did or did not do this kind of testing. I don't see any proof either way here. You're assuming they didn't.

2) You're assuming whatever issue happened would have been caught by testing on generic EC2 instances in AWS. In the end these streams were going to users on tons of different platforms in lots of different network environments, most of which look nothing like an EC2 instance. Maybe there was something weird with the networking stack on TCL Roku TVs that ended up making connections reset rapidly, chewing up a lot of network resources, which led to other issues. What's the EC2 instance type API name for a 55" TCL Roku TV from six years ago on a congested 2.4GHz Wireless N link?

I don't know what happened in their errors. I do know I don't have enough information to say what tests they did or did not run.

50. uep ◴[] No.42164734[source]
My suspicion is the same as yours, that this may have been caused by local ISPs being overwhelmed, but it could be a million other things too. I had network issues. I live in a heavily populated suburban area. I have family who live 1000+ miles away in a slightly less populated suburban area, they had no issues at all.
51. patrick451 ◴[] No.42165472[source]
The ISP hypothesis doesn't make sense to me. I could not stream the live event from Netflix. But I could watch any other show on netflix or youtube or hulu at the same time.
replies(1): >>42168193 #
52. skybrian ◴[] No.42165988{4}[source]
I think it often makes sense to do full releases frequently, but not continuously. For example, Chrome is on an approximately four week schedule, which makes sense for them. Other projects have faster cadences. There is a point of diminishing returns, though, and you seem to be ignoring the downsides.
replies(1): >>42180037 #
53. leptons ◴[] No.42166996{6}[source]
I manage ~3000 customized websites based on the same template code. Sometimes we make changes to the template code that could affect the customizations - it is practically impossible to predict what might cause a problem due to the nature of the customizations. We'll take before and after screenshots of every page on every site, so it can get into the 100s of thousands of screenshots. We'll then run a diff on the screenshots to see what changed, reviewing the screenshots with the most significant changes. Then we'll address the problems we find and deploy the fixed release.

When we do these large screenshot operations, the EC2 instances are running for maybe 15 or 20 minutes total. It's not exactly cheap, but losing clients because we broke their site is something we want to avoid. The sites are hosted on a 3rd party service, and we're rate-limited by IP address, so to get this done in a reasonable amount of time we need to spin up 10,000 EC2 instances to distribute the work. We have our own software to manage the EC2 instances. It's honestly pretty simple, but effective.
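
A rough sketch of the screenshot-diff step described above, assuming the before/after PNGs already exist on disk and using Pillow for the comparison (the directory layout and the top-100 cutoff are made up):

    from pathlib import Path
    from PIL import Image, ImageChops

    def diff_score(before: Path, after: Path) -> float:
        """Return the fraction of pixels that differ between two screenshots."""
        a = Image.open(before).convert("RGB")
        b = Image.open(after).convert("RGB")
        if a.size != b.size:
            return 1.0  # the page grew or shrank; flag it for review
        delta = ImageChops.difference(a, b)
        changed = sum(1 for px in delta.getdata() if px != (0, 0, 0))
        return changed / (delta.width * delta.height)

    # Rank pages by how much they changed so humans review the worst offenders.
    pairs = [(p, Path("after") / p.name) for p in Path("before").glob("*.png")]
    scored = sorted(((diff_score(b, a), b) for b, a in pairs), reverse=True)
    for score, page in scored[:100]:
        print(page.name, round(score, 3))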

54. geor9e ◴[] No.42168193{3}[source]
Some ISPs have on-site Netflix Open Connect racks. The advantage of this is that they get a high-priority quality of service data stream into the rack, which then serves the cached content to the ISP customers. If your ISP doesn't have a big enough Netflix rack and it gets saturated, then you're getting your streams at the whim of congestion on the open internet. A live stream is a few seconds of video downloaded, and it has to make it over the congestion of the internet in a few seconds and then repeat. If a single one of these repeats hits congestion and gets delayed, you see the buffering spinning wheel. Other shows, on the other hand, can show the cached Netflix splash animation for 10 seconds while they request 20 minutes of cache until they get it. So, dropped packets don't matter much. Even if the internet is seeing congestion every couple of minutes, delaying your packets, it won't matter as non-live content is very flexible and patient about when it receives the next 20-minute chunk. I'm not an ISP or Netflix engineer, so don't take these as exact numbers. I'm just explaining how the "bandwidth problems might be localized" hypothesis can make sense from my general understanding.
55. NoPicklez ◴[] No.42168444[source]
I think Netflix has a fair bit of organisational muscle; perhaps the fight was considered not as large an event as the NFL streams will be in the future.

Also, "no experience in"? Really? You have no idea if that's actually the case.

56. mbrumlow ◴[] No.42180037{5}[source]
I think once a week is good. Maybe once every two weeks.
57. kpierce ◴[] No.42200724{3}[source]
Yes, I agree the fight had a great deal of interest, but the NFL is their real goal.