492 points storf45
dylan604 No.42157048
People just do not appreciate how many gotchas can pop up when doing anything live. Sure, Netflix might have a great CDN that works well for their canned content, and I could see how they might have assumed that's the hardest part.

Live has changed over the years: from large satellite dishes beaming to a geosat and back down to the broadcast center ($$$$$), to microwave links to a more local broadcast center ($$$$), to running dedicated long-haul fiber back to a broadcast center ($$$), to having a kit with multiple cell providers pushing a signal back to a broadcast center ($$), to having a direct internet connection to a server accepting a live HTTP stream ($).

I'd be curious to know what their live plan was and what their redundant plan was.

replies(6): >>42157110 #>>42157117 #>>42157164 #>>42159101 #>>42159285 #>>42159954 #
colesantiago No.42157164
This is the whole point of chaos engineering, which was invented at Netflix and tests the resiliency of these systems.

I guess we now know the limits of what "at scale" is for Netflix's live-streaming solution. They shouldn't be failing at scale on a huge stage like this.

I look forward to reading the postmortem about this.

replies(1): >>42157426 #
dylan604 No.42157426
Everyone keeps mentioning "at scale". I seriously doubt this was an "at scale" problem. I have a strong suspicion this was a failure of the origination point to push a stable signal. That is not an "at scale" issue, but the hubris of "we can do better/cheaper than standard broadcasting practices."
replies(6): >>42157737 #>>42158523 #>>42159296 #>>42159379 #>>42159456 #>>42160379 #
1. kortilla No.42159296
I highly doubt this. Netflix has a system of OCAs (Open Connect Appliances) that are loaded with hard disks, are installed in ISPs' networks, and serve the majority of those ISPs' customers.

Given that many people had no problems with the stream, it is unlikely to have been an origin problem and more likely a problem with the mechanism that fans out quickly to the OCAs. Normally latency to an OCA doesn't matter when you're replicating new catalogs in advance, but live streaming makes a bunch of code that previously "didn't need to be fast" get promoted to the hot path.
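
To make the "promoted to the hot path" point concrete, here is a toy sketch. It is not Netflix's actual design, and the function name, parameters, and numbers are all made up for illustration. The assumption is that a segment is only useful if it reaches the edge cache before viewers need it, so the same fan-out latency that is harmless for pre-positioned catalog files can cause stalls for live.

```python
def late_segments(fanout_latencies_s, playback_delay_s):
    """Return indexes of segments that reach the edge cache too late.

    fanout_latencies_s: seconds each segment took from origin to edge cache
                        (hypothetical measurements).
    playback_delay_s:   how far behind the origin viewers are playing.
    """
    return [
        i
        for i, latency in enumerate(fanout_latencies_s)
        if latency > playback_delay_s
    ]


# Hypothetical numbers, chosen only for illustration.
latencies = [1.0, 2.5, 9.0, 3.0]

# Pre-positioned catalog: files are copied hours before anyone presses play.
print(late_segments(latencies, playback_delay_s=6 * 3600))  # []

# Live: viewers sit only a few seconds behind the origin.
print(late_segments(latencies, playback_delay_s=6.0))  # [2]
```

Same replication path, same latencies; the only thing that changed is how much slack there is before someone needs the bytes.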