492 points storf45 | 10 comments
grogenaut ◴[] No.42160548[source]
This topic is really just fun for me to read based on where I work and my role.

Live is a lot harder than on demand, especially when you can't estimate demand (which I'm sure was hard to do here). People are definitely not understanding that. Then there's the fact that Netflix is well regarded for its engineering, if not quite to the point of snobbery.

What is actually interesting to me is that they went for an event like this, which is very hard to predict, as one of their first major forays into live, instead of something that's a lot easier to predict, like a baseball game or an NFL game.

I have to wonder if part of the NFL allowing Netflix to do the Christmas games was Netflix proving out, at least a month beforehand, that they could handle live streams. The NFL seems to be quite particular (in a good way) about the quality of the delivery of its content, so I wouldn't put it past them.

replies(3): >>42160748 #>>42160770 #>>42160867 #
1. devit ◴[] No.42160867[source]
Why is live a lot harder?

Aside from latency (which isn't much of a problem unless you are competing with TV or some other distribution system), it seems easier than on-demand, since you send the same data to everyone and don't need to handle having a potentially huge library in all datacenters (you have to distribute the data, but that's just like having an extra few users per server).

My guess is that the problem was simply that the number of people viewing Netflix at once in the US was much larger than usual and higher than what they could scale to, or alternatively that a software bug was triggered.

replies(4): >>42161026 #>>42161045 #>>42161084 #>>42161376 #
2. michaelt ◴[] No.42161026[source]
Latency is somewhat important for huge sporting events; you don't want every tense moment spoiled by the cheers of your neighbours whose feed is 20 seconds ahead.

With on-demand you can push the episodes out through your entire CDN at your leisure. It doesn't matter if some bottleneck means it takes 2 hours to distribute a 1 hour show worldwide, if you're distributing it the day before. And if you want to test, or find something that needs fixing? You've got plenty of time.

And on-demand viewers can trickle in gradually - so if clients have to contact your DRM servers for a new key every 15 minutes, they won't all be doing it at the same moment.
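
A quick back-of-envelope comparison of those two shapes of DRM load (a Python sketch with made-up numbers, just to show the scale difference):

    # Illustrative figures only -- not Netflix's actual numbers.
    viewers = 5_000_000              # assumed concurrent audience
    key_interval_s = 15 * 60         # the 15-minute key rotation mentioned above

    # On-demand: start times are staggered, so key requests spread out
    # roughly evenly across the rotation window.
    spread_rps = viewers / key_interval_s
    print(f"on-demand, staggered:  ~{spread_rps:,.0f} key requests/sec")

    # Live: everyone joined for the same event, so each rotation lands
    # as a burst within a few seconds.
    burst_window_s = 5               # assumed spread of the burst
    burst_rps = viewers / burst_window_s
    print(f"live, synchronized:    ~{burst_rps:,.0f} key requests/sec at each rotation")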

And if you did have a brief hiccup with your DRM servers - could you rely on the code quality of abandonware Smart TV clients to save you?

replies(2): >>42161371 #>>42163609 #
3. MattDaEskimo ◴[] No.42161045[source]
I'm not an expert in this, but at least familiar with the trade.

I'd imagine with on-demand services you already have the full content, and therefore can use algorithms to compress frames and perform all kinds of neat tricks.

With live streaming I'd imagine a lot of these algorithms are useless, as there isn't enough delay and time to properly use them, so they're required to stream every single pixel, with maybe some JIT algorithms on top.
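
Roughly that difference, as a hedged sketch (Python driving ffmpeg; the filenames, bitrates and settings are made up for illustration, not anything Netflix actually uses):

    import subprocess

    # On-demand: the encoder sees the whole file, so it can do a two-pass
    # encode and spend bits where they help most.
    common = ["-c:v", "libx264", "-b:v", "4M", "-preset", "slow"]
    subprocess.run(["ffmpeg", "-y", "-i", "episode.mkv", *common,
                    "-pass", "1", "-an", "-f", "null", "/dev/null"], check=True)
    subprocess.run(["ffmpeg", "-y", "-i", "episode.mkv", *common,
                    "-pass", "2", "-c:a", "aac", "episode_vod.mp4"], check=True)

    # Live: single pass, low-latency tuning, frequent keyframes so new viewers
    # can join quickly -- worse quality per bit at the same bitrate.
    subprocess.run(["ffmpeg", "-i", "live_input.ts",
                    "-c:v", "libx264", "-b:v", "4M", "-preset", "veryfast",
                    "-tune", "zerolatency", "-g", "60", "-keyint_min", "60",
                    "-c:a", "aac", "-f", "hls", "-hls_time", "2",
                    "stream.m3u8"], check=True)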

replies(1): >>42161378 #
4. nemothekid ◴[] No.42161084[source]
On demand is easier precisely because having a huge library in all data centers is relatively cheap. In actuality you just have caches colocated at ISPs that pull from your origin servers. Likely you have users all watching different things, so you can easily avoid hot spots by sharding on the content. Once the in-demand content is in the cache, it's relatively easy to serve.

Live content is harder because it can't really be cached, nor, due to TLS, can you really serve everyone the same stream. I think the hardest problem to solve is provisioning. If you are expecting 1 million users and 700,000 of them get routed to a single server, that server will begin to struggle. This can happen in a couple of different ways - for example, an ISP that isn't normally a large consumer suddenly overloads its edge server. Even though your DC can handle the traffic just fine, the links between your DC and the ISP begin to suffer, and since the event is live, it's not like you can just wait until the cache is filled downstream.
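
A toy sketch of what that sharding-by-content looks like, and why it falls apart for live (hypothetical node names, nothing like the real Open Connect logic):

    import hashlib

    CACHE_NODES = ["cache-01", "cache-02", "cache-03", "cache-04"]  # hypothetical

    def node_for(content_id: str) -> str:
        """Pin each title to a cache node by hashing its ID."""
        h = int(hashlib.sha256(content_id.encode()).hexdigest(), 16)
        return CACHE_NODES[h % len(CACHE_NODES)]

    # On-demand: thousands of different titles spread naturally across nodes.
    for title in ["squid-game-s02e01", "stranger-things-s04e09", "some-old-movie"]:
        print(title, "->", node_for(title))

    # Live: every viewer wants the same content ID, so the shard key no longer
    # spreads the load -- whichever node owns it becomes the hot spot.
    print("live-event", "->", node_for("live-event"))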

replies(1): >>42188367 #
5. MBCook ◴[] No.42161371[source]
That has been a big problem for football, especially things like the Super Bowl.

People using over-the-air antennas get it “live”. Getting it from cable or a streaming service meant anywhere between a few seconds and over a minute of delay.

It was absolutely common to have a friend text you about something that had just happened when you hadn’t even seen it yet.

You can’t even say that $some_service is fast; some of them vary by over 60 seconds just between their own users.

https://www.phenixrts.com/resource/super-bowl-2024

6. avidiax ◴[] No.42161376[source]
I'm in an adjacent space, so I can imagine some of the difficulties. Basically live streaming is a parallel infrastructure that shares very little with pre-recorded streaming, and there are many failure points.

* Encoding - low latency encoders are quite different than storage encoders. There is a tradeoff to be made in terms of the frequency of key frames vs. overall encoding efficiency. More key frames means that anyone can tune in or recover from a loss more quickly, but it is much less efficient, reducing quality. The encoder and infrastructure should emit transport streams, which are also less efficient but more reliable than container formats like mp4.

* Adaptation - Netflix normally encodes their content as a ladder of various codecs and bitrates. This ensures that people get roughly the maximum quality that their bandwidth will allow without buffering. For a live event, you need the same ladder, and the clients need to switch between rungs invisibly (there's a rough sketch of that rung selection after this list).

* Buffering - for static content, you can easily buffer 30 seconds to a minute of video. This means that small latency or packet loss spikes are handled invisibly at the transport/buffering layer. You can't do this for a live event, since that level of delay would usually be unacceptable for a sporting event. You may only be able to buffer 5-10 seconds. If the stream starts to falter, the client has only a few seconds to detect and shift to a lower rung.

* Transport - Prerecorded media can use a reliable transport like TCP (usually HLS). In contrast, live video would ideally use an unreliable transport like UDP, but with FEC (forward error correction). TCP's reaction to packet loss is to halve the congestion window, which halves bandwidth, so the connection has to thrash before the client can shift to a lower-bandwidth rung.

* Serving - pre-recorded media can be synchronized to global DCs. Live events have to be streamed reliably and redundantly to a tree of servers. Those servers need to be load balanced, and the clients must implement exponential backoff or you can have cascading failures.

* Timing - Unlike pre-recorded media, any client that has a slightly fast clock will run out of frames and either need to repeat frames and stretch audio, or suffer glitches. If you resolve this on the server side by stretching the media, you will add complication and your stream will slowly get behind the live event.

* DVR - If you allow the users to pause, rewind, catch up, etc., you now have a parallel pre-recorded infrastructure and the client needs to transition between the two.

* DRM - I have no idea how/if this works on a live stream. It would not be ideal that all clients use the same decryption keys and have the same streams with the same metadata. That would make tracing the source of a pirate stream very difficult. Differentiation/watermarking adds substantial complexity, however.
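
To make the adaptation and buffering points above concrete, here is a minimal rung-selection sketch (assumed bitrates and thresholds; not any real player's algorithm):

    # Pick the highest rung that fits measured throughput, and leave more
    # headroom when the small live buffer is close to running dry.
    LADDER_KBPS = [800, 1800, 3500, 6000, 12000]   # assumed bitrate ladder

    def choose_rung(throughput_kbps: float, buffer_s: float) -> int:
        safety = 0.7 if buffer_s > 5 else 0.5      # be conservative when low on buffer
        budget = throughput_kbps * safety
        affordable = [b for b in LADDER_KBPS if b <= budget]
        return affordable[-1] if affordable else LADDER_KBPS[0]

    print(choose_rung(throughput_kbps=9000, buffer_s=8))   # -> 6000
    print(choose_rung(throughput_kbps=9000, buffer_s=2))   # -> 3500 (buffer draining, back off)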

7. MBCook ◴[] No.42161378[source]
People are always impressed that Netflix can stand up to a new episode of Squid Game being released. And it’s not easy; we’ve seen HBO fail to handle Game of Thrones, for example.

But in either case, you can put that stuff on your CDN days ahead of time. You can choose to preload it in the cache because you know a bunch of people are gonna want it. You also know that not every single individual is going to start at the exact same time.

For live, every single person wants every single byte at the same time and you can’t preload anything. Brutal.

8. omnee ◴[] No.42163609[source]
Latency between the live TV signal for my neighbours and the BBC iPlayer app I was using to watch the Euro 2024 final literally ruined the main moments for me. It still remains an unsolved issue long into the advent of live streaming.
9. pas ◴[] No.42188367[source]
... what do you mean it cannot be cached?

isn't it a tree of cache servers? as origin sends the frames they're cached.

and as load grows the tree has to grow too, and when it can't, resort to degrading bitrate, and ultimately to load shedding, to keep the viewers happy?

and it seems Netflix opted to forgo the last one to avoid the bad PR of a "we are over capacity" error message, and instead went with letting it burn, no?

replies(1): >>42197487 #
10. nemothekid ◴[] No.42197487{3}[source]
>... what do you mean it cannot be cached?

When I say "cached", I mean that the PoP server can serve content without contacting the origin server. (The PoP can't serve content it does not have.)
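
A toy pull-through cache to illustrate the distinction (hypothetical classes, not Open Connect's actual behavior):

    class Origin:
        def fetch(self, seg_id: str) -> bytes:
            return b"segment-bytes-for-" + seg_id.encode()

    class PopCache:
        def __init__(self, origin: Origin):
            self.origin = origin
            self.store: dict[str, bytes] = {}

        def get_segment(self, seg_id: str) -> bytes:
            if seg_id in self.store:             # hit: origin never sees the request
                return self.store[seg_id]
            data = self.origin.fetch(seg_id)     # miss: must go upstream
            self.store[seg_id] = data
            return data

    # On-demand: segments were fetched (or pushed) hours earlier, so almost every
    # request is a hit. Live: each new segment exists only seconds before millions
    # of clients want it, so the first wave of requests per segment races upstream.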

>and it seems Netflix opted to forgo the last one to avoid the bad PR of a "we are over capacity" error message, and instead went with letting it burn, no?

Anything other than 100% uptime is bad PR for Netflix.