
362 points tosh | 21 comments
1. trollied ◴[] No.42069524[source]
>In a typical TCP/IP network connected via ethernet, the standard MTU (Maximum Transmission Unit) is 1500 bytes, resulting in a TCP MSS (Maximum Segment Size) of 1448 bytes. This is much smaller than our 3MB+ raw video frames.

> Even the theoretical maximum size of a TCP/IP packet, 64k, is much smaller than the data we need to send, so there's no way for us to use TCP/IP without suffering from fragmentation.

Just highlights that they do not have enough technical knowledge in house. Should spend the $1m/year saving on hiring some good devs.
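For scale, a quick back-of-envelope check of the quoted numbers (3 MB frame, 1448-byte MSS) — the frame count and MSS come from the quote; the rest is just arithmetic:

```python
# How many TCP segments one raw video frame spans, assuming a 3 MB
# frame and the standard 1448-byte MSS quoted above.
import math

FRAME_BYTES = 3 * 1024 * 1024   # 3 MB raw frame
MSS = 1448                      # typical MSS with a 1500-byte MTU

segments = math.ceil(FRAME_BYTES / MSS)
print(segments)  # 2173 segments per frame
```

At 60 fps that is over 130,000 segments per second, which is why the article treats plain TCP as a poor fit.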

replies(5): >>42069956 #>>42070181 #>>42070248 #>>42070804 #>>42070811 #
2. karamanolev ◴[] No.42069956[source]
I fail to see how TCP/IP fragmentation really affects this use case. I don't know why it's mentioned, or why, given that there aren't multiple network devices with different MTUs, it would cause issues. Am I right? Is that the lack of technical knowledge you're referring to, or am I missing something?
replies(1): >>42069979 #
3. drowsspa ◴[] No.42069979[source]
Sounds weird that apparently they expected to send 3 MB in a single TCP packet
replies(2): >>42070420 #>>42074549 #
4. maxmcd ◴[] No.42070181[source]
Please explain?
5. hathawsh ◴[] No.42070248[source]
Why do you say that? Their solution of using shared memory (structured as a ring buffer) sounds perfect for their use case. Bonus points for using Rust to do it. How would you do it?

Edit: I guess perhaps you're saying that they don't know all the networking configuration knobs they could exercise, and that's probably true. However, they landed on a better solution that avoided networking altogether, so they no longer had any need to research network configuration. I'd say they made the right choice.

replies(2): >>42070439 #>>42076768 #
6. bcrl ◴[] No.42070420{3}[source]
Modern NICs will do that for you via a feature called TSO -- TCP Segmentation Offload.

More shocking to me is that anyone would attempt to run network throughput oriented software inside of Chromium. Look at what Cloudflare and Netflix do to get an idea what direction they should really be headed in.

replies(1): >>42071388 #
7. maxmcd ◴[] No.42070439[source]
Yes, maybe they're talking about this: https://en.wikipedia.org/wiki/TCP_window_scale_option
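If it is window scaling they meant, note it only enlarges the receive window, not packet sizes. A quick sketch of the arithmetic, using the limits from RFC 7323:

```python
# TCP window scaling (RFC 7323): the 16-bit window field caps at
# 65535 bytes; the scale option left-shifts it by up to 14 bits.
BASE_WINDOW = 2**16 - 1   # 65535, the unscaled maximum
MAX_SHIFT = 14            # largest shift the RFC allows

max_window = BASE_WINDOW << MAX_SHIFT
print(max_window)         # 1073725440 bytes, just under 1 GiB
```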
8. adamrezich ◴[] No.42070804[source]
This reminds me of when I was first starting to learn “real game development” (not using someone else's engine)—I was using C#/MonoGame, and, while having no idea what I was doing, decided I wanted vector graphics. I came across libcairo, figured out how to use it, set it all up correctly and everything… and then found that, whoops, sending 1920x1080x4 bytes to your GPU to render, 60 times a second, doesn't exactly work—for reasons that were incredibly obvious, in retrospect! At least it didn't cost me a million bucks to learn from my mistake.
replies(1): >>42077690 #
9. lttlrck ◴[] No.42070811[source]
The article reads like a personal "learn by doing" blog post.
10. oefrha ◴[] No.42071388{4}[source]
They use Chromium (or any other browser) not out of choice but because they have to in order to participate in third party video conference sessions. Of course it’s best to reverse engineer the video conferencing clients and do HTTP requests directly without a headless browser, but I presume they’ve tried that and it’s very difficult, not to mention prone to breaking at any moment.

What’s surprising to me is they can’t access the compressed video on the wire and have to send decoded raw video. But presumably they’ve thought about that too.

replies(1): >>42074163 #
11. dmazzoni ◴[] No.42074163{5}[source]
I'm assuming it's because the compressed video on the wire is encrypted?
12. ahoka ◴[] No.42074549{3}[source]
Especially considering there are no packets in TCP.
replies(1): >>42086080 #
13. kikimora ◴[] No.42076768[source]
> Why do you say that?

Because, reading how they came up with the solution, it is clear they have little understanding of how low-level stuff works. For example, they were surprised by the amount of data, by the fact that TCP packets are not the same as application-level packets or frames, etc.

As for the ring buffer design, I'm not sure I understand their solution. The article mentions the media encoder runs in a separate process. Chromium threads live in their own processes (afaik) as well. But the ring buffer requirement says “lock free”, which only makes sense inside a single process.

replies(2): >>42086021 #>>42095669 #
14. namibj ◴[] No.42077690[source]
The sending is fine; cairo just won't create these bitmaps fast enough.
replies(1): >>42082926 #
15. adamrezich ◴[] No.42082926{3}[source]
Was this true back in 2011 or so? I'm genuinely curious—this may be yet another layer of me having no idea of what I was doing at the time, but I thought I remember determining (somehow) that the problem was the CPU-to-GPU bottleneck. It may have been that I got 720p 30FPS working just fine, but then 1080p was in the single digits, and I just made a bad assumption, or something.
replies(1): >>42083542 #
16. jmb99 ◴[] No.42083542{4}[source]
1080p@60 is “only” around 500MB/s, which should have been possible a decade ago. PCIe 1.0 x16 bandwidth maxed out at 4GB/s, so even if you weren’t on a top of the line system with PCIe 2.0 (or brand new 3.0!) you should have been fine on that front[1].

More than likely the CPU wasn’t able to keep up. The pipeline was likely generating a frame, storing it to memory, copying from memory to the PCIe device memory, displaying the frame, then generating the next frame. It wouldn’t surprise me if a ~2010 era CPU struggled doing so.

[1] Pretty much any transfer like this is going to be limited by link speed, not by the GPU's memory bandwidth. An 8800GTS 320MB from 2007 had a theoretical memory bandwidth of around 64GB/s, for reference.
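A quick check of those figures (frame size and fps from the comments above, PCIe 1.0 x16 bandwidth as quoted):

```python
# Sanity check: raw 1080p RGBA at 60 fps versus PCIe 1.0 x16.
frame_bytes = 1920 * 1080 * 4   # one RGBA frame
per_second = frame_bytes * 60   # 60 fps

print(per_second / 1e6)         # ~497.7 MB/s, i.e. "around 500MB/s"

pcie1_x16 = 4e9                 # ~4 GB/s, per the comment
print(pcie1_x16 / per_second)   # the link has roughly 8x headroom
```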

17. rstuart4133 ◴[] No.42086021{3}[source]
> But ring buffer requirement says “lock free” which only make sense inside a single process.

No, "lock free" is a thing that's nice to have when you've got two threads accessing the same memory. It doesn't matter if those two threads are in the same process or it's two different processes accessing the same memory. It's almost certainly two different processes in this case, and the shared memory is probably a memory-mapped file.

Whatever it is, the shared memory approach is going to be much faster than using the kernel to ship the data between the two processes. Going via the kernel means two copies, and probably two syscalls as well.
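A minimal sketch of that layout: a single-producer/single-consumer ring in shared memory. Python has no real atomics, so the head/tail updates here merely stand in for the atomic acquire/release stores a production implementation (e.g. in Rust) would use; all names and sizes are illustrative:

```python
# SPSC ring buffer over POSIX shared memory (illustrative sketch).
# Header = two u32 monotonically increasing indices (head, tail);
# slots follow. In real code the index updates must be atomic.
import struct
from multiprocessing import shared_memory

SLOTS = 4
SLOT_SIZE = 16   # tiny "frames" for the sketch
HDR = 8          # two u32 indices

shm = shared_memory.SharedMemory(create=True, size=HDR + SLOTS * SLOT_SIZE)
buf = shm.buf
struct.pack_into("II", buf, 0, 0, 0)   # head = tail = 0

def push(data):
    head, tail = struct.unpack_from("II", buf, 0)
    if head - tail == SLOTS:                 # full
        return False
    off = HDR + (head % SLOTS) * SLOT_SIZE
    buf[off:off + len(data)] = data
    struct.pack_into("I", buf, 0, head + 1)  # publish after the write
    return True

def pop():
    head, tail = struct.unpack_from("II", buf, 0)
    if head == tail:                         # empty
        return None
    off = HDR + (tail % SLOTS) * SLOT_SIZE
    data = bytes(buf[off:off + SLOT_SIZE])
    struct.pack_into("I", buf, 4, tail + 1)  # free the slot
    return data

ok = push(b"frame-0".ljust(SLOT_SIZE, b"\0"))
out = pop()
print(out[:7])   # b'frame-0'

buf.release()    # drop the view before tearing down the segment
shm.close()
shm.unlink()
```

A second process would attach with `SharedMemory(name=shm.name)` and run `pop` while the producer runs `push`; handling a peer that dies mid-update is exactly the hard part kikimora raises below.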

replies(1): >>42163399 #
18. rstuart4133 ◴[] No.42086080{4}[source]
There are no packets in the user's API. But under the hood everything is sent in packets, numbered, ACK'ed and checksummed. The maximum packet size supported by IP is 64KB, as they say. I'm surprised the kernel supports that, because I'm not aware of any real device that supports packets that big (Ethernet jumbo frames are only 9KB), but I guess it must.
19. evoke4908 ◴[] No.42095669{3}[source]
"Lock-free" does not in any way imply a single process. Quite the opposite. We don't call single-threaded code lock-free, because all single-threaded code is lock-free by definition. You kind of can't use locks at all in that context, so it makes no sense to describe it as lock-free. It's like gluten-free water: complete nonsense.

Lock-free code is designed for concurrent access, but using some clever tricks to handle synchronization between processes without actually invoking a lock. Lock-free explicitly means parallel.

replies(1): >>42163373 #
20. kikimora ◴[] No.42163373{4}[source]
I’m talking about a single process with multiple threads, where lock-free makes sense.
21. kikimora ◴[] No.42163399{4}[source]
I understand you can set up a data structure in shared memory and use lock-free instructions to access it. However, I have never seen this done in practice, due to the complexity. One particularly complicated scenario that comes to mind is dealing with unexpected process failures. That is quite different from dealing with exceptions in a thread.