
306 points by carlos-menezes | 2 comments
cletus | No.41891721
At Google, I worked on a pure JS Speedtest. At the time, Ookla was still Flash-based, so it wouldn't work on Chromebooks. That was a problem for installers who needed to verify an installation. I learned a lot about how TCP (I realize QUIC is UDP) responds to various factors.

I look at this article and consider the result pretty much as expected. Why? Because it pushes the flow control out of the kernel (and possibly network adapters) into userspace. TCP has flow control and sequencing. QUIC makes you manage that yourself (sort of).

Now there can be good reasons to do that. TCP congestion control is famously out-of-date with modern connection speeds, leading to newer algorithms like BBR [1], but it comes at a cost.

But here's my biggest takeaway from all that, and it's something so rarely accounted for in network testing, testing of Web applications and so on: latency.

Anyone who lives in Asia or Australia should relate to this. 100ms RTT latency can be devastating. It can take something that is completely responsive and make it utterly unusable. It limits the bandwidth a connection can support (because of the windows) and makes it less responsive to errors and congestion control efforts (both up and down).
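
To put a number on the window effect (using an illustrative 64 KB window, i.e. the classic pre-window-scaling maximum, not a measured value): a connection can never move data faster than one window per round trip,

    max throughput <= window size / RTT
    64 KB / 100 ms = 640 KB/s ≈ 5 Mbit/s

no matter how fat the underlying pipe is, until window scaling and buffer autotuning kick in.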

I would strongly urge anyone testing a network or Web application to run tests where they randomly add 100ms to the latency [2].
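
One low-tech way to do that in a test harness is a small delay-injecting TCP proxy you point your client at. A rough Go sketch (the addresses and the 100ms figure are placeholders, not from any real setup):

    package main

    import (
        "log"
        "net"
        "time"
    )

    const extraDelay = 100 * time.Millisecond // added one-way delay (placeholder)

    // delayedCopy forwards src->dst, delivering each chunk no earlier than
    // extraDelay after it was read. Reads keep going while earlier chunks
    // wait, so we add latency without also strangling throughput.
    func delayedCopy(dst, src net.Conn) {
        type chunk struct {
            data   []byte
            sendAt time.Time
        }
        ch := make(chan chunk, 1024)
        go func() {
            defer dst.Close()
            for c := range ch {
                time.Sleep(time.Until(c.sendAt))
                if _, err := dst.Write(c.data); err != nil {
                    return
                }
            }
        }()
        buf := make([]byte, 32*1024)
        for {
            n, err := src.Read(buf)
            if n > 0 {
                data := make([]byte, n)
                copy(data, buf[:n])
                ch <- chunk{data: data, sendAt: time.Now().Add(extraDelay)}
            }
            if err != nil {
                close(ch)
                return
            }
        }
    }

    func main() {
        // Point the client at :9000; the proxy forwards to the real service.
        ln, err := net.Listen("tcp", "127.0.0.1:9000")
        if err != nil {
            log.Fatal(err)
        }
        for {
            client, err := ln.Accept()
            if err != nil {
                log.Fatal(err)
            }
            go func(c net.Conn) {
                upstream, err := net.Dial("tcp", "127.0.0.1:8080") // real service (placeholder)
                if err != nil {
                    c.Close()
                    return
                }
                go delayedCopy(upstream, c)
                go delayedCopy(c, upstream)
            }(client)
        }
    }

Running the same test with and without the proxy makes window stalls and chatty request patterns show up immediately.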

My point in bringing this up is that the overhead of QUIC may not practically matter, because your effective bandwidth over a single TCP connection (or QUIC stream) may be MUCH lower than your actual raw bandwidth. Put another way, 45% extra data may still be a win, because managing your own congestion control might give you higher effective speed between the two parties.

[1]: https://atoonk.medium.com/tcp-bbr-exploring-tcp-congestion-c...

[2]: https://bencane.com/simulating-network-latency-for-testing-i...

klabb3 | No.41892102
I did a bunch of real world testing of my file transfer app[1]. Went in with the expectation that QUIC would be amazing. Came out frustrated for many reasons and switched back to TCP. It's obvious in hindsight, but with TCP you say "hey kernel, send this giant buffer please", whereas UDP is datagram-by-datagram! So even pushing zeroes has a massive CPU cost on most OSs and consumer hardware, from all the mode switches. Yes, there are ways around it, but they're not easy or mature in my experience. Plus it limits your choice of languages/libraries/platforms.
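
To make the syscall asymmetry concrete, a rough Go sketch of the naive versions of both (example address, no batching, no pacing):

    package main

    import (
        "log"
        "net"
    )

    func main() {
        payload := make([]byte, 1<<30) // 1 GiB of zeroes, just to push bytes

        // TCP: one Write call from user code; the runtime retries partial
        // writes and the kernel handles segmentation and pacing.
        tcp, err := net.Dial("tcp", "192.0.2.1:9000") // example address
        if err != nil {
            log.Fatal(err)
        }
        if _, err := tcp.Write(payload); err != nil {
            log.Fatal(err)
        }
        tcp.Close()

        // UDP: a naive loop does one syscall per ~1472-byte datagram,
        // i.e. roughly 730k mode switches for the same gigabyte, before
        // reaching for sendmmsg/GSO-style batching.
        udp, err := net.Dial("udp", "192.0.2.1:9000")
        if err != nil {
            log.Fatal(err)
        }
        defer udp.Close()
        const mtuPayload = 1472
        for off := 0; off < len(payload); off += mtuPayload {
            end := off + mtuPayload
            if end > len(payload) {
                end = len(payload)
            }
            if _, err := udp.Write(payload[off:end]); err != nil {
                log.Fatal(err)
            }
        }
    }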

(Fun bonus story: I noticed significant drops in throughput when using battery on a MacBook. Something to do with the efficiency cores I assume.)

Secondly, quic does congestion control poorly (I was using quic-go so mileage may vary). No tuning really helped, and TCP streams would take more bandwidth if both were present.

Third, the APIs are weird, man. So, QUIC itself has multiple streams, which makes it not a drop-in replacement for TCP. However, the idea is to have HTTP/3 be drop-in replaceable at a higher level (which I can't speak to, because I didn't do that). But it's worth keeping in mind if you're working at the stream level.
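
For a feel of that stream-level shape, a sketch assuming a recent quic-go API (github.com/quic-go/quic-go; exact signatures have shifted between releases, and the address/ALPN string here are made up):

    package main

    import (
        "context"
        "crypto/tls"
        "log"

        "github.com/quic-go/quic-go"
    )

    func main() {
        ctx := context.Background()
        tlsConf := &tls.Config{
            InsecureSkipVerify: true,             // test only
            NextProtos:         []string{"demo"}, // ALPN is mandatory for QUIC
        }

        // One connection...
        conn, err := quic.DialAddr(ctx, "192.0.2.1:4433", tlsConf, nil) // example address
        if err != nil {
            log.Fatal(err)
        }
        defer conn.CloseWithError(0, "done")

        // ...but many independent streams, each its own ordered byte pipe.
        // There is no single stream to hand to code written against net.Conn.
        for i := 0; i < 2; i++ {
            stream, err := conn.OpenStreamSync(ctx)
            if err != nil {
                log.Fatal(err)
            }
            if _, err := stream.Write([]byte("hello")); err != nil {
                log.Fatal(err)
            }
            stream.Close()
        }
    }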

In conclusion I came out pretty much defeated, but also with a newfound respect for all the optimizations and resilience of our old friend TCP. It's really an amazing piece of tech. And it's just there, for free, always provided by the OS. Even some of the main issues with TCP are not design faults but conservative/legacy defaults (buffer limits on Linux, Nagle, etc). I really just wish we could improve it instead of reinventing the wheel.
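
For what it's worth, most of those legacy defaults are one call away. A sketch against a plain *net.TCPConn (the sysctl names in the comments are the usual Linux knobs; the address and sizes are arbitrary):

    package main

    import (
        "log"
        "net"
    )

    func main() {
        conn, err := net.Dial("tcp", "192.0.2.1:9000") // example address
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
        tcp := conn.(*net.TCPConn)

        // Nagle: TCP_NODELAY. Go already disables Nagle by default;
        // call SetNoDelay(false) if you actually want small-write batching.
        tcp.SetNoDelay(true)

        // Socket buffers: SO_RCVBUF / SO_SNDBUF. On Linux these are capped
        // by net.core.rmem_max / net.core.wmem_max unless raised via sysctl.
        tcp.SetReadBuffer(4 << 20)  // 4 MiB, purely illustrative
        tcp.SetWriteBuffer(4 << 20)
    }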

[1]: https://payload.app/

eptcyka | No.41893050
One does not need to, and should not, send one packet per syscall.
tomohawk | No.41894736
On Linux, there is sendmmsg, which can send up to 1024 packets per call, but that is a far cry from a single syscall to send a 1GB file. With GSO, it is possible to send even more data per call, but the absolute limit is 64KB * 1024 per syscall, and it is fiddly to pack the datagrams so that this works correctly.

You might think you can send datagrams of up to 64KB, but due to limitations in how IP fragment reassembly works, you really must do your best not to let IP fragmentation occur, so 1472 bytes is the largest payload in most circumstances.
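
For illustration, here's roughly what that batching looks like from Go via golang.org/x/net/ipv4, whose WriteBatch is backed by sendmmsg(2) on Linux (the target address and batch size here are made up):

    package main

    import (
        "log"
        "net"

        "golang.org/x/net/ipv4"
    )

    func main() {
        conn, err := net.ListenUDP("udp4", nil) // ephemeral local port
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        dst, err := net.ResolveUDPAddr("udp4", "192.0.2.1:9000") // example target
        if err != nil {
            log.Fatal(err)
        }

        pc := ipv4.NewPacketConn(conn)

        // Build a batch of MTU-sized datagrams (1472 = 1500 - 20 IP - 8 UDP),
        // staying under the ~1024-message sendmmsg limit mentioned above.
        const payloadSize = 1472
        msgs := make([]ipv4.Message, 512)
        for i := range msgs {
            msgs[i] = ipv4.Message{
                Buffers: [][]byte{make([]byte, payloadSize)},
                Addr:    dst,
            }
        }

        // One sendmmsg per WriteBatch call instead of one syscall per datagram.
        // It may send fewer messages than asked (and falls back to one at a
        // time on non-Linux platforms), so loop until the batch is done.
        for sent := 0; sent < len(msgs); {
            n, err := pc.WriteBatch(msgs[sent:], 0)
            if err != nil {
                log.Fatal(err)
            }
            sent += n
        }
    }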

Veserv | No.41895858
Why does 1 syscall per 1 GB versus 1 syscall per 1 MB have any meaningful performance cost?

Syscall overhead is only on the order of 100-1000 ns. Even at a blistering per-core memory bandwidth of 100 GB/s, just the single copy fundamentally needed to serialize 1 MB into network packets costs 10,000 ns.

The ~1,000 syscalls needed to transmit a 1 GB file would incur excess overhead of 1 ms versus 1 syscall per 1 GB.

That is at most a 10% overhead if the only thing your system call needs to do is copy the data. Put differently: the 10,000 ns it takes to copy 1 MB covers roughly 1,000 packets, meaning you get about 10 ns per packet to do all of your protocol segmentation and processing.

The benchmarks in the paper show that the total protocol execution time for a 1 GB file using TCP is 4 seconds. The syscall overhead for issuing 1,000 excess syscalls should thus be ~1/4000 or about 0.025% which is totally irrelevant.
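
Spelling that arithmetic out with the same rough numbers:

    1 GB sent in 1 MB writes        -> ~1,000 syscalls
    1,000 syscalls x ~1,000 ns      -> ~1 ms of syscall overhead
    copying 1 GB at 100 GB/s        -> ~10 ms, so at most ~10% on top of the copy
    against the 4 s TCP benchmark   -> 1 ms / 4,000 ms ≈ 0.025%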

The difference between the 4 second TCP number and the 8 second QUIC number can not be meaningfully traced back to excess syscalls if they were actually issuing max size sendmmsg calls. Hell, even if they did one syscall per packet that would still only account for a mere 1 second of the 4 second difference. It would be a stupid implementation for sure to have such unforced overhead, but even that would not be the actual cause of the performance discrepancy between TCP and QUIC in the produced benchmarks.