306 points carlos-menezes | 50 comments
1. cletus ◴[] No.41891721[source]
At Google, I worked on a pure JS Speedtest. At the time, Ookla was still Flash-based, so it wouldn't work on Chromebooks. That was a problem for installers who needed to verify an installation. I learned a lot about how TCP (I realize QUIC is UDP) responds to various factors.

I look at this article and consider the result pretty much as expected. Why? Because it pushes the flow control out of the kernel (and possibly network adapters) into userspace. TCP has flow control and sequencing. QUIC makes you manage that yourself (sort of).

Now there can be good reasons to do that. TCP congestion control is famously out-of-date with modern connection speeds, leading to newer algorithms like BBR [1], but it comes at a cost.

But here's my biggest takeaway from all that and it's something so rarely accounted for in network testing, testing Web applications and so on: latency.

Anyone who lives in Asia or Australia should relate to this. 100ms RTT latency can be devastating. It can take something that is completely responsive and make it utterly unusable. It limits the bandwidth a connection can support (because of the windows) and makes it less responsive to errors and congestion control efforts (both up and down).

I would strongly urge anyone testing a network or Web application to run tests where they randomly add 100ms to the latency [2].
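
If your stack is in Go, one low-tech way to get that into automated tests (a rough sketch; tc/netem as in [2] is the more faithful tool, and the names here are illustrative) is to wrap the connection:

  // rough sketch: a net.Conn wrapper that adds ~100ms before every read,
  // approximating a high-RTT path in tests
  package latencytest

  import (
      "net"
      "time"
  )

  type delayedConn struct {
      net.Conn
      delay time.Duration
  }

  func (c *delayedConn) Read(p []byte) (int, error) {
      time.Sleep(c.delay) // crude per-read delay, not a true propagation model
      return c.Conn.Read(p)
  }

  func DialSlow(addr string) (net.Conn, error) {
      c, err := net.Dial("tcp", addr)
      if err != nil {
          return nil, err
      }
      return &delayedConn{Conn: c, delay: 100 * time.Millisecond}, nil
  }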

My point in bringing this up is that the overhead of QUIC may not practically matter, because your effective bandwidth over a single TCP connection (or QUIC stream) may be MUCH lower than your actual raw bandwidth. Put another way, 45% extra data may still be a win, because managing your own congestion control might give you higher effective speed between the two parties.

[1]: https://atoonk.medium.com/tcp-bbr-exploring-tcp-congestion-c...

[2]: https://bencane.com/simulating-network-latency-for-testing-i...

replies(11): >>41891766 #>>41891768 #>>41891919 #>>41892102 #>>41892118 #>>41892276 #>>41892709 #>>41893658 #>>41893802 #>>41894376 #>>41894468 #
2. ec109685 ◴[] No.41891766[source]
For reasonably long downloads (so it has a chance to calibrate), why don't congestion algorithms increase the number of inflight packets to a high enough number that bandwidth is fully utilized even over high latency connections?

It seems like it should never be the case that two parallel downloads will perform better than a single one to the same host.

replies(4): >>41891861 #>>41891874 #>>41891957 #>>41892726 #
3. skissane ◴[] No.41891768[source]
> Because it pushes the flow control out of the kernel (and possibly network adapters) into userspace

That’s not an inherent property of the QUIC protocol, it is just an implementation decision - one that was very necessary for QUIC to get off the ground, but now it exists, maybe it should be revisited? There is no technical obstacle to implementing QUIC in the kernel, and if the performance benefits are significant, almost surely someone is going to do it sooner or later.

replies(3): >>41891946 #>>41891973 #>>41893160 #
4. Veserv ◴[] No.41891861[source]
You can in theory. You just need an accurate model of your available bandwidth and enough buffering/storage to avoid stalls while you wait for acknowledgement. It is, frankly, not even that hard to do it right. But in practice many implementations are terrible, so good luck.
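
Back of the envelope for what "enough buffering" means (assuming a 100 Mbit/s path with 100 ms RTT):

  // bandwidth-delay product: how much data must be in flight (and buffered,
  // in case it needs retransmitting) to keep the pipe full
  package main

  import "fmt"

  func main() {
      bandwidth := 100e6 / 8 // 100 Mbit/s, in bytes per second
      rtt := 0.100           // 100 ms round trip
      bdp := bandwidth * rtt // bytes unacknowledged at any instant
      fmt.Printf("BDP = %.2f MB\n", bdp/1e6) // 1.25 MB
  }
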
5. gmueckl ◴[] No.41891874[source]
Larger windows can reduce the maximum number of simultaneous connections on the sender side.
6. api ◴[] No.41891919[source]
A major problem with TCP is that the limitations of the kernel network stack and sometimes port allocation place absurd artificial limits on the number of active connections. A modern big server should be able to have tens of millions of open TCP connections at least, but to do that well you have to do hacks like running a bunch of pointless VMs.
replies(1): >>41892527 #
7. ants_everywhere ◴[] No.41891946[source]
Is this something you could use ebpf for?
8. dan-robertson ◴[] No.41891957[source]
There are two places a packet can be ‘in-flight’. One is light travelling down cables (or the electrical equivalent) or in memory being processed by some hardware like a switch, and the other is sat in a buffer in some networking appliance because the downstream connection is busy (eg sending packets that are further up the queue, at a slower rate than they arrive). If you just increase bandwidth it is easy to get lots of in-flight packets in the second state which increases latency (admittedly that doesn’t matter so much for long downloads) and the chance of packet loss from overly full buffers.

CUBIC tries to increase bandwidth until it hits packet loss, then cuts bandwidth (to drain buffers a bit) and ramps up and hangs around close to the rate that led to loss, before it tries sending at a higher rate and filling up buffers again. Cubic is very sensitive to packet loss, which makes things particularly difficult on very high bandwidth links with moderate latency as you need very low rates of (non-congestion-related) loss to get that bandwidth.

BBR tries to do the thing you describe while also modelling buffers and trying to keep them empty. It goes through a cycle of sending at the estimated bandwidth, sending at a lower rate to see if buffers got full, and sending at a higher rate to see if that’s possible, and the second step can be somewhat harmful if you don’t need the advantages of BBR.
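
(On Linux the congestion controller is selectable per socket, so it's easy to compare the two on the same box. A sketch, assuming the tcp_bbr module is available and example.com stands in for a real peer:)

  // sketch: ask the kernel to use BBR for one TCP socket (Linux-only)
  package main

  import (
      "net"

      "golang.org/x/sys/unix"
  )

  func main() {
      conn, err := net.Dial("tcp", "example.com:443") // hypothetical peer
      if err != nil {
          panic(err)
      }
      raw, _ := conn.(*net.TCPConn).SyscallConn()
      raw.Control(func(fd uintptr) {
          _ = unix.SetsockoptString(int(fd), unix.IPPROTO_TCP, unix.TCP_CONGESTION, "bbr")
      })
  }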

I think the main thing that tends to prevent the thing you talk about is flow control rather than congestion control. In particular, the sender needs a sufficiently large send buffer to store all unacked data (which can be a lot due to various kinds of ack-delaying) in case it needs to resend packets, and if you need to resend some then your send buffer would need to be twice as large to keep going. On the receive side, you need big enough buffers to be able to fill up those buffers from the network while waiting for an earlier packet to be retransmitted.

On a high-latency fast connection, those buffers need to be big to get full bandwidth, and that requires (a) growing a lot, which can take a lot of round-trips, and (b) being allowed by the operating system to grow big enough.
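
Concretely, the "allowed by the operating system to grow big enough" part looks roughly like this from the application side (a sketch; the kernel still clamps these to net.core.rmem_max / wmem_max, and setting them explicitly disables Linux's receive-buffer autotuning):

  // sketch: sizing per-socket buffers toward the bandwidth-delay product
  package main

  import "net"

  func main() {
      conn, err := net.Dial("tcp", "example.com:443") // hypothetical peer
      if err != nil {
          panic(err)
      }
      tcp := conn.(*net.TCPConn)
      _ = tcp.SetReadBuffer(4 << 20)  // 4 MiB; clamped by net.core.rmem_max
      _ = tcp.SetWriteBuffer(4 << 20) // 4 MiB; clamped by net.core.wmem_max
  }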

9. lttlrck ◴[] No.41891973[source]
For Linux that's true. But Microsoft never added SCTP to Windows; not being beholden to Microsoft and older OSes must have been part of the calculus?
replies(2): >>41892046 #>>41892802 #
10. skissane ◴[] No.41892046{3}[source]
> But Microsoft never added SCTP to Windows

Windows already has an in-kernel QUIC implementation (msquic.sys), used for SMB/CIFS and in-kernel HTTP. I don’t think it is accessible from user-space - I believe user-space code uses a separate copy of the same QUIC stack that runs in user-space (msquic.dll), but there is no reason in-principle why Microsoft couldn’t expose the kernel-mode implementation to user space

11. klabb3 ◴[] No.41892102[source]
I did a bunch of real world testing of my file transfer app[1]. Went in with the expectation that Quic would be amazing. Came out frustrated for many reasons and switched back to TCP. It’s obvious in hindsight, but with TCP you say “hey kernel send this giant buffer please” whereas UDP is packet switched! So even pushing zeroes has a massive CPU cost on most OSs and consumer hardware, from all the mode switches. Yes, there are ways around it but no they’re not easy nor ready in my experience. Plus it limits your choice of languages/libraries/platforms.
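
Roughly the difference I mean, as a Go-flavoured sketch (numbers illustrative, and the UDP side assumes a connected socket):

  package sketch

  import "net"

  // with TCP the kernel does the segmentation: one call, one mode switch
  func sendTCP(c *net.TCPConn, buf []byte) error {
      _, err := c.Write(buf) // hand over e.g. 4 MB in one go
      return err
  }

  // with a plain UDP socket, every datagram is its own syscall
  func sendUDP(c *net.UDPConn, buf []byte) error {
      const payload = 1200 // stay under typical path MTU
      for off := 0; off < len(buf); off += payload {
          end := off + payload
          if end > len(buf) {
              end = len(buf)
          }
          if _, err := c.Write(buf[off:end]); err != nil { // one mode switch per packet
              return err
          }
      }
      return nil
  }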

(Fun bonus story: I noticed significant drops in throughput when using battery on a MacBook. Something to do with the efficiency cores I assume.)

Secondly, QUIC does congestion control poorly (I was using quic-go, so mileage may vary). No tuning really helped, and TCP streams would take more bandwidth if both were present.

Third, the APIs are weird, man. QUIC itself has multiple streams, which makes it not a drop-in replacement for TCP. However, the idea is to have HTTP/3 be drop-in replaceable at a higher level (which I can't speak to, because I didn't do that). But it's worth keeping in mind if you're working at the stream level.

In conclusion I came out pretty much defeated, but also with a newfound respect for all the optimizations and resilience of our old friend TCP. It's really an amazing piece of tech. And it's just there, for free, always provided by the OS. Even some of the main issues with TCP are not design faults but conservative/legacy defaults (buffer limits on Linux, Nagle, etc.). I really just wish we could improve it instead of reinventing the wheel.
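
(For reference, the per-socket knobs for those defaults, as a sketch; note Go already disables Nagle by default:)

  package sketch

  import "net"

  func tune(c *net.TCPConn) {
      _ = c.SetNoDelay(true)        // disable Nagle; Go's default is already true
      _ = c.SetWriteBuffer(4 << 20) // lift the send-buffer cap for high-BDP paths
  }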

[1]: https://payload.app/

replies(2): >>41892805 #>>41893050 #
12. pests ◴[] No.41892118[source]
The Network tab in the Chrome console allows you to degrade your connection. There are presets for Slow/Fast 4G and 3G, or you can make a custom preset where you specify download and upload speeds, latency in ms, a packet loss percentage, a packet queue length, and can enable packet reordering.
replies(2): >>41892287 #>>41894505 #
13. reshlo ◴[] No.41892276[source]
> Anyone who lives in Asia or Australia should relate to this. 100ms RTT latency can be devastating.

When I used to (try to) play online games in NZ a few years ago, RTT to US West servers sometimes exceeded 200ms.

replies(2): >>41892498 #>>41893624 #
14. lelandfe ◴[] No.41892287[source]
There's also an old macOS preference pane called Network Link Conditioner that makes the connections more realistic: https://nshipster.com/network-link-conditioner/

IIRC, Chrome's network simulation just applies a delay after a connection is established

replies(1): >>41893107 #
15. indrora ◴[] No.41892498[source]
When I was younger, I played a lot of cs1.6 and hldm. Living in rural New Mexico, my ping times were often 150-250ms.

DSL kills.

replies(1): >>41893340 #
16. toast0 ◴[] No.41892527[source]
> A modern big server should be able to have tens of millions of open TCP connections at least, but to do that well you have to do hacks like running a bunch of pointless VMs.

Inbound connections? You don't need to do anything other than make sure your fd limit is high, and maybe don't be IPv4-only while having too many users behind the same CGNAT.

Outbound connections is harder, but hopefully you don't need millions of connections to the same destination, or if you do, hopefully they support ipv6.

When I ran millions of connections through HAProxy (bare TCP proxy, just some peeking to determine the upstream), I had to do a bunch of work to make it scale, but not because of port limits.
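
The fd-limit part is the easy bit, e.g. (a sketch; the hard limit itself still has to be raised via limits.conf or the systemd unit):

  // sketch: raise this process's soft fd limit up to the hard limit
  package main

  import (
      "fmt"
      "syscall"
  )

  func main() {
      var lim syscall.Rlimit
      if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
          panic(err)
      }
      lim.Cur = lim.Max // soft limit up to whatever the hard limit allows
      if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
          panic(err)
      }
      fmt.Printf("fd limit now %d\n", lim.Cur)
  }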

17. bdd8f1df777b ◴[] No.41892709[source]
As a Chinese whose latency to servers outside China often exceeds 300ms, I'm a staunch supporter of QUIC. The difference is night and day.
18. toast0 ◴[] No.41892726[source]
I've run a big webserver that served decent-sized APK/other app downloads (and a bunch of small files and whatnot). I had to set the maximum outgoing window to keep the overall memory within limits.

IIRC, servers were 64GB of RAM and sendbufs were capped at 2MB. I was also dealing with a kernel deficiency that would leave the sendbuf allocated if the client disappeared in LAST_ACK. (This stems from a deficiency in the state description in the 1981 RFC, written before my birth.)

replies(1): >>41894769 #
19. astrange ◴[] No.41892802{3}[source]
No one ever uses SCTP. It's pretty unclear to me why any OSes do include it; free OSes seem to like junk drawers of network protocols even though they add to the security surface in kernel land.
replies(5): >>41892937 #>>41892986 #>>41893372 #>>41893981 #>>41895474 #
20. astrange ◴[] No.41892805[source]
> (Fun bonus story: I noticed significant drops in throughput when using battery on a MacBook. Something to do with the efficiency cores I assume.)

That sounds like the thread priority/QoS was incorrect, but it could be WiFi or something.

21. kelnos ◴[] No.41892937{4}[source]
Does anyone even build SCTP support directly into the kernel? Looks like Debian builds it as a module, which I'm sure I never have and never will load. Security risk seems pretty minimal there.

(And if someone can somehow coerce me into loading it, I have bigger problems.)

replies(2): >>41893439 #>>41895319 #
22. supriyo-biswas ◴[] No.41892986{4}[source]
The telecom sector uses SCTP in lots of places.
23. eptcyka ◴[] No.41893050[source]
One does not need to send and should not send one packet per syscall.
replies(3): >>41894327 #>>41894736 #>>41895201 #
24. mh- ◴[] No.41893107{3}[source]
I don't remember the details offhand, but yes - unless Chrome's network simulation has been rewritten in the last few years, it doesn't do a good job of approximating real world network conditions.

It's a lot better than nothing, and doing it realistically would be a lot more work than what they've done, so I say this with all due respect to those who worked on it.

25. conradev ◴[] No.41893160[source]
Looks like it’s being worked on: https://lwn.net/Articles/989623/
replies(1): >>41896868 #
26. somat ◴[] No.41893340{3}[source]
I used to play netquake (not quakeworld) at up to 800 ms lag; past that was too much for even young, stupid me.

For those that don't know the difference: netquake was the original strict client-server version of Quake. You hit the forward key, it sends that to the server, and the server then sends back where you moved. Quakeworld was the client-side prediction enhancement that came later: you hit forward, the client moves you forward and sends it to the server at the same time, and if there are differences they get reconciled later.

For the most part client-side prediction feels better to play. However, when there are network problems or large amounts of lag, a lot of artifacts start to show up: rubberbanding, jumping around, hits that don't register. Pure client-server feels worse, everything gets sluggish and mushy, but movement is a little more predictable and logical and can sort of be anticipated.

I have not played quake in 20 years but one thing I remember is at past 800ms of lag the lava felt magnetic, it would just suck you in, every time.

27. spookie ◴[] No.41893372{4}[source]
And most of those protocols can be disabled under sysctl.conf.
28. jeroenhd ◴[] No.41893439{5}[source]
Linux and FreeBSD have had it for ages. Anything industrial too. Solaris, QNX, Cisco IOS.

SCTP is essential for certain older telco protocols, and it was added to some of the protocols developed for LTE. End users probably don't use it much, but the hardware their connections go through will speak SCTP at some level.

29. albertopv ◴[] No.41893624[source]
I would be surprised if online games use TCP. Anyway, physics is still there, and light is fast, but only so fast: in 10ms it travels about 3,000km in a vacuum, and NZ to the US west coast is about 11,000km, so a round trip under 60ms is impossible. Cables are probably much longer, light is slower in a medium, add network device latency, and 200ms from NZ to the USA is not that bad.
replies(2): >>41894331 #>>41901615 #
30. attentive ◴[] No.41893658[source]
> I look at this article and consider the result pretty much as expected. Why? Because it pushes the flow control out of the kernel (and possibly network adapters) into userspace. TCP has flow control and sequencing. QUIC makes you manage that yourself (sort of).

This implies that user space is slow. Yet some (most?) of the fastest high-performance TCP/IP stacks are implemented in user space.

replies(2): >>41893816 #>>41893862 #
31. pzmarzly ◴[] No.41893802[source]
> I look at this article and consider the result pretty much as expected. Why? Because it pushes the flow control out of the kernel (and possibly network adapters) into userspace. TCP has flow control and sequencing. QUIC makes you manage that yourself (sort of).

I truly hope the QUIC in Linux Kernel project [0] succeeds. I'm not looking forward to linking big HTTP/3 libraries to all applications.

[0] https://github.com/lxin/quic

replies(1): >>41896924 #
32. WesolyKubeczek ◴[] No.41893816[source]
You have to jump contexts for every datagram, and you cannot offload checksumming to the network hardware.
33. formerly_proven ◴[] No.41893862[source]
That's true if the entire stack is in user mode and talks directly to the NIC, with no kernel involvement beyond setup. This isn't the case with QUIC; it uses the normal sockets API to send/recv UDP.
34. lstodd ◴[] No.41893981{4}[source]
4g/LTE runs on it. So you use it too, via your phone.
replies(1): >>41894177 #
35. astrange ◴[] No.41894177{5}[source]
Huh, didn't know that. But iOS doesn't support it, so it's not needed on the AP side even for wifi calling.
36. jacobgorm ◴[] No.41894327{3}[source]
On platforms like macOS that don’t have UDP packet pacing you more or less have to.
37. Hikikomori ◴[] No.41894331{3}[source]
Speed of light in fiber is about 200,000 km/s. Most of the latency is because of distance; modern routers have a forwarding latency of tens of microseconds, and some switches can start sending out a packet before fully receiving it.
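
Quick arithmetic for the NZ to US west coast case, using the figures above:

  // minimum theoretical RTT over ~11,000 km of fiber, ignoring equipment;
  // the real cable route via Hawaii is longer (~12,135 km), so real RTTs are higher
  package main

  import "fmt"

  func main() {
      distanceKm := 11000.0 // NZ to US west coast, rough figure
      kmPerMs := 200.0      // light in fiber: ~200,000 km/s
      oneWay := distanceKm / kmPerMs
      fmt.Printf("one way = %.0f ms, RTT = %.0f ms\n", oneWay, 2*oneWay) // 55 / 110 ms
  }
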
38. superjan ◴[] No.41894376[source]
As an alternative to simulating latency: how about using a VPN service to test your website via Australia? I suppose that when it is easier to do, it is more likely that people will actually run this test.
replies(1): >>41894443 #
39. sokoloff ◴[] No.41894443[source]
That's going to give you double (plus a bit) the latency that your users in Australia will experience.
replies(1): >>41894545 #
40. Tade0 ◴[] No.41894468[source]
I've been tasked with improving a system where a lot of the events relied on timing to be just right, so now I routinely click around the app with a 900ms delay, as that's the most that I can get away with without having the hot-reloading system complain.

Plenty of assumptions break down in such an environment and part of my work is to ensure that the user always knows that the app is really doing something and not just being unresponsive.

41. youngtaff ◴[] No.41894505[source]
Chrome's network emulation is a pretty poor simulation of the real world… it throttles on a per-request basis, so it can't simulate congestion due to multiple requests in flight at the same time.

Really need something like ipfw, dummynet, tc etc to do it at the packet level

42. codetrotter ◴[] No.41894545{3}[source]
Rent a VPS or physical server in Australia. Then you will have approx the same latency accessing that dev server, that the Australians have reaching servers in your country.
43. tomohawk ◴[] No.41894736{3}[source]
On Linux, there is sendmmsg, which can send up to 1024 packets per call, but that is a far cry from a single syscall to send a 1GB file. With GSO, it is possible to send even more datagrams per call, but the absolute limit is 64KB * 1024 per syscall, and it is fiddly to pack datagrams so that this works correctly.

You might think you can send datagrams of up to 64KB, but due to limitations in how IP fragment reassembly works, you really must do your best to not allow IP fragmentation to occur, so 1472 is the largest in most circumstances.
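
For the Go folks, golang.org/x/net/ipv4 exposes the batching as WriteBatch (it uses sendmmsg on Linux). A rough sketch, with a hypothetical receiver address:

  // sketch: sending many small datagrams with one sendmmsg call
  package main

  import (
      "net"

      "golang.org/x/net/ipv4"
  )

  func main() {
      dst, err := net.ResolveUDPAddr("udp", "192.0.2.1:9000") // hypothetical receiver
      if err != nil {
          panic(err)
      }
      sock, err := net.ListenUDP("udp", nil)
      if err != nil {
          panic(err)
      }
      pc := ipv4.NewPacketConn(sock)

      payload := make([]byte, 1200) // stay well below 1472 to avoid IP fragmentation
      msgs := make([]ipv4.Message, 64)
      for i := range msgs {
          msgs[i] = ipv4.Message{Buffers: [][]byte{payload}, Addr: dst}
      }
      // one syscall, up to len(msgs) datagrams; returns how many were sent
      if _, err := pc.WriteBatch(msgs, 0); err != nil {
          panic(err)
      }
  }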

replies(1): >>41895858 #
44. dan-robertson ◴[] No.41894769{3}[source]
I wonder if there's some way to reduce this server-side memory requirement. I thought that was part of the point of sendfile, but I might be mistaken. Unfortunately sendfile isn't so suitable nowadays because of TLS. But maybe if you could do TLS offload and then sendfile, an OS could get away with less memory for sendbufs.
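
For plaintext that path already exists; in Go, for example, io.Copy from a file to a TCP socket goes through sendfile(2), so the payload never has to sit in a user-space buffer (a sketch). It's TLS that pulls it back into user space unless kTLS offload is in play:

  // sketch: file serving via the kernel's sendfile path
  package sketch

  import (
      "io"
      "net"
      "os"
  )

  func serveFile(c *net.TCPConn, path string) error {
      f, err := os.Open(path)
      if err != nil {
          return err
      }
      defer f.Close()
      // TCPConn.ReadFrom uses sendfile when the source is a regular file
      _, err = io.Copy(c, f)
      return err
  }
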
45. intelVISA ◴[] No.41895201{3}[source]
Anyone pushing packets seriously doesn't even use syscalls...
46. rjsw ◴[] No.41895319{5}[source]
I added it to NetBSD and build it into my kernels, it isn't enabled by default though.

Am part way through adding NAT support for it to the firewall.

47. j1elo ◴[] No.41895474{4}[source]
SCTP is exactly how you establish a data communication link with the very modern WebRTC protocol stack (where it is rebranded as "WebRTC Data Channels"). Granted, it is SCTP-over-UDP. But still.

So yes, SCTP is, under the covers, getting a lot more use than it seems, still today. However, WebRTC implementations usually bring their own userspace SCTP library, so they don't depend on the one from the OS.

48. Veserv ◴[] No.41895858{4}[source]
Why does 1 syscall per 1 GB versus 1 syscall per 1 MB have any meaningful performance cost?

syscall overhead is only on the order of 100-1000 ns. Even at a blistering per core memory bandwidth of 100 GB/s, just the single copy fundamentally needed to serialize 1 MB into network packets costs 10,000 ns.

The ~1,000 syscalls needed to transmit a 1 GB file would incur excess overhead of 1 ms versus 1 syscall per 1 GB.

That is at most a 10% overhead if the only thing your system call needs to do is copy the data. As in it takes 10,000 ns total to transmit 1,000 packets meaning you get 10 ns per packet to do all of your protocol segmentation and processing.

The benchmarks in the paper show that the total protocol execution time for a 1 GB file using TCP is 4 seconds. The syscall overhead for issuing 1,000 excess syscalls should thus be ~1/4000 or about 0.025% which is totally irrelevant.

The difference between the 4 second TCP number and the 8 second QUIC number can not be meaningfully traced back to excess syscalls if they were actually issuing max size sendmmsg calls. Hell, even if they did one syscall per packet that would still only account for a mere 1 second of the 4 second difference. It would be a stupid implementation for sure to have such unforced overhead, but even that would not be the actual cause of the performance discrepancy between TCP and QUIC in the produced benchmarks.
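
(The same arithmetic, spelled out:)

  package main

  import "fmt"

  func main() {
      const syscallNs = 1000.0        // generous per-syscall cost
      const bytesPerNs = 100.0        // 100 GB/s per-core copy bandwidth
      copyNsPerMB := 1e6 / bytesPerNs // 10,000 ns to copy 1 MB
      excessNs := 1000 * syscallNs    // 1,000 extra syscalls for a 1 GB file = 1 ms
      fmt.Printf("per-MB syscall overhead: %.0f%%\n", 100*syscallNs/copyNsPerMB) // 10%
      fmt.Printf("vs the 4 s TCP transfer: %.3f%%\n", 100*excessNs/4e9)          // 0.025%
  }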

49. throawayonthe ◴[] No.41896868{3}[source]
Also looks like current QUIC performance issues are a consideration; they are tested in section 4:

> The performance gap between QUIC and kTLS may be attributed to:

  - The absence of Generic Segmentation Offload (GSO) for QUIC.
  - An additional data copy on the transmission (TX) path.
  - Extra encryption required for header protection in QUIC.
  - A longer header length for the stream data in QUIC.
50. reshlo ◴[] No.41901615{3}[source]
The total length of the relevant sections of the Southern Cross Cable is 12,135km, as it goes via Hawaii.

The main reason I made my original comment was to point out that the real numbers are more than double what the other commenter called “devastating” latency.

https://en.wikipedia.org/wiki/Southern_Cross_Cable