How to escape the Linux networking stack

(blog.cloudflare.com)

notepad0x90 ◴[17 Nov 25 20:18 UTC] No.45957801[source]▶

I'm slightly surprised Cloudflare isn't using a userspace TCP/IP stack already (faster: fewer context switches and copies). It's the type of company I'd expect to actually need one.

replies(2): >>45958128 #>>45959181 #

Droobfest ◴[17 Nov 25 20:51 UTC] No.45958128[source]▶

>>45957801 #

From 2016: https://blog.cloudflare.com/why-we-use-the-linux-kernels-tcp...

replies(1): >>45958213 #

1. notepad0x90 ◴[17 Nov 25 20:59 UTC] No.45958213[source]▶

>>45958128 #

Nice, they know better. But it also makes me wonder, because they're asking "but what if you need to run another app". For things like load balancers, I'd expect you'd run only one app per server on the data plane, with the userspace stack handling that, while the OS and its services use a separate control-plane NIC with the kernel stack, so that boxes stay reachable even under link saturation, DDoS, etc.

It also makes me wonder, why is tcp/ip special? The kernel should expose a raw network device. I get physical or layer 2 configuration happening in the kernel, but if it is supposed to do IP, then why stop there, why not TLS as well? Why run a complex network protocol stack in the kernel when you can just expose a configured layer 2 device to a user space process? It sounds like a "that's just the way it's always been done" type of scenario.

replies(3): >>45958565 #>>45959377 #>>45960224 #

2. wmf ◴[17 Nov 25 21:31 UTC] No.45958565[source]▶

>>45958213 (TP) #

AFAIK Cloudflare runs their whole stack on every machine. I guess that gives them flexibility and maybe better load balancing. They also seem to use only one NIC.

> why is tcp/ip special? The kernel should expose a raw network device. ... Why run a complex network protocol stack in the kernel when you can just expose a configured layer 2 device to a user space process?

Check out the MIT Exokernel project and Solarflare OpenOnload that used this approach. It never really caught on because the old school way is good enough for almost everyone.

> why stop there, why not TLS as well?

kTLS is a thing now (mostly used by Netflix). Back in the day we also had kernel-mode Web servers to save every cycle.

replies(1): >>45960187 #

3. rcxdude ◴[17 Nov 25 22:55 UTC] No.45959377[source]▶

>>45958213 (TP) #

You can do that if you want, but I think part of why tcp/ip is a useful layer of abstraction is that it allows more robust boundaries between applications that may be running on the same machine. If you're just at layer 2, you are basically acting on behalf of the whole box.

4. bbarnett ◴[18 Nov 25 00:51 UTC] No.45960187[source]▶

>>45958565 #

Was it Tux? I've only used it, a looong time ago, on load balancers.

https://en.wikipedia.org/wiki/TUX_web_server

5. hansvm ◴[18 Nov 25 00:57 UTC] No.45960224[source]▶

>>45958213 (TP) #

TCP/IP is, in theory (AFAIK all experiments related to this fizzled out a decade or two ago), a global resource when you start factoring in congestion control. TLS is less obviously something you would want kernel involvement from, give or take the idea of outsourcing crypto to the kernel or some small efficiency gains for some workloads by skipping userspace handoffs, with more gains possible with NIC support.

replies(2): >>45960346 #>>45961233 #

6. Veserv ◴[18 Nov 25 01:22 UTC] No.45960346[source]▶

>>45960224 #

You do want to offload crypto to dedicated hardware; otherwise your transport will get stuck at a paltry 40-50 Gb/s per core. However, you do not need more than block decryption; you can leave all of the crypto protocol management in userspace with no material performance impact.

7. notepad0x90 ◴[18 Nov 25 04:00 UTC] No.45961233[source]▶

>>45960224 #

Why can't it be global and user space? DNS resolution, for example, is done by user space, and it is global.

replies(1): >>45962879 #

8. 1718627440 ◴[18 Nov 25 08:53 UTC] No.45962879{3}[source]▶

>>45961233 #

DNS isn't a shared resource that needs to be managed and distributed fairly among programs that neither trust nor cooperate with each other.

replies(1): >>45969650 #

9. notepad0x90 ◴[18 Nov 25 17:52 UTC] No.45969650{4}[source]▶

>>45962879 #

DNS resolution is a shared resource. The DNS client is typically a user-space OS service that resolves and caches DNS requests. What is resolved by one application is cached and reused by another. But at the app level, there is no deconflicting happening like there is with transport-layer protocols. However, the same can be said about IP: IP addresses, like name servers, are configured system-wide and shared by all apps.

replies(1): >>45969780 #

10. 1718627440 ◴[18 Nov 25 18:01 UTC] No.45969780{5}[source]▶

>>45969650 #

It can be shared access to a cache, but this is an implementation detail for performance reasons. There is no problem with having different processes resolve DNS with different code. There is a problem if two processes want to control the same IP address, or manage the same TCP port.
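A minimal sketch of the port conflict, assuming Python 3 and a loopback interface (the function name `second_bind_errno` is made up for illustration). With default socket options, the kernel tracks which socket owns a port and refuses a second bind to the same address and port:

```python
import errno
import socket
from typing import Optional


def second_bind_errno() -> Optional[int]:
    """Bind one TCP socket to a free loopback port, then try to bind a
    second socket to the same port. Returns the errno of the failed
    second bind, or None if the bind unexpectedly succeeded."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind(("127.0.0.1", port))
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    err = second_bind_errno()
    print(errno.errorcode.get(err, "no error"))
```

The same refusal would happen if the two sockets lived in two unrelated processes, which is the arbitration being described here.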

replies(1): >>45974584 #

11. notepad0x90 ◴[19 Nov 25 01:06 UTC] No.45974584{6}[source]▶

>>45969780 #

Yeah, but there is still no reason why an "ip_stack" process can't ensure a different IP isn't used and a "gnu_tcp" or whatever process can't ensure TCP ports are assigned to only one calling process. An exclusive lock on the raw layer 2 device is what you're looking for, I think. I mean, right now applications can just open a raw socket and use a conflicting TCP port. I've done that to kill TCP connections matching some criteria by sending the remote end an RST pretending to be the real process (legit use case). Which approach is more performant, secure, and resilient? That's what I'm asking here.
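As a rough sketch of the bookkeeping such a port-assigning process would need (everything here is hypothetical: the `PortArbiter` class and its `claim`/`release` methods are illustrative names, not an existing API):

```python
import threading
from typing import Dict


class PortArbiter:
    """Hypothetical userspace arbiter: each TCP port has at most one owner."""

    def __init__(self) -> None:
        self._owners: Dict[int, str] = {}
        self._lock = threading.Lock()

    def claim(self, port: int, owner: str) -> bool:
        """Return True if `owner` holds `port` after the call,
        False if a different owner already holds it."""
        if not 0 < port < 65536:
            raise ValueError(f"invalid TCP port: {port}")
        with self._lock:
            # setdefault records the owner only if the port is unclaimed
            holder = self._owners.setdefault(port, owner)
            return holder == owner

    def release(self, port: int, owner: str) -> None:
        """Free `port`, but only if `owner` is the one holding it."""
        with self._lock:
            if self._owners.get(port) == owner:
                del self._owners[port]


if __name__ == "__main__":
    arbiter = PortArbiter()
    print(arbiter.claim(443, "proxy"))    # first claimant
    print(arbiter.claim(443, "scanner"))  # conflicting claimant
```

The table itself is the easy part. It only enforces anything if the arbiter is the sole process that can reach the layer 2 device, which is where the exclusive lock comes in.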
