How to escape the Linux networking stack

(blog.cloudflare.com)

148 points meysamazad | 2 comments | 17 Nov 25 15:49 UTC

notepad0x90 ◴[17 Nov 25 20:18 UTC] No.45957801[source]▶

I'm slightly surprised Cloudflare isn't using a userspace TCP/IP stack already (faster - less context switches and copies). It's the type of company I'd expect to actually need one.

replies(2): >>45958128 #>>45959181 #

nomel ◴[17 Nov 25 22:33 UTC] No.45959181[source]▶

>>45957801 #

> faster - less context switches and copies

Aren't both avoidable these days with the async-style and zero-copy interfaces that are now available (like io_uring, where the work is still handled by the kernel), along with the near non-existence of single-core processors?
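For illustration, here is a minimal sketch of a kernel-handled zero-copy send. Python's standard library has no io_uring binding, so this swaps in the older sendfile(2) path via `socket.sendfile()`, which shows the same idea: file data reaches the socket without passing through a userspace buffer. The helper name `send_file_zero_copy` and the socketpair setup are assumptions made for the example.

```python
import socket
import tempfile


def send_file_zero_copy(payload: bytes):
    """Send a file's contents over a socket without a userspace copy loop."""
    a, b = socket.socketpair()  # local stand-in for a real TCP connection
    with a, b, tempfile.TemporaryFile() as f:
        f.write(payload)
        f.flush()
        f.seek(0)
        # socket.sendfile() uses os.sendfile() where the platform supports
        # it and falls back to plain send() otherwise.
        sent = a.sendfile(f)
        a.shutdown(socket.SHUT_WR)  # signal end of stream to the reader
        received = bytearray()
        while chunk := b.recv(65536):
            received += chunk
    return sent, bytes(received)


if __name__ == "__main__":
    sent, data = send_file_zero_copy(b"hello" * 1000)
    print(sent, len(data))
```

The payload is kept small so it fits in the socket buffer before anything reads it. This path only helps when the bytes can go out unmodified.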

replies(1): >>45962541 #

1. majke ◴[18 Nov 25 08:03 UTC] No.45962541[source]▶

>>45959181 #

> > faster - less context switches and copies

This is very much a newbie way of thinking. How do you know? Did you profile it?

It turns out there is surprisingly little dumb zero-copy potential at CF. Most of the traffic is TLS, so the data needs to go through userspace anyway (kTLS exists, but I failed to actually use it, and what about QUIC?).

Most of the cpu is burned on dumb things, like application logic. Turns out data copying and encryption and compression are actually pretty fast. I'm not saying these areas aren't ripe for optimization - but the majority of the cost was historically in much more obvious areas.
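As a rough sketch of what "profile it" can look like, the snippet below uses `cProfile` to attribute cumulative time to copying, compression, and a stand-in for application logic. Everything here is a toy assumption: the function names, the payload, and the byte-counting "logic" are hypothetical and say nothing about Cloudflare's actual workload.

```python
import cProfile
import pstats
import zlib


def copy_payload(buf: bytes) -> bytes:
    return bytes(bytearray(buf))  # two explicit copies of the data


def compress_payload(buf: bytes) -> bytes:
    return zlib.compress(buf, 1)


def application_logic(buf: bytes) -> dict:
    # Hypothetical stand-in for per-request logic: a pure-Python byte count.
    counts = {}
    for byte in buf:
        counts[byte] = counts.get(byte, 0) + 1
    return counts


def handle_request(buf: bytes) -> dict:
    copied = copy_payload(buf)
    compress_payload(copied)
    return application_logic(copied)


def profile_requests(requests: int = 20, size: int = 64 * 1024) -> dict:
    """Return cumulative seconds spent in each stage, keyed by function name."""
    payload = bytes(range(256)) * (size // 256)
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(requests):
        handle_request(payload)
    profiler.disable()
    wanted = {"copy_payload", "compress_payload", "application_logic"}
    stats = pstats.Stats(profiler).stats
    # Keys are (file, line, name); values are (cc, nc, tottime, cumtime, callers).
    return {
        name: cumtime
        for (_, _, name), (_, _, _, cumtime, _) in stats.items()
        if name in wanted
    }


if __name__ == "__main__":
    for name, seconds in sorted(profile_requests().items(), key=lambda kv: -kv[1]):
        print(f"{name:20s} {seconds:.4f}s")
```

The numbers depend entirely on the workload, which is the point: measure on the real system before deciding where the time goes.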

replies(1): >>45969802 #

2. notepad0x90 ◴[18 Nov 25 18:03 UTC] No.45969802[source]▶

>>45962541 (TP) #

> This is very much a newbie way of thinking. How do you know? Did you profile it?

Does it matter? Fewer syscalls is better. Whatever is done in kernel mode can be replicated (or improved upon much more) in a user-space stack. It is easier to add and manage APIs in user space than in the kernel. You can debug, patch, etc. a user-space stack much more easily. You can run multiple processes for redundancy and ensure crashes don't take out the whole system. I've had situations where rebooting the system was the only fix for routing or ARP resolution issues (even after clearing caches). Same with netfilter/iptables "being stuck" or degrading in performance over time. If you're lucky, a module reload can fix it; if it were a process, I could have just killed and restarted it with minimal disruption.
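The "just kill and restart the process" argument can be sketched with a tiny supervisor loop. This is only an illustration under stated assumptions: the `WORKER` script is a hypothetical stand-in for a user-space network component, and it kills itself with SIGKILL on its first run to simulate a crash.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical worker: dies hard on its first run, succeeds after a restart.
WORKER = """
import os, signal, sys
marker = sys.argv[1]
if not os.path.exists(marker):
    open(marker, "w").close()
    os.kill(os.getpid(), signal.SIGKILL)  # simulate a crash
"""


def supervise(cmd, max_restarts=3):
    """Run cmd, restarting it after a failure; return the exit codes seen."""
    exit_codes = []
    for _ in range(max_restarts + 1):
        code = subprocess.run(cmd).returncode
        exit_codes.append(code)
        if code == 0:  # clean exit, nothing left to restart
            break
    return exit_codes


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "crashed-once")
        return supervise([sys.executable, "-c", WORKER, marker])


if __name__ == "__main__":
    print(demo())
```

On Linux a process killed by a signal reports a negative return code, so the supervisor sees one failed run followed by a clean one while the rest of the system keeps running.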

> Most of the cpu is burned on dumb things, like application logic. Turns out data copying and encryption and compression are actually pretty fast. I'm not saying these areas aren't ripe for optimization - but the majority of the cost was historically in much more obvious areas.

I won't disagree with that, but one optimization does not preclude the other. If TCP/IP were in user space, engineers could optimize it better to fit their use cases. The type of load matters too: you can optimize your app well, but one corner case could tie up your app logic in CPU cycles. If that corner case happens to include a syscall, and there is no better way to handle it, those context-switch cycles might start to matter.
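To make the syscall-count concern concrete, here is a small sketch that sends the same data either one chunk per call or as a single scatter/gather `sendmsg()`. The call counters are Python-level counts used as a proxy for syscalls, and the function names and chunk sizes are assumptions for the example, not anything from the article.

```python
import socket


def send_one_by_one(sock, chunks):
    """One send call (at least one syscall) per chunk."""
    calls = 0
    for chunk in chunks:
        sock.sendall(chunk)
        calls += 1
    return calls


def send_batched(sock, chunks):
    """One sendmsg(2) carrying every chunk as a scatter/gather list."""
    sent = sock.sendmsg(chunks)
    if sent != sum(len(c) for c in chunks):
        # A stream socket may accept only part of the data; a real sender
        # would retry with the remainder.
        raise RuntimeError("partial send")
    return 1


def recv_exact(sock, n):
    data = bytearray()
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            break
        data += chunk
    return bytes(data)


def demo():
    chunks = [b"x" * 16 for _ in range(64)]
    total = sum(len(c) for c in chunks)
    a, b = socket.socketpair()
    with a, b:
        slow_calls = send_one_by_one(a, chunks)
        first = recv_exact(b, total)
        fast_calls = send_batched(a, chunks)
        second = recv_exact(b, total)
    return slow_calls, fast_calls, first == second == b"".join(chunks)


if __name__ == "__main__":
    print(demo())
```

Batching of this kind is available without leaving the kernel stack, so whether the remaining context switches matter is still a question for a profiler.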

In general, I don't think it makes much difference, but I expected companies like CF that are performance- and outage-sensitive to squeeze every last drop of performance and reliability out of their systems.
