How to escape the Linux networking stack

(blog.cloudflare.com)

1. seabrookmx ◴[17 Nov 25 18:52 UTC] No.45956673[source]▶

I had to read their article on "soft-unicast" before I could really grok this one: https://blog.cloudflare.com/cloudflare-servers-dont-own-ips-...

2. lazyeye ◴[17 Nov 25 19:23 UTC] No.45957095[source]▶

>>45954638 (OP) #

SLATFATF - "So long and thanks for all the fish" is a Douglas Adams quote

https://en.wikipedia.org/wiki/So_Long,_and_Thanks_for_All_th...

replies(1): >>45958538 #

3. notepad0x90 ◴[17 Nov 25 20:18 UTC] No.45957801[source]▶

>>45954638 (OP) #

I'm slightly surprised cloudflare isn't using a userspace tcp/ip stack already (faster - less context switches and copies). It's the type of company I'd expect to actually need one.

replies(2): >>45958128 #>>45959181 #

4. Droobfest ◴[17 Nov 25 20:51 UTC] No.45958128[source]▶

>>45957801 #

From 2016: https://blog.cloudflare.com/why-we-use-the-linux-kernels-tcp...

replies(1): >>45958213 #

5. notepad0x90 ◴[17 Nov 25 20:59 UTC] No.45958213{3}[source]▶

>>45958128 #

Nice, they know better. But it also makes me wonder, because they're saying "but what if you need to run another app". For things like load balancers, for example, I'd expect you'd only run one app per server on the data plane, with the user space stack handling that, while the OS/services use a different control plane NIC with the kernel stack so that boxes are reachable even if there is link saturation, DDoS, etc.

It also makes me wonder, why is tcp/ip special? The kernel should expose a raw network device. I get physical or layer 2 configuration happening in the kernel, but if it is supposed to do IP, then why stop there, why not TLS as well? Why run a complex network protocol stack in the kernel when you can just expose a configured layer 2 device to a user space process? It sounds like "that's just the way it's always been done" type of a scenario.

replies(3): >>45958565 #>>45959377 #>>45960224 #

6. cestith ◴[17 Nov 25 21:28 UTC] No.45958538[source]▶

>>45957095 #

A few things in the article are Douglas Adams quotes, and more specifically from the Hitchhiker’s Guide series.

Creating the universe being regarded as a mistake and making many unhappy is from those books. So is the idea that whenever someone figures out the universe it gets replaced with something stranger, along with there being evidence that this has already happened repeatedly. The Restaurant at the End of the Universe is referenced in the article.

I’m a bit surprised nothing in the article was mentioned as being “mostly harmless”.

replies(1): >>45960103 #

7. wmf ◴[17 Nov 25 21:31 UTC] No.45958565{4}[source]▶

>>45958213 #

AFAIK Cloudflare runs their whole stack on every machine. I guess that gives them flexibility and maybe better load balancing. They also seem to use only one NIC.

> why is tcp/ip special? The kernel should expose a raw network device. ... Why run a complex network protocol stack in the kernel when you can just expose a configured layer 2 device to a user space process?

Check out the MIT Exokernel project and Solarflare OpenOnload that used this approach. It never really caught on because the old school way is good enough for almost everyone.

> why stop there, why not TLS as well?

kTLS is a thing now (mostly used by Netflix). Back in the day we also had kernel-mode Web servers to save every cycle.

replies(1): >>45960187 #

8. alecco ◴[17 Nov 25 21:38 UTC] No.45958647[source]▶

>>45954638 (OP) #

Given that they're a networking company, I always wondered why they picked Linux over FreeBSD.

replies(3): >>45960281 #>>45961602 #>>45962503 #

9. nomel ◴[17 Nov 25 22:33 UTC] No.45959181[source]▶

>>45957801 #

> faster - less context switches and copies

Isn't neither of those required these days, given the "async"-like and zero-copy interfaces that are now available (like io_uring, where it's still handled by the kernel), along with the near non-existence of single-core processors in modern times?

replies(1): >>45962541 #

10. rcxdude ◴[17 Nov 25 22:55 UTC] No.45959377{4}[source]▶

>>45958213 #

You can do that if you want, but I think part of why tcp/ip is a useful layer of abstraction is that it allows more robust boundaries between applications that may be running on the same machine. If you're just at layer 2 you are basically acting on behalf of the whole box.

11. marginalia_nu ◴[17 Nov 25 23:18 UTC] No.45959577[source]▶

>>45954638 (OP) #

This is extremely tangential, but I was working on setting up some manual network namespaces recently, basically manually reproducing what docker does, to fix some of its faulty assumptions regarding containers having multiple IPs and a single name, which cause all sorts of jank. I had to freshen up on a lot of Linux virtual networking concepts (namespaces, veths, bridge networks, macvlans and various other interfaces), and made a ton of fairly informal notes to make myself sufficiently familiar with the thing to set it up.

Would anyone be interested if I polished it up and maybe added a refresher on the relevant layer 2 networking needed to reason about it? It's a fair bit of work and it's a niche topic, so I'm trying to poll a bit to see if the juice is worth the squeeze.

replies(11): >>45959749 #>>45959968 #>>45960118 #>>45960266 #>>45960554 #>>45960755 #>>45961911 #>>45961983 #>>45962002 #>>45962168 #>>45967111 #

12. manuelangel99 ◴[17 Nov 25 23:39 UTC] No.45959749[source]▶

>>45959577 #

I would def. be interested!

13. msbhvn ◴[18 Nov 25 00:13 UTC] No.45959968[source]▶

>>45959577 #

Please do it. I'm very biased, but I think there would be lots of interest in seeing all that explained in one place in a coherent fashion (you will likely sharpen your own understanding in the process and have the perfect resource for when you next need to revisit these topics).

14. snvzz ◴[18 Nov 25 00:13 UTC] No.45959969[source]▶

>>45954638 (OP) #

Tangentially related, seL4's LionsOS can now act as a router/firewall[0].

0. https://news.ycombinator.com/item?id=45959952

15. gishh ◴[18 Nov 25 00:37 UTC] No.45960103{3}[source]▶

>>45958538 #

One of these days I’ll figure out how to throw myself at the ground and miss.

16. MrResearcher ◴[18 Nov 25 00:39 UTC] No.45960118[source]▶

>>45959577 #

Don't forget to post the link here!

17. bbarnett ◴[18 Nov 25 00:51 UTC] No.45960187{5}[source]▶

>>45958565 #

Was it Tux? I've only used it, a looong time ago, on load balancers.

https://en.wikipedia.org/wiki/TUX_web_server

18. hansvm ◴[18 Nov 25 00:57 UTC] No.45960224{4}[source]▶

>>45958213 #

TCP/IP is, in theory (AFAIK all experiments related to this fizzled out a decade or two ago), a global resource when you start factoring in congestion control. TLS is less obviously something you would want kernel involvement from, give or take the idea of outsourcing crypto to the kernel or some small efficiency gains for some workloads by skipping userspace handoffs, with more gains possible with NIC support.

replies(2): >>45960346 #>>45961233 #

19. HumanOstrich ◴[18 Nov 25 01:05 UTC] No.45960266[source]▶

>>45959577 #

I was actually going down rabbitholes today trying to figure out how to do a sane Docker setup where all the containers couldn't connect to each other. Your notes would be valuable at most any level of polish.

replies(2): >>45961588 #>>45966377 #

20. HumanOstrich ◴[18 Nov 25 01:09 UTC] No.45960281[source]▶

>>45958647 #

Why does being a networking company suggest FreeBSD is the "right" pick?

replies(2): >>45960424 #>>45963102 #

21. Veserv ◴[18 Nov 25 01:22 UTC] No.45960346{5}[source]▶

>>45960224 #

You do want to offload crypto to dedicated hardware otherwise your transport will get stuck at a paltry 40-50 Gb/s per core. However, you do not need more than block decryption; you can leave all of the crypto protocol management in userspace with no material performance impact.

22. password4321 ◴[18 Nov 25 01:36 UTC] No.45960424{3}[source]▶

>>45960281 #

Serving Netflix Video at 400Gb/s on FreeBSD [pdf] (2021)

https://news.ycombinator.com/item?id=28584738

(I don't share this as "the answer" as much as one example from years past.)

replies(2): >>45960480 #>>45962649 #

23. HumanOstrich ◴[18 Nov 25 01:46 UTC] No.45960480{4}[source]▶

>>45960424 #

I think they used FreeBSD because they were already using FreeBSD. The article doesn't mention Linux.

24. ambicapter ◴[18 Nov 25 01:59 UTC] No.45960554[source]▶

>>45959577 #

I would absolutely be interested.

25. globalnode ◴[18 Nov 25 02:31 UTC] No.45960755[source]▶

>>45959577 #

i await your write up!

26. notepad0x90 ◴[18 Nov 25 04:00 UTC] No.45961233{5}[source]▶

>>45960224 #

why can't it be global and user space? DNS resolution for example is done by user space, and it is global.

replies(1): >>45962879 #

27. esseph ◴[18 Nov 25 05:06 UTC] No.45961588{3}[source]▶

>>45960266 #

If you create each container in its own network namespace, they won't be able to.

replies(1): >>45961736 #

28. esseph ◴[18 Nov 25 05:08 UTC] No.45961602[source]▶

>>45958647 #

BSD driver support lags behind pretty bad.

29. HumanOstrich ◴[18 Nov 25 05:35 UTC] No.45961736{4}[source]▶

>>45961588 #

It's a little more complex than that for any non-trivial layout where some containers do need to talk to other containers, but most don't.

replies(2): >>45961964 #>>45968890 #

30. dfedbeef ◴[18 Nov 25 06:09 UTC] No.45961911[source]▶

>>45959577 #

YES

31. brirec ◴[18 Nov 25 06:22 UTC] No.45961964{5}[source]▶

>>45961736 #

You could also create a network for each pair of containers that need to communicate with one another.
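
A rough sketch of what the per-pair idea could look like, assuming Docker's standard `docker network create` and `docker network connect` subcommands. The `pair-<a>-<b>` naming scheme and the container names are made up for illustration, and the script only builds the command strings rather than running them. The helper at the end shows why this can get out of hand: in the worst case every pair needs its own network.

```python
def pair_network_commands(pairs):
    """Build docker CLI commands that give each communicating pair its own network.

    `pairs` is an iterable of (container_a, container_b) tuples. The network
    naming scheme "pair-<a>-<b>" is an assumption made for this sketch.
    """
    cmds = []
    # Normalise (a, b) and (b, a) to the same pair so no network is duplicated.
    unique_pairs = sorted({tuple(sorted(p)) for p in pairs})
    for a, b in unique_pairs:
        net = f"pair-{a}-{b}"
        cmds.append(f"docker network create {net}")
        cmds.append(f"docker network connect {net} {a}")
        cmds.append(f"docker network connect {net} {b}")
    return cmds


def max_pair_networks(n):
    """Worst case: every one of n containers talks to every other one."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    for cmd in pair_network_commands([("web", "db"), ("web", "cache")]):
        print(cmd)
    print("worst case for 10 containers:", max_pair_networks(10), "networks")
```

With the default bridge driver, each of those networks is typically backed by its own Linux bridge on the host, so the count matters.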

replies(2): >>45962220 #>>45964993 #

32. sevg ◴[18 Nov 25 06:25 UTC] No.45961983[source]▶

>>45959577 #

Yes please!

33. teleforce ◴[18 Nov 25 06:29 UTC] No.45962002[source]▶

>>45959577 #

Looking forward to that.

It's about time someone wrote a new Linux networking book covering layers 2 and 3. Between switchdev, nftables and flowtables, there is a lot of new information.

The existing books are already more than two decades old namely Linux Routing and Linux Routers (2nd edition).

34. pmontra ◴[18 Nov 25 06:59 UTC] No.45962168[source]▶

>>45959577 #

Yes of course. It would be great.

35. HumanOstrich ◴[18 Nov 25 07:07 UTC] No.45962220{6}[source]▶

>>45961964 #

That would create an excessive amount of bridges in my case. Also this is another trivial suggestion that anyone can find with a quick search or asking an LLM. Not helpful.

I'm not sure why people are replying to my comment with solutioning and trivial suggestions. All I did was encourage the thread OP to publish their notes. FWIW I've already been through a lot of options for solving my issue, and I've settled on one for now.

replies(1): >>45962711 #

36. pjmlp ◴[18 Nov 25 07:53 UTC] No.45962465[source]▶

>>45954638 (OP) #

I would expect they would do the same as other big scalers, and handle most of the networking in dedicated card firmware:

https://learn.microsoft.com/en-us/azure/azure-boost/overview...

https://learn.microsoft.com/en-us/azure/virtual-network/acce...

37. majke ◴[18 Nov 25 07:59 UTC] No.45962503[source]▶

>>45958647 #

This happened before my watch, but I was always rooting for Linux. Linux is winning on many aspects. Consider the featureset of iptables (CF uses loads of stuff, from "comment" to "tproxy"), bpf for metrics is a killer (ebpf_exporter), bpf for DDoS (XDP), TCP Fast Open, UDP segmentation stuff, kTLS (arguably half-working). Then there are non-networking things like Docker, the virtio ecosystem (vhost), seccomp, namespaces (net namespace for testing network apps is awesome). And the list goes on. Not to mention hiring is easier for Linux admins.

38. majke ◴[18 Nov 25 08:03 UTC] No.45962541{3}[source]▶

>>45959181 #

> > faster - less context switches and copies

This is very much newbie way of thinking. How do you know? Did you profile it?

It turns out there is surprisingly little dumb zero-copy potential at CF. Most of the stuff is TLS, so stuff needs to go through userspace anyway (kTLS exists, but I failed to actually use it, and what about QUIC).

Most of the cpu is burned on dumb things, like application logic. Turns out data copying and encryption and compression are actually pretty fast. I'm not saying these areas aren't ripe for optimization - but the majority of the cost was historically in much more obvious areas.

replies(1): >>45969802 #

39. victorbjorklund ◴[18 Nov 25 08:18 UTC] No.45962649{4}[source]▶

>>45960424 #

To be honest, I think when I heard them speak, they're kind of saying yes, FreeBSD is awesome but that the main reason is that the early people there liked FreeBSD so they just stuck with it. And that it's a good choice, but they don't claim these are things that would be impossible to do with optimizations in Linux.

40. kortilla ◴[18 Nov 25 08:29 UTC] No.45962711{7}[source]▶

>>45962220 #

> I'm not sure why people are replying to my comment with solutioning and trivial suggestions

Because your comment didn’t say you solved it and you asked for notes without any polish as if that would help.

replies(1): >>45966206 #

41. 1718627440 ◴[18 Nov 25 08:53 UTC] No.45962879{6}[source]▶

>>45961233 #

DNS isn't a shared resource that needs to be managed and distributed fairly among programs that don't trust or cooperate with each other.

replies(1): >>45969650 #

42. alecco ◴[18 Nov 25 09:35 UTC] No.45963102{3}[source]▶

>>45960281 #

Because FreeBSD is known for having the best network stack. The code is elegant and clean. And, at least until a few years ago, it was the preferred choice to build routers or firewalls.

AFAIK, they were the first to implement BPF for production ready code almost 3 decades ago.

https://en.wikipedia.org/wiki/Berkeley_Packet_Filter

But all this is opinion and anecdotal. Just pick a random network feature and compare by yourself the Linux and the FreeBSD code.

replies(1): >>45966234 #

43. marginalia_nu ◴[18 Nov 25 12:49 UTC] No.45964993{6}[source]▶

>>45961964 #

If you want point-to-point communication between two network namespaces, you should use veths[1]. I think virtual patch cables is a good mental model for veths.

If you want multiple participants, you use bridges, which are roughly analogous to switches.

[1] https://man7.org/linux/man-pages/man4/veth.4.html
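
As a minimal sketch of the "virtual patch cable" model, here is the sequence of iproute2 commands that would connect two namespaces with a veth pair. The namespace names (`blue`, `green`), the `veth-<ns>` interface names and the 10.0.0.0/24 addresses are placeholders chosen for this example. The script only generates the commands; actually running them needs root.

```python
def veth_pair_commands(ns_a, ns_b, addr_a="10.0.0.1/24", addr_b="10.0.0.2/24"):
    """Return iproute2 commands that link two network namespaces with one veth pair.

    Interface names are derived as "veth-<namespace>" (an assumption for this
    sketch; keep namespace names short, Linux limits interface names to 15 chars).
    """
    if_a, if_b = f"veth-{ns_a}", f"veth-{ns_b}"
    return [
        # Two isolated network namespaces.
        f"ip netns add {ns_a}",
        f"ip netns add {ns_b}",
        # The "patch cable": a veth pair is created with both ends at once.
        f"ip link add {if_a} type veth peer name {if_b}",
        # Plug one end into each namespace.
        f"ip link set {if_a} netns {ns_a}",
        f"ip link set {if_b} netns {ns_b}",
        # Address each end and bring it up from inside its namespace.
        f"ip -n {ns_a} addr add {addr_a} dev {if_a}",
        f"ip -n {ns_b} addr add {addr_b} dev {if_b}",
        f"ip -n {ns_a} link set {if_a} up",
        f"ip -n {ns_b} link set {if_b} up",
    ]


if __name__ == "__main__":
    print("\n".join(veth_pair_commands("blue", "green")))
```

For more than two participants the same pattern applies, except one end of each veth is attached to a bridge instead of directly to the peer namespace.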

44. HumanOstrich ◴[18 Nov 25 14:07 UTC] No.45966206{8}[source]▶

>>45962711 #

I didn't say I settled on a solution for all time. I said "for now". I'm still interested in alternatives.

45. HumanOstrich ◴[18 Nov 25 14:09 UTC] No.45966234{4}[source]▶

>>45963102 #

> But all this is opinion and anecdotal.

Exactly.

replies(1): >>45970232 #

46. aryonoco ◴[18 Nov 25 14:19 UTC] No.45966377{3}[source]▶

>>45960266 #

I put each docker container in a LXC container which effectively uses namespaces, cgroups etc to isolate them.

47. anbotero ◴[18 Nov 25 15:03 UTC] No.45967111[source]▶

>>45959577 #

Most definitely. Not just for myself, but for some of my peers here too.

48. esseph ◴[18 Nov 25 16:58 UTC] No.45968890{5}[source]▶

>>45961736 #

That's a change from what was asked which was isolation between each.

Yes, if they need to talk, share namespaces.

If you don't want a generic but true answer, don't ask a generic question and then be upset when the responses don't have enough detail about your specific situation that you hadn't described :-)

replies(1): >>45971922 #

49. notepad0x90 ◴[18 Nov 25 17:52 UTC] No.45969650{7}[source]▶

>>45962879 #

DNS resolution is a shared resource. The DNS client is typically a user-space OS service that resolves and caches DNS requests. What is resolved by one application is cached and reused by another. But at the app level, there is no deconflicting happening like there is with transport layer protocols. However, the same can be said about IP: IP addresses, like name servers, are configured system wide and shared by all apps.

replies(1): >>45969780 #

50. 1718627440 ◴[18 Nov 25 18:01 UTC] No.45969780{8}[source]▶

>>45969650 #

It can be shared access to a cache, but this is an implementation detail for performance reasons. There is no problem with having different processes resolve DNS with different code. There is a problem if two processes want to control the same IP address, or manage the same TCP port.
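
A small illustration of the port conflict, using only Python's standard `socket` module. The function name is made up for this sketch. It binds one TCP socket to a kernel-chosen loopback port, then tries to bind a second socket to the same address and port without any address-reuse options, which the kernel is expected to refuse with `EADDRINUSE`.

```python
import errno
import socket


def second_bind_fails(host="127.0.0.1"):
    """Return True if binding a second TCP socket to an in-use port is rejected."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 asks the kernel to pick any free port for the first socket.
        first.bind((host, 0))
        first.listen(1)
        port = first.getsockname()[1]
        try:
            # Same address and port, no SO_REUSEADDR/SO_REUSEPORT set.
            second.bind((host, port))
        except OSError as exc:
            return exc.errno == errno.EADDRINUSE
        return False
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    print("second bind rejected:", second_bind_fails())
```

The arbitration here is done by the kernel, which is the piece a user-space stack would have to reimplement or coordinate some other way.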

replies(1): >>45974584 #

51. notepad0x90 ◴[18 Nov 25 18:03 UTC] No.45969802{4}[source]▶

>>45962541 #

> This is very much newbie way of thinking. How do you know? Did you profile it?

Does it matter? Fewer syscalls is better. Whatever is being done in kernel mode can be replicated (or improved upon much more) in a user-space stack. It is easier to add/manage APIs in user space than kernel APIs. You can debug, patch, etc. a user space stack much more easily. You can have multiple processes for redundancy, and ensure crashes don't take out the whole system. I've had situations where rebooting the system was the only solution to routing or arp resolution issues (even after clearing caches). Same with netfilter/iptables "being stuck" or exhibiting performance degradation over time. If you're lucky a module reload can fix it; if it was a process I could have just killed/restarted it with minimal disruption.

> Most of the cpu is burned on dumb things, like application logic. Turns out data copying and encryption and compression are actually pretty fast. I'm not saying these areas aren't ripe for optimization - but the majority of the cost was historically in much more obvious areas.

I won't disagree with that, but one optimization does not preclude the other. if ip/tcp were user-space, they could be optimized better by engineers to fit their use cases. The type of load matters too, you can optimize your app well, but one corner case could tie up your app logic in cpu cycles, if that happens to include a syscall, and if there is no better way to handle it, those context switch cycles might start mattering.

In general, I don't think it makes much difference..but I expected companies like CF that are performance and outage sensitive to strain every last drop of performance and reliability out of their system.

52. alecco ◴[18 Nov 25 18:41 UTC] No.45970232{5}[source]▶

>>45966234 #

> But all this is opinion and anecdotal. Just pick a random network feature and compare by yourself the Linux and the FreeBSD code.

Why did you take my self-criticism out of context and omit the second part of the line showing how you can see this for yourself?

replies(1): >>45971943 #

53. HumanOstrich ◴[18 Nov 25 20:55 UTC] No.45971922{6}[source]▶

>>45968890 #

I didn't ask a question and I wasn't upset. :-)

replies(1): >>45979652 #

54. HumanOstrich ◴[18 Nov 25 20:57 UTC] No.45971943{6}[source]▶

>>45970232 #

"Go research it yourself" does not back up your claim that FreeBSD is the "best" for networking.

55. notepad0x90 ◴[19 Nov 25 01:06 UTC] No.45974584{9}[source]▶

>>45969780 #

Yeah, but there is still no reason why an "ip_stack" process can't ensure a different IP isn't used and a "gnu_tcp" or whatever process can't ensure tcp ports are assigned to only one calling process. An exclusive lock on the raw layer 2 device is what you're looking for, I think. I mean, right now applications can just open a raw socket and use a conflicting tcp port. I've done that to kill TCP connections matching some criteria by sending the remote end an RST pretending to be the real process (legit use case). Which approach is more performant, secure, and resilient? That's what I'm asking here.

56. esseph ◴[19 Nov 25 14:06 UTC] No.45979652{7}[source]▶

>>45971922 #

If you need more / different isolation, you're going to need custom nftables/ebtables rules.

In another model you could drop each bridge onto a unique vlan, and firewall them.

There's tons of options out there.

Anyway, if you had more specifics to go off of, there are plenty of network engineers and kubernetes/docker admins floating around willing to help - maybe start an Ask HN post?

replies(1): >>45980376 #

57. HumanOstrich ◴[19 Nov 25 14:59 UTC] No.45980376{8}[source]▶

>>45979652 #

You're still offering suggestions I said I didn't ask for. I'm sure you're trying to help, but at this point you're coming across as passive-aggressive.

replies(1): >>45980470 #

58. esseph ◴[19 Nov 25 15:07 UTC] No.45980470{9}[source]▶

>>45980376 #

You asked for the notes of somebody that's done isolation in different ways in docker.

Your responses have confused me so much I showed them to my partner, who is also confused.

replies(1): >>45980610 #

59. HumanOstrich ◴[19 Nov 25 15:17 UTC] No.45980610{10}[source]▶

>>45980470 #

I asked the person I was replying to for their notes because they were asking if anyone was interested in them.
