
188 points ilove_banh_mi | 1 comment
slt2021 ◴[] No.42170020[source]
The problem with trying to replace TCP only inside the DC is that TCP will still be used outside the DC.

Network engineering is already convoluted and troublesome as it is right now, using only the TCP stack.

When you start using Homa inside but TCP outside, things will break, because a lot of DC requests are created in response to an inbound request from outside the DC (like a client trying to send an RPC request).

I cannot imagine trying to troubleshoot hybrid problems at the intersection of TCP and Homa; it's gonna be a nightmare.

Plus, I don't understand why you would create a new L4 transport protocol for a specific L7 application (RPC). This seems like a suboptimal choice, because today's RPC could be replaced with something completely different, like RDMA over Ethernet for AI workloads, or transfer of large streams like training data / AI model state.

I think tuning the TCP stack in the kernel, adding more configuration knobs for TCP, and switching from stream (TCP) to packet (UDP) protocols where warranted will give more incremental benefits.
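
To make that concrete, here is a minimal sketch of the kind of per-socket knobs already available today (Python on a Linux host; the buffer sizes are illustrative values, not recommendations):

```python
import socket

# Hypothetical tuning for a latency-sensitive intra-DC RPC client.
# All constants below exist in the standard socket module on Linux;
# the chosen values are illustrative only.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Disable Nagle's algorithm so small RPC requests are not batched.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Enlarge kernel buffers for high bandwidth-delay-product links.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4 * 1024 * 1024)

# Fall back to UDP (SOCK_DGRAM) for request/response patterns where
# stream semantics and head-of-line blocking are not wanted.
udp_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
```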

One major thing the author missed is security; these features are considered table stakes:
1. encryption in transit: handshake/negotiation
2. ability to intercept and do traffic inspection for enterprise security purposes
3. resistance to attacks like floods
4. security of sockets in containerized Linux environments
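
Point 1 alone illustrates how much of the existing ecosystem rides on TCP. As a rough sketch of what encryption in transit costs an application today (Python's standard ssl module; the hostname and port are placeholders):

```python
import socket
import ssl

# Placeholder endpoint; any TLS-terminated service would do.
HOST, PORT = "internal-api.example.com", 443

# Default context: system CA bundle, certificate + hostname verification.
context = ssl.create_default_context()

with socket.create_connection((HOST, PORT)) as raw_sock:
    # The TLS handshake/negotiation rides on top of the TCP connection.
    with context.wrap_socket(raw_sock, server_hostname=HOST) as tls_sock:
        print(tls_sock.version())  # e.g. 'TLSv1.3'
```

A new L4 transport would need equivalents for all four points before it could be exposed outside the DC.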

replies(2): >>42170162 #>>42170268 #
jayd16 ◴[] No.42170268[source]
Are you imagining external TCP traffic will be translated at the load balancer, or are you actually worried that requests out of an API Gateway need to be identical to what goes in?

I could see the former being an issue (if that's even implied by "inside the data center"), but I just don't see how it's a problem for the latter.

replies(1): >>42170965 #
slt2021 ◴[] No.42170965[source]
A typical software L7 load balancer (like nginx) will parse the entire TCP stream and the HTTP headers, and apply a bunch of logic based on the URL and various other HTTP headers.

There is a lot of work going on in userland: filling up the TCP buffer, parsing the HTTP stream, applying a bunch of business logic, creating a downstream connection, sending data, getting the response, etc.

This is a lot of work in userland, and because of that a default nginx config allows something like 1024 concurrent connections per worker (roughly one worker per core), so not a lot.
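
As a rough illustration of why this is expensive, here is a toy sketch of the per-connection work an L7 proxy does; real proxies like nginx do this in optimized, event-driven C, and the upstream address below is made up:

```python
import socket

UPSTREAM = ("10.0.0.17", 8080)  # made-up backend address

def handle_client(client: socket.socket) -> None:
    # 1. Reassemble the inbound TCP stream until the HTTP headers are complete.
    data = b""
    while b"\r\n\r\n" not in data:
        chunk = client.recv(4096)
        if not chunk:
            return
        data += chunk

    # 2. Parse the request line and apply routing/business logic to it.
    request_line = data.split(b"\r\n", 1)[0]
    method, url, version = request_line.split(b" ", 2)

    # 3. Open a separate downstream TCP connection and replay the request.
    with socket.create_connection(UPSTREAM) as upstream:
        upstream.sendall(data)
        response = upstream.recv(65536)

    # 4. Copy the response back to the client and tear everything down.
    client.sendall(response)
    client.close()

if __name__ == "__main__":
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("0.0.0.0", 8000))
    listener.listen(128)
    while True:
        conn, _addr = listener.accept()
        handle_client(conn)
```

Every request touches buffers, parsing, and a second connection, all in userland.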

An L4 load balancer, on the other hand, works purely in packet-switching mode or NAT mode. The work consists of just rewriting header fields (src.ip, src.port, dst.ip, dst.port, proto), and it can use frameworks like vector packet processing (VPP) or Intel DPDK for accelerated packet switching.

Because of that, an L4 load balancer can perform very, very close to line rate, meaning it can load balance connections as fast as packets arrive at the network interface card. Line rate is the theoretical maximum packet-processing rate.
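
To put a number on "line rate", the packet rate is just link speed divided by on-the-wire frame size; a back-of-the-envelope calculation (the 100 Gbit/s link speed is an example figure):

```python
# Back-of-the-envelope line-rate math for an example 100 Gbit/s link.
LINK_BPS = 100e9

# On the wire, each Ethernet frame carries an extra 20 bytes of
# preamble + start-of-frame delimiter + inter-frame gap.
WIRE_OVERHEAD = 20

for frame_bytes in (64, 1500):
    wire_bits = (frame_bytes + WIRE_OVERHEAD) * 8
    pps = LINK_BPS / wire_bits
    print(f"{frame_bytes:>5}-byte frames: {pps / 1e6:6.1f} Mpps")

# ~148.8 Mpps at 64-byte frames, ~8.2 Mpps at 1500-byte frames: the budget
# is tens of nanoseconds per packet, enough for a header rewrite but not
# for stream reassembly.
```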

In the case of stateless L4 load balancing there is no upper bound on the number of concurrent sessions to balance; it will be almost as fast as the core router that feeds it the data.

As you can see, L4 is clearly superior in performance, but the reason an L4 LB is possible at all is that it has TCP inbound and TCP outbound, so the only work required is to rewrite the headers and recalculate the checksums.
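
A sketch of that entire per-packet job in the NAT case (a pure-Python stand-in for what a DPDK/VPP data plane does in a handful of instructions; the rewritten address is a placeholder):

```python
import socket
import struct

def ipv4_checksum(header: bytes) -> int:
    """Standard Internet checksum over the 20-byte IPv4 header."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def rewrite_dst(ip_header: bytearray, new_dst: str) -> bytearray:
    """Rewrite the destination IP of one packet and fix the header checksum."""
    ip_header[16:20] = socket.inet_aton(new_dst)  # dst.ip field
    ip_header[10:12] = b"\x00\x00"                # zero the checksum field
    ip_header[10:12] = struct.pack("!H", ipv4_checksum(bytes(ip_header[:20])))
    # A real NAT also patches the TCP/UDP checksum incrementally, since it
    # covers a pseudo-header containing these addresses (RFC 1624).
    return ip_header
```

No buffering, no reassembly, no per-connection state: the packet is forwarded as soon as the header is patched.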

With Homa, you would need to fully terminate and process the TCP stream before you initiate the Homa connection, meaning you will waste a lot of RAM keeping TCP buffers and reassembling the stream according to TCP sequence numbers. Homa would lose all its benefits in the load-balancing scenario.

The author pitches only one use case for Homa: east-west traffic. But again, these days software is really agnostic to the east-west direction. What your software thinks is running on a server in the next rack could just as well be a server in a different Availability Zone, or a read replica in a different geo region.

And that's the beauty of modern infra: everything is software, everything is ephemeral, and we don't really care whether we are running this in a single DC or across multiple DCs.

Because of that, I think we will stick with TCP as a proven protocol that seamlessly interops when crossing different WAN/LAN/VPN networks.

I am not even talking about software-defined networks like SD-WAN, where transport and signaling are done by the vendor-specific underlay network, and the overlay network is really just an abstraction for users that hides a lot of network discovery and network management underneath.