1. mhandley No.42171095
It's already happening. For the more demanding workloads such as AI training, RDMA has been the norm for a while, either over InfiniBand or Ethernet, with Ethernet gaining ground more recently. RoCE is pretty flawed, though, for the reasons Ousterhout mentions plus others, so a lot of work has been going into new protocols to be implemented in hardware in next-gen high-performance NICs.

The Ultra Ethernet Transport specs aren't public yet so I can only quote the public whitepaper [0]:

"The UEC transport protocol advances beyond the status quo by providing the following:

● An open protocol specification designed from the start to run over IP and Ethernet

● Multipath, packet-spraying delivery that fully utilizes the AI network without causing congestion or head-of-line blocking, eliminating the need for centralized load-balancing algorithms and route controllers

● Incast management mechanisms that control fan-in on the final link to the destination host with minimal drop

● Efficient rate control algorithms that allow the transport to quickly ramp to wire-rate while not causing performance loss for competing flows

● APIs for out-of-order packet delivery with optional in-order completion of messages, maximizing concurrency in the network and application, and minimizing message latency

● Scale for networks of the future, with support for 1,000,000 endpoints

● Performance and optimal network utilization without requiring congestion algorithm parameter tuning specific to the network and workloads

● Designed to achieve wire-rate performance on commodity hardware at 800G, 1.6T and faster Ethernet networks of the future"
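To make the "out-of-order packet delivery with optional in-order completion" bullet a bit more concrete, here is a toy receiver in Python. It is only an illustration of the general idea, not anything taken from the UET spec: the class name `Reassembler`, the `in_order` flag, and the `(msg_id, index, total)` packet fields are all made up for the example. Packets of a message can arrive in any order (as they would under packet spraying), and a message completes once all of its packets have been seen; with `in_order=True`, completions are additionally held back until every earlier message has completed.

```python
from dataclasses import dataclass, field


@dataclass
class Reassembler:
    """Toy receiver (hypothetical, not the UET API)."""
    in_order: bool = False                       # hold completions until earlier messages finish
    next_msg: int = 0                            # next message ID eligible for in-order release
    pending: dict = field(default_factory=dict)  # msg_id -> set of packet indices seen so far
    done: set = field(default_factory=set)       # complete messages waiting for in-order release

    def on_packet(self, msg_id: int, index: int, total: int) -> list:
        """Record one packet; return the message IDs completed by this arrival."""
        seen = self.pending.setdefault(msg_id, set())
        seen.add(index)
        if len(seen) < total:
            return []                            # message still has missing packets
        del self.pending[msg_id]
        if not self.in_order:
            return [msg_id]                      # complete as soon as the message is whole
        # In-order mode: release this message and any directly following
        # messages that were already complete and waiting.
        self.done.add(msg_id)
        released = []
        while self.next_msg in self.done:
            self.done.remove(self.next_msg)
            released.append(self.next_msg)
            self.next_msg += 1
        return released
```

With `in_order=False`, a message whose packets all arrive first completes first, regardless of its ID. With `in_order=True`, message 1 finishing before message 0 returns an empty list, and the later arrival that finishes message 0 releases both.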

You can think of it as the love-child of NDP [2] (including support for packet trimming in Ethernet switches [1]) and something similar to Swift [3] (also see [1]).
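For anyone who hasn't come across packet trimming: the rough idea in NDP is that when a switch's data queue is full, it cuts off the payload and forwards just the header instead of dropping the whole packet, so the receiver learns quickly what went missing. Below is a deliberately simplified Python sketch of that idea. The class `TrimmingQueue`, its parameters, and the strict-priority header queue are assumptions for illustration; they are not the NDP or UET switch behaviour in detail.

```python
from collections import deque


class TrimmingQueue:
    """Toy switch port with payload trimming (hypothetical, heavily simplified)."""

    def __init__(self, data_capacity: int):
        self.data_capacity = data_capacity  # max full packets held in the data queue
        self.data = deque()                 # full packets: (seq, payload)
        self.headers = deque()              # trimmed packets: (seq, None)

    def enqueue(self, seq: int, payload: bytes) -> bool:
        """Queue a packet; return True if it kept its payload, False if trimmed."""
        if len(self.data) < self.data_capacity:
            self.data.append((seq, payload))
            return True
        # Data queue is full: keep only the header rather than dropping the packet.
        self.headers.append((seq, None))
        return False

    def dequeue(self):
        """Send trimmed headers first (strict priority is an assumption here)."""
        if self.headers:
            return self.headers.popleft()
        if self.data:
            return self.data.popleft()
        return None
```

A receiver that sees `(seq, None)` knows packet `seq` was trimmed and can ask for it again right away, rather than waiting for a timeout to infer the loss.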

I don't know if UET itself will be what wins, but my point is the industry is taking the problems seriously and innovating pretty rapidly right now.

Disclaimer: in a previous life I was the editor of the UEC Congestion Control spec.

[0] https://ultraethernet.org/wp-content/uploads/sites/20/2023/1...

[1] https://ultraethernet.org/ultra-ethernet-specification-updat...

[2] https://ccronline.sigcomm.org/wp-content/uploads/2019/10/acm...

[3] https://research.google/pubs/swift-delay-is-simple-and-effec...