188 points ilove_banh_mi | 48 comments
1. UltraSane ◴[] No.42170007[source]
I wonder why Fibre Channel isn't used as a replacement for TCP in the datacenter. It is a very robust L3 protocol. It was designed to connect block storage devices to servers while making the OS think they are directly connected. OSs do NOT tolerate dropped data when reading and writing to block devices, and so Fibre Channel has an extremely robust credit-based flow-control mechanism (buffer-to-buffer credits), which prevents congestion by allowing receivers to control how much data senders can send. I have worked with a lot of VMware clusters that use FC to connect servers to storage arrays and it has ALWAYS worked perfectly.
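A minimal single-process sketch of that receiver-controlled credit idea (a toy illustration, not FC's actual buffer-to-buffer credit protocol; the class names and batch size are made up):

    import queue

    CREDITS_PER_GRANT = 4   # hypothetical batch size for replenishing credits
    TOTAL_FRAMES = 12

    class Receiver:
        def __init__(self):
            self.grants = queue.Queue()
            self.received = []
            self.grants.put(CREDITS_PER_GRANT)   # initial credit grant

        def deliver(self, frame):
            self.received.append(frame)
            # Grant more credit only once a batch of buffers has been consumed.
            if len(self.received) % CREDITS_PER_GRANT == 0:
                self.grants.put(CREDITS_PER_GRANT)

    class Sender:
        def __init__(self, rx):
            self.rx = rx
            self.credits = 0

        def run(self):
            for frame in range(TOTAL_FRAMES):
                while self.credits == 0:
                    # Block until the receiver signals it has buffer space.
                    self.credits += self.rx.grants.get()
                self.rx.deliver(frame)
                self.credits -= 1

    rx = Receiver()
    Sender(rx).run()
    print(rx.received)   # frames arrive only as fast as the receiver permits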
replies(9): >>42170384 #>>42170465 #>>42170698 #>>42171057 #>>42171576 #>>42171890 #>>42174071 #>>42174140 #>>42175585 #
2. Sebb767 ◴[] No.42170384[source]
> I wonder why Fibre Channel isn't used as a replacement for TCP in the datacenter

But it is often used for block storage in datacenters. Using it for anything else is going to be hard, as it is incompatible with TCP.

The problem with not using TCP is the same thing HOMA will face - everything already speaks TCP, nearly all potential hires know TCP and most problems you have with TCP have been solved by smart engineers already. Hardware is also easily available. Once you drop all those advantages, either your scale or your gains need to be massive to make that investment worth it, which is why TCP replacements are so rare outside of FAANG.

replies(2): >>42170978 #>>42171609 #
3. YZF ◴[] No.42170465[source]
Are you suggesting some protocol layer of Fibre Channel to be used over IP over Ethernet?

TCP (in practice) runs on top of (mostly) routed IP networks and network architectures. E.g. a spine/leaf network with BGP. Fibre Channel as I understand it is mostly used in more or less point to point connections? I do see some mention of "Switched Fabric" but is that very common?

replies(1): >>42171091 #
4. wejick ◴[] No.42170698[source]
I'm imagining having shared memory mounted as block storage, then doing the RPC through this block. Some synchronization and polling/notification work will need to be done.
replies(2): >>42171599 #>>42171658 #
5. ksec ◴[] No.42170978[source]
I wonder if there is any work on making something conceptually similar to TCP, a superset or subset of TCP, while offering 50-80% of the benefits of HOMA.

I guess I am old. Every time I see new tech that wants to be hyped, completely throws out everything that is widely supported and working for 80-90% of use cases, isn't battle tested, and may be conceptually complex, I will simply pass.

replies(1): >>42171229 #
6. slt2021 ◴[] No.42171057[source]
My take is that within-datacenter traffic is best served by Ethernet.

Anything on top of Ethernet, and we no longer know where this host is located (because of software-defined networking). It could be a server in the next rack, something in the cloud, or a third-party service.

And that's a feature, not a bug: because everything speaks TCP, we can arbitrarily cut and slice the network just by changing packet forwarding rules. We can partition the network however we want.

We could have a single global IP space shared by cloud, DC, and campus networks, or we could have a Christmas tree of NATs.

As soon as you introduce something other than TCP into the mix, you will have gateways: chokepoints where traffic has to be translated TCP<->Homa, and I don't want to be the person troubleshooting a problem at the intersection of TCP and Homa.

In my opinion, the lowest-level Ethernet should try its best to mirror the actual physical signal flow. Anything on top becomes a software-defined network.

replies(2): >>42171328 #>>42174488 #
7. UltraSane ◴[] No.42171091[source]
Fibre Channel is a routed L3 protocol that can support loop-free multi-path topologies.
replies(1): >>42179900 #
8. Sebb767 ◴[] No.42171229{3}[source]
If you have a sufficiently stable network and/or known failure cases, you can already tune TCP quite a bit with nodelay, large congestion windows etc. There's also QUIC, which basically is a modern implementation of TCP on top of UDP (with some trade-offs chosen with HTTP in mind). Once you stray too far, you'll lose the ability to use off-the-shelf hardware, though, at which point you'll quickly hit the point of diminishing returns - especially when simply upgrading the speed of the network hardware is usually a cheap alternative.
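For example, a couple of those knobs can be set per socket from userspace (a minimal sketch; the peer address and buffer sizes are arbitrary):

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    # Disable Nagle's algorithm so small writes go out immediately
    # instead of being coalesced (trades bandwidth efficiency for latency).
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

    # Ask the kernel for larger send/receive buffers, which indirectly allows
    # a larger effective window on high bandwidth-delay-product links.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4 * 1024 * 1024)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)

    sock.connect(("10.0.0.2", 9000))   # hypothetical in-datacenter peer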
replies(2): >>42173083 #>>42175428 #
9. mafuy ◴[] No.42171328[source]
In data centers/HPC, you need to know which data is flowing where and then you design the hardware around that. Not the other way around. What you describe is a lower requirement level that is much easier to handle.
replies(1): >>42172289 #
10. KaiserPro ◴[] No.42171576[source]
FC was much more expensive than Ethernet, so it needed a reason to be used.

For block storage it is great, if slower than Ethernet.

replies(1): >>42176553 #
11. creshal ◴[] No.42171599[source]
The literal version of this is used by sanlock et al. to implement cluster-wide locks. But the whole "pretending to be block devices" overhead makes it not exactly ideal for anything else.

Drop the "pretending to be block devices" part and you basically end up with InfiniBand. It works well, if you ignore the small problem of "you need to either reimplement all your software to work with RDMA based IPC, or reimplement Ethernet on top of InfiniBand to remain compatible and throw away most advantages of InfiniBand again".

12. jbverschoor ◴[] No.42171609[source]
But you might not need TCP. For example, using file sockets (Unix domain sockets) between an app, db, and HTTP server (rails+pgsql+nginx, for example) has many benefits. The beauty of OSI layers.
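As a rough sketch of the idea, an app can listen on a filesystem path instead of an IP:port using the same socket API (the path here is made up; nginx and PostgreSQL expose equivalent settings for their own sockets):

    import os
    import socket

    SOCKET_PATH = "/tmp/app.sock"   # hypothetical path; real services use their own

    # Server side: bind a stream socket to a filesystem path instead of an IP:port.
    if os.path.exists(SOCKET_PATH):
        os.unlink(SOCKET_PATH)
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(SOCKET_PATH)
    server.listen(1)

    # Client side: same connect/send/recv calls as TCP, but no IP stack involved.
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.connect(SOCKET_PATH)
    client.sendall(b"ping")

    conn, _ = server.accept()
    print(conn.recv(4))   # b'ping'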
replies(3): >>42172291 #>>42172297 #>>42174198 #
13. fmajid ◴[] No.42171658[source]
That's essentially what RDMA is, except it is usually run over InfiniBand, although hyperscalers are wary of Nvidia's control over the technology and are looking for cheaper Ethernet-based alternatives.

https://blogs.nvidia.com/blog/what-is-rdma/

https://dl.acm.org/doi/abs/10.1145/3651890.3672233

replies(1): >>42172621 #
14. holowoodman ◴[] No.42171890[source]
Fibre Channel is far too expensive: you need expensive switches, cables/transceivers and cards in addition to the Ethernet you'll need anyway. And this Fibre Channel hardware is quite limited in what you can do with it, by far not as capable as the usual Ethernet/IP stuff with regard to routing, encryption, tunneling, filtering and what not.

Similar things are happening with stuff like InfiniBand: it has become far too expensive, and Ethernet/RoCE is making inroads in lower- to medium-end installations. Availability is also an issue; Nvidia is the only InfiniBand vendor left.

replies(2): >>42171954 #>>42176658 #
15. bluGill ◴[] No.42171954[source]
There is IP over Fibre Channel, so no need for separate Ethernet. At least in theory; in practice I'm not sure anyone implemented enough parts to make it useful, but the spec exists.
replies(1): >>42176493 #
16. marcosdumay ◴[] No.42172289{3}[source]
That may be true for HPC, but "datacenter" is a generic name that applies to all kinds of structures.
17. oneplane ◴[] No.42172291{3}[source]
That would work on a single host, but the context of the datacenter probably assumes multihost/manyhost workloads.
18. wutwutwat ◴[] No.42172297{3}[source]
Unix sockets can use tcp, udp, or be a raw stream

https://en.wikipedia.org/wiki/Unix_domain_socket#:~:text=The....

Puma creates a `UnixServer`, which is a Ruby stdlib class, using the defaults; it extends `UnixSocket`, which also uses the defaults.

https://github.com/puma/puma/blob/fba741b91780224a1db1c45664...

Those defaults are creating a socket of type `SOCK_STREAM`, which is a tcp socket

> SOCK_STREAM will create a stream socket. A stream socket provides a reliable, bidirectional, and connection-oriented communication channel between two processes. Data are carried using the Transmission Control Protocol (TCP).

https://github.com/ruby/ruby/blob/5124f9ac7513eb590c37717337...

You still have the tcp overhead when using a local unix socket with puma, but you do not have any network overhead.

replies(2): >>42174551 #>>42174673 #
19. hylaride ◴[] No.42172621{3}[source]
If it's a secure internal network, RDMA is probably what you want if you need low-latency data transfer. You can do some very performance-oriented things with it and it works over ethernet or infiniband (the quality of switching gear and network cards matters, though).

Back in ~2012 I was setting up a high-frequency network for a forex company and at the time we deployed Mellanox, and they had some very (at the time) bleeding-edge networking drivers that significantly reduced the overhead of writing to TCP/IP sockets (particularly zero-copy, which TL;DR meant data didn't get shifted around in memory as much and was written to the Ethernet card's buffers almost straight away). That made a huge difference.

I eventually left the firm and my successors tried to replace it with Cisco gear and Intel NICs, and the performance plummeted. That made me laugh, as I had received so much grief pushing for the Mellanox kit (to be fair, they were a scrappy, unheard-of Israeli company at the time).

20. mikepurvis ◴[] No.42173083{4}[source]
QUIC feels very pragmatic in terms of being built on UDP. As a layperson I don't have a sense of what additional gains might be on the table if the UDP layer were also up for reconsideration.
replies(1): >>42173548 #
21. soneil ◴[] No.42173548{5}[source]
UDP has very low cost: the header is pretty much just source and destination ports (plus a length and a checksum). For this low, low price, you get compatibility with existing routing, firewalling, NAT, etc.
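For concreteness, the entire UDP header is 8 bytes: source port, destination port, length, and checksum (a sketch using Python's struct module; the port numbers and payload are arbitrary):

    import struct

    src_port, dst_port = 40000, 9000   # arbitrary example ports
    payload = b"hello"
    length = 8 + len(payload)          # 8-byte header plus data
    checksum = 0                       # 0 means "no checksum", legal for IPv4

    # The whole UDP header: four 16-bit big-endian fields.
    header = struct.pack("!HHHH", src_port, dst_port, length, checksum)
    print(len(header))        # 8
    print(header + payload)   # what would sit inside the IP payload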
22. markhahn ◴[] No.42174071[source]
the question is really: does it have anything vendor-specific, interop-breakers?

FC seems to work nicely in a single-vendor stack, or at least among specific sets of big-name vendors. that's OK for the "enterprise" market, where prices are expected to be high, and where some integrator is getting a handsome profit for making sure the vendors match.

besides consumer, the original non-enterprise market was HPC, and we want no vendor lock-in. hyperscale is really just HPC-for-VM-hosting - more or less culturally compatible.

besides these vendor/price/interop reasons, FC has never done a good job of keeping up. 100/200/400/800 Gb is getting to be common, and is FC there?

resolving congestion is not unique to FC. even IB has more, probably better solutions, but these days, datacenter ethernet is pretty powerful.

replies(1): >>42176648 #
23. tonetegeatinst ◴[] No.42174140[source]
Fiber is attractive. As someone who wants to upgrade to fiber, the main barrier to entry is the cost of switches and a router.

Granted I'm also trying to find a switch that supports ROCm and rdma. Not easy to find a high bandwidth switch that supports this stuff without breaking the bank.

replies(3): >>42174857 #>>42175957 #>>42176517 #
24. kjs3 ◴[] No.42174198{3}[source]
How do you think 'file-sockets' are implemented over a network?
replies(1): >>42181820 #
25. gsich ◴[] No.42174488[source]
>Anything on top of Ethernet, and we no longer know where this host is located (because of software defined networking). Could be next rack server, or could be something in the cloud, could be third party service.

Ping it and you can at least deduce where it's not.

26. FaceValuable ◴[] No.42174551{4}[source]
Hey! I know it’s muddled but that’s not quite correct. SOCK_STREAM is more general than TCP; SOCK_STREAM just means the socket is a byte stream. You would need to add IPPROTO_TCP on top of that to pull in the TCP stack.

UDS using SOCK_STREAM does not do that; i.e., it is not using IPPROTO_TCP.
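A quick way to see the distinction on a Unix-like system (a small sketch; the .proto values just echo what was requested at socket creation):

    import socket

    # TCP: Internet address family, stream semantics, TCP explicitly selected.
    tcp_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)

    # Unix domain socket: same stream semantics, but no IP/TCP stack underneath.
    uds_sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)

    print(tcp_sock.proto)   # 6 (IPPROTO_TCP)
    print(uds_sock.proto)   # 0 -- no transport protocol, just a kernel byte stream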

27. shawn_w ◴[] No.42174673{4}[source]
Unix domain stream sockets do not use tcp. Nor do unix datagram sockets use udp. They're much simpler.
replies(1): >>42174775 #
28. wutwutwat ◴[] No.42174775{5}[source]
> The type parameter should be one of two common socket types: stream or datagram.[10] A third socket type is available for experimental design: raw.

> SOCK_STREAM will create a stream socket. A stream socket provides a reliable, bidirectional, and connection-oriented communication channel between two processes. Data are carried using the Transmission Control Protocol (TCP).

> SOCK_DGRAM will create a datagram socket.[b] A Datagram socket does not guarantee reliability and is connectionless. As a result, the transmission is faster. Data are carried using the User Datagram Protocol (UDP).

> SOCK_RAW will create an Internet Protocol (IP) datagram socket. A Raw socket skips the TCP/UDP transport layer and sends the packets directly to the network layer.

I don't claim to be an expert; I just have a certain confidence that I'm able to comprehend words I read. It seems you can have three types of sockets: raw, UDP, or TCP.

https://en.wikipedia.org/wiki/Unix_domain_socket

replies(3): >>42175509 #>>42176117 #>>42192860 #
29. jabl ◴[] No.42174857[source]
Fiber, as in Fibre Channel (FC, https://en.wikipedia.org/wiki/Fibre_Channel ), not fiber as in "optical fiber" instead of copper cabling.
30. mananaysiempre ◴[] No.42175428{4}[source]
One issue with QUIC in e.g. C is how heavyweight it feels to use compared to garden-variety BSD sockets (and that’s already not the most ergonomic of APIs). I haven’t encountered a QUIC library that didn’t feel like it would absolutely dominate a simple application in both code size and API-design pressure. Of course, for large applications that’s less relevant, but the result is that switching to QUIC gets perceived as a Big Deal, a step for when you’re doing Serious Stuff. That’s not ideal.

I’d love to just play with QUIC a bit because it’s pretty neat, but I always get distracted by this problem and end up reading the RFCs, which so far I haven’t had the patience to get through.

31. FaceValuable ◴[] No.42175509{6}[source]
Interesting! The Wikipedia article is quite wrong here. SOCK_STREAM certainly doesn’t imply TCP in all cases. I see the source is the Linux Programming Interface book; quite likely someone’s interpretation of that chapter was just wrong when they wrote this article. It is a subtle topic.
32. Sylamore ◴[] No.42175585[source]
InfiniBand would make more sense than Fibre Channel
replies(1): >>42176542 #
33. _zoltan_ ◴[] No.42175957[source]
SN2100/2700 from eBay?
34. shawn_w ◴[] No.42176117{6}[source]
Not the first time a Wikipedia article has been wrong. That one seems to be talking about IP sockets and local ones at the same time instead of focusing on local ones. Could definitely stand to be rewritten.
35. UltraSane ◴[] No.42176493{3}[source]
No. When I heard about Cisco having FC over Ethernet for their UCS servers I was grossed out, because Ethernet is an L2 protocol that can't handle multi-path without ugly hacks like Virtual Port Channel, and I discovered that there is no real support for IP over Fibre Channel. There is a Wikipedia page for IPFC but it seems to be completely dead.

https://en.wikipedia.org/wiki/IPFC

36. UltraSane ◴[] No.42176517[source]
Fibre Channel is a routed protocol invented specifically to connect block storage arrays to remote servers while making the block storage look locally attached to the OS. And it works REALLY REALLY well.
37. UltraSane ◴[] No.42176542[source]
Maybe, but InfiniBand is really expensive and InfiniBand switches are in short supply. And it is an L2 protocol, while FC is L3.
38. UltraSane ◴[] No.42176553[source]
Is there a fundamental reason for it being more expensive than Ethernet, or is it just greed?
replies(1): >>42183545 #
39. UltraSane ◴[] No.42176648[source]
FC speeds have really lagged. 64Gbps is available but not widely used, and 128Gbps was introduced in 2023. But since by definition 100% of FC bandwidth can only be used for storage, it has been adequate.

https://fibrechannel.org/preview-the-new-fibre-channel-speed...

40. UltraSane ◴[] No.42176658[source]
Is there a fundamental reason why FC is more expensive than Ethernet?
replies(1): >>42177094 #
41. convolvatron ◴[] No.42177094{3}[source]
since its entire reason to exist was to effect an artificial market segmentation, I guess the answer is .. yes?
replies(1): >>42180658 #
42. YZF ◴[] No.42179900{3}[source]
I'll admit I'm not familiar with the routing protocols used for Fibre Channel. Is there some equivalent of BGP? How well does it scale? What vendors sell FC switches and what's the cost compared to Ethernet/IP/TCP?
replies(1): >>42180728 #
43. UltraSane ◴[] No.42180658{4}[source]
It isn't artificial market segmentation. Fibre Channel is a no-compromise technology with a single purpose, to connect servers to remote storage with performance and reliability close to directly attached storage, and it does that really, REALLY well. It is by far the single most bulletproof technology I have ever used. In a parallel universe where FC won over Ethernet and every Ethernet port in the world was an FC port, I don't see why it would be any more expensive than Ethernet.
44. UltraSane ◴[] No.42180728{4}[source]
FC uses Fabric Shortest Path First, which is a lot like OSPF. It can scale to 2^64 ports. There is no FC equivalent of BGP. Broadcom, Cisco, HP, Lenovo, and IBM sell FC switches, but some of them are probably rebadged Broadcom switches. The worst thing about FC is that the switches are licensed per port, so you might not be able to use all the ports on the device. A Brocade G720 with 24 usable 32Gb ports is $28,000 on CDW. It has 64 physical ports; a 24-port license is $31,000 on CDW. So it is REALLY freaking expensive. But for servers a company can't make money without, it is absolutely worth it. One place I worked had an old EMC FC SAN with 8 years of 100% uptime.
45. jbverschoor ◴[] No.42181820{4}[source]
They don’t have to use TCP. The point was to use sockets as the abstraction layer and use another inter-connect instead of TCP/IP. That way you’ve easily replaced TCP in the datacenter without major changes to many applications
replies(1): >>42185096 #
46. KaiserPro ◴[] No.42183545{3}[source]
At the time it was rarer and required more specialised hardware.

FC was one of the first non-mainframe specific storage area network fabrics. One of the key differences between FC and ethernet is the collision detection/avoidance and guaranteed throughput. All that extra coordination takes effort on big switches, so it cost more to develop/produce.

You could shovel data through it at "line speed" and know that it'll get there, or you'll get an error at the fabric level. Ethernet happily takes packets, and if it can't deliver, it'll just drop them. (Well, kinda; not all Ethernet does that, because Ethernet is the English language of layer 2 protocols.)

47. kjs3 ◴[] No.42185096{5}[source]
Oh...your argument is replacing TCP is the easy part. Gotcha. Sure.
48. rcxdude ◴[] No.42192860{6}[source]
Yeah, someone's gotten confused. SOCK_STREAM and SOCK_DGRAM imply TCP and UDP respectively when using AF_INET sockets, but not when using AF_UNIX sockets, though Unix domain sockets do often do an impression of a TCP socket (e.g. they will report a connection state in the same way as TCP). Reads and writes to Unix domain sockets essentially amount to the kernel copying memory between different processes (interestingly, Linux will also do the same for local TCP/UDP connections as an optimisation, so the data never actually gets formatted into separate packets). This also accounts for some of the things you can do with Unix sockets that you can't do with a network protocol, like passing permissions and file descriptors across them.
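For instance, shipping an open file descriptor between processes over a Unix socket looks roughly like this (a sketch using the socket.send_fds/recv_fds helpers available since Python 3.9; the file path is just an example):

    import os
    import socket

    # A connected pair of Unix domain sockets, standing in for two processes.
    parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

    # "Parent" opens a file and ships the descriptor itself, not the contents.
    with open("/etc/hostname", "rb") as f:
        socket.send_fds(parent, [b"here you go"], [f.fileno()])

    # "Child" receives the message plus a duplicated descriptor it can use directly.
    msg, fds, _, _ = socket.recv_fds(child, 1024, maxfds=1)
    print(msg)                   # b'here you go'
    print(os.read(fds[0], 64))   # reads from the same open file
    os.close(fds[0])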