
93 points by rbanffy | 2 comments
declan_roberts No.42188424
Do supercomputers need proximity between compute nodes to perform this kind of computation?

I wonder what would happen if Apple offered people something like iCloud+ in exchange for using their idle M4 compute at night as part of a distributed supercomputer.

replies(3): >>42188437 >>42188458 >>42188486
1. theideaofcoffee No.42188458
The thing that sets these machines apart from something you could set up in AWS (to some degree), or in a distributed sense like you're suggesting, is the interconnect: how the compute nodes communicate. For a large system like El Capitan, a large chunk of the cost goes into connecting the nodes together, with low latency and interesting topologies that Ethernet, or even InfiniBand, can't get close to. Code that does a lot of DMA or message passing will happily consume all the bandwidth that's available; that becomes the primary bottleneck in these systems.

The interconnect has been Cray's bread and butter for multiple decades: Slingshot, Dragonfly, Aries, Gemini, SeaStar, NUMAlink via SGI, etc., plus the interconnects built for the less massively parallel systems before those.
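
A minimal sketch of the kind of message-passing pattern that ends up interconnect-bound: a nonblocking MPI halo-style exchange around a ring of ranks, written in C. The buffer size and ring layout here are arbitrary assumptions for illustration, not taken from the comment.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Arbitrary halo size; real codes exchange boundary data every step. */
        const int N = 1 << 20;
        double *sendbuf = malloc(N * sizeof(double));
        double *recvbuf = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) sendbuf[i] = rank;

        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;

        /* Nonblocking exchange: every rank ships a halo to its right neighbor
           while receiving one from its left. With thousands of ranks doing
           this each timestep, the fabric, not the CPU, sets the pace. */
        MPI_Request req[2];
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }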

replies(1): >>42189849
2. sliken No.42189849
I've seen nothing showing that Slingshot has any particular advantage over IB for HPC. Sure, HPE pushes Slingshot (an HPE interconnect) rather than handing bags of money to Nvidia, but that's a business decision. Eagle (the #4 cluster on the list) is InfiniBand NDR.

I believe 306 of the top 500 clusters use InfiniBand. Pretty sure the advanced topologies like Dragonfly are supported on IB as well as Slingshot. From what I can tell, Slingshot is much like Ultra Ethernet: trying to take the best of IB and Ethernet and making a new standard. Slingshot 11 latency looks much like what I got with Omni-Path/PathScale back when dual-core Opterons were the cutting edge.
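
Interconnect latencies of the kind compared here are usually measured with a ping-pong microbenchmark between two ranks. A minimal sketch in C with MPI, with an arbitrary message size and iteration count; run it with at least two ranks (e.g. mpirun -np 2) on different nodes to measure the fabric rather than shared memory.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int iters = 10000;   /* arbitrary iteration count */
        char buf[8];               /* tiny message: measures latency, not bandwidth */

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        /* Half the average round-trip time is the usual one-way latency figure. */
        if (rank == 0)
            printf("one-way latency: %.2f us\n",
                   (t1 - t0) / (2.0 * iters) * 1e6);

        MPI_Finalize();
        return 0;
    }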