
Google's Liquid Cooling

(chipsandcheese.com)
399 points | giuliomagnifico | 10 comments
1. m463 ◴[] No.45018271[source]
I wonder what the economics of water cooling really is.

Is it because chips are getting more expensive, so it is more economical to run them faster by liquid cooling them?

Or is it that data center footprint is more expensive, so denser liquid cooling makes more sense?

Or is it that wiring distances (1ft = 1nanosecond) make dense computing faster and more efficient?

replies(6): >>45018323 #>>45018352 #>>45018353 #>>45020042 #>>45022675 #>>45025262 #
2. summerlight ◴[] No.45018323[source]
Not sure about classical computing demands, but I think wiring distances definitely matter for TPU-like memory-heavy computation.
3. moffkalast ◴[] No.45018352[source]
It's more of a testament to inefficiency: TDPs rise year after year as losses grow on smaller process nodes. It's so atrocious that even in the consumer sector, Nvidia can't design a connector that doesn't melt during normal use, because their power draw has become beyond absurd.

Unfortunately, people don't really complain about crappy shovels during a gold rush; they're just happy they got one before they ran out. The vendors have no incentive to innovate on efficiency while the performance line keeps going up.

4. MurkyLabs ◴[] No.45018353[source]
It's a mixture of 2 and 3. The chips are getting hotter because more stuff is being packed into a smaller space and more power is being pushed through it. At the same time, powering all the fans that cool the servers takes a lot of electricity (when you have racks and racks of them, those small fans add up quickly), and the heat gets blown into hot aisles, from which it then has to be moved to A/C units. With liquid cooling you save on electricity and get direct liquid-to-liquid heat exchange, as opposed to chip->air->A/C->liquid. ServeTheHome did a write-up on it last year, https://www.servethehome.com/estimating-the-power-consumptio...
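A rough illustration of the electricity side of that argument (the PUE figures, rack power, and price below are made-up assumptions for the sketch, not numbers from the article or the ServeTheHome write-up):

    HOURS_PER_YEAR = 24 * 365
    IT_KW_PER_RACK = 40.0     # assumed IT load per dense rack
    RACKS = 500               # assumed facility size
    PRICE_PER_KWH = 0.08      # assumed industrial electricity price, USD
    PUE = {"air": 1.5, "liquid": 1.15}   # assumed cooling/overhead factors

    it_kw = IT_KW_PER_RACK * RACKS
    for name, pue in PUE.items():
        overhead_kw = it_kw * (pue - 1)   # power spent on cooling and overhead
        cost = overhead_kw * HOURS_PER_YEAR * PRICE_PER_KWH
        print(f"{name:>6}: overhead {overhead_kw:,.0f} kW, ~${cost:,.0f}/yr")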
replies(1): >>45018493 #
5. mikepurvis ◴[] No.45018493[source]
I've never done DC ops, but I bet fan failure is a factor too. Basically, there'd be a benefit to centralizing the cooling for N racks in 2-3 large redundant pumps rather than having each node bring its own battalion of fans, all of which will individually fail in a bell curve centered on 30k hours of operation, with each failure knocking that node out and requiring hands-on maintenance.
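A crude back-of-envelope on the maintenance load (the fan/pump counts and lifetimes below are illustrative assumptions, and lifetimes are treated as a simple mean rather than the bell curve described above):

    HOURS_PER_YEAR = 24 * 365
    NODES = 1000              # assumed node count behind a shared cooling loop
    FANS_PER_NODE = 6         # assumed fans per server
    FAN_MTTF_HOURS = 30_000   # mean fan lifetime, per the comment above
    PUMPS = 3                 # assumed shared redundant pumps (N+1)
    PUMP_MTTF_HOURS = 50_000  # assumed pump lifetime

    fan_swaps = NODES * FANS_PER_NODE * HOURS_PER_YEAR / FAN_MTTF_HOURS
    pump_swaps = PUMPS * HOURS_PER_YEAR / PUMP_MTTF_HOURS
    print(f"expected fan swaps/year:  {fan_swaps:,.0f}")   # ~1,750
    print(f"expected pump swaps/year: {pump_swaps:.2f}")   # ~0.5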
replies(1): >>45022725 #
6. xadhominemx ◴[] No.45020042[source]
It’s because the chips are networked with very high bitrates and need to be physically densely packed together.
7. jabl ◴[] No.45022675[source]
> Or is it that wiring distances (1ft = 1nanosecond) make dense computing faster and more efficient?

Contrary to other posters, I'd argue this effect is relatively small. A really good interconnect fabric might give you ping-pong times on the order of 1 microsecond, which is still 1000 times larger than a nanosecond, and most of that delay is in the switches and the end nodes, not in the signal traveling over the wire or fiber.

Take a large-ish cluster with a diameter of 100 feet (something like 7 rows of racks, each row 100 feet long, give or take). If liquid cooling allows you to double the density, you could condense it to a diameter of 100/sqrt(2) ≈ 70 ft (about 5 rows of 70 ft each). Since a ping-pong involves a signal going both ways, the worst-case increase in signal delay would be (100-70)*2 = 60 ft, or 60 nanoseconds (in reality somewhat more, since cables have to be routed). So about a 6% increase if we assume a 1 microsecond baseline: measurable, yes, but likely a very small effect on application performance as opposed to a ping-pong microbenchmark.
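The same arithmetic as a small Python sketch (using the assumed numbers above: ~1 ns per foot, a 1 microsecond ping-pong baseline, and density doubled by liquid cooling):

    import math

    NS_PER_FOOT = 1.0               # rough signal propagation delay
    BASELINE_PING_PONG_NS = 1000.0  # assumed fabric ping-pong time
    DIAMETER_FT = 100.0             # assumed cluster diameter, air cooled
    DENSITY_FACTOR = 2.0            # assumed densification from liquid cooling

    dense_diameter_ft = DIAMETER_FT / math.sqrt(DENSITY_FACTOR)  # ~70.7 ft
    # A ping-pong crosses the cluster twice, so the worst-case extra wire
    # delay of the less dense layout is twice the diameter difference.
    extra_wire_ns = 2 * (DIAMETER_FT - dense_diameter_ft) * NS_PER_FOOT

    print(f"extra wire delay: {extra_wire_ns:.0f} ns "
          f"({100 * extra_wire_ns / BASELINE_PING_PONG_NS:.0f}% of baseline)")
    # -> roughly 60 ns, i.e. about 6% of the assumed 1 us ping-pong time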

Now where it can matter is that by packing the components more closely together, you can connect more chips via backplane and/or copper connectors vs. having to use optics.

8. jabl ◴[] No.45022725{3}[source]
A cool (ha ha!) solution was the old Cray XT3/4 supercomputers, which were air cooled. But instead of a battalion of tiny fans, each cabinet had a single huge fan at the bottom, blowing air vertically through the cabinet (the boards were mounted vertically). No redundancy, sure, but AFAIU it was reliable enough to not be a problem in practice.
replies(1): >>45025991 #
9. mnw21cam ◴[] No.45025262[source]
There's also the fact that a large portion of the power used by a data centre goes to cooling, so anything that makes the cooling more efficient is a direct cost saving.
10. mikepurvis ◴[] No.45025991{4}[source]
That’s a similar design principle to the Mac Pro trashcan, I guess, which also pulled air through a central column alongside vertical PCBs/heatsinks.