
Google's Liquid Cooling

(chipsandcheese.com)
399 points by giuliomagnifico | 1 comment | source
jonathaneunice ◴[] No.45017586[source]
It’s very odd when mainframes (S/3x0, Cray, yadda yadda) have been extensively water-cooled for over 50 years, and super-dense HPC data centers have used liquid cooling for at least 20, to hear Google-scale data center design compared to PC hobbyist rigs. Selective amnesia + laughably off-target point of comparison.
replies(6): >>45017651 #>>45017716 #>>45018092 #>>45018513 #>>45018785 #>>45021044 #
liquidgecka ◴[] No.45018513[source]
[bri3d pointed out that I missed an element of this: there is a heat exchange between rack-level and machine-level coolant, which makes this far less novel than I had initially understood. See their direct comment on this.]

I posted this further down in a reply-to-a-reply, but I should call it out a little closer to the top: the innovation here is not “we are using water for cooling”. The innovation is that they are cooling the servers directly with chillers that sit outside the facility. Most mainframes use water cooling to get the heat from the core out to the edges, where it can be picked up by traditional heatsinks and cooling fans. Even home PCs do this by moving the heat to a reservoir that can be more effectively cooled.

What Google is doing is using the huge chillers that would normally cool the air in the facility to cool water that is pumped directly into every server. The return water is then cooled at the chiller tower. This eliminates ANY air-based transfer besides the chiller tower. And this isn't being done for a server or a rack... it's being done for the whole data center at once.
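
For a sense of scale, here's a quick back-of-the-envelope sketch (all numbers are my own illustrative assumptions, not anything Google has published): the water flow needed to carry a rack's heat away at a given temperature rise falls out of Q = m_dot * c_p * dT.

    # Rough flow-rate estimate for direct liquid cooling (illustrative numbers only).
    RACK_POWER_W = 100_000     # assume a 100 kW rack
    CP_WATER = 4186            # J/(kg*K), specific heat of water
    DELTA_T_K = 10             # assumed supply-to-return temperature rise

    mass_flow = RACK_POWER_W / (CP_WATER * DELTA_T_K)  # kg/s
    liters_per_min = mass_flow * 60                    # ~1 kg of water is ~1 L

    print(f"{mass_flow:.1f} kg/s, roughly {liters_per_min:.0f} L/min per rack")
    # -> ~2.4 kg/s, ~143 L/min: plumbing that to every rack is a facility-scale job.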

I am super curious how they handle things like chiller maintenance or pump failures. I am sure they have redundancy, but the system for that has to be super impressive, because the loop can't be offline for long before you start seeing hardware failures!
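
Rough numbers on why that window is so tight (again, assumed figures just to frame the problem): once the loop stalls, the water sitting in the rack is the only thermal buffer you have.

    # How long until stalled coolant heats past a safe margin? (illustrative assumptions)
    RACK_POWER_W = 100_000    # assumed rack heat load
    WATER_IN_RACK_KG = 50     # assumed coolant resident in the rack's loop
    CP_WATER = 4186           # J/(kg*K)
    HEADROOM_K = 20           # assumed margin before chips must throttle or trip

    seconds = WATER_IN_RACK_KG * CP_WATER * HEADROOM_K / RACK_POWER_W
    print(f"~{seconds:.0f} s of thermal buffer")  # ~42 s, so pump failover has to be fast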

[Edit: It was pointed out in another comment that AWS is doing this as well and honestly their pictures make it way clearer what is happening: https://www.aboutamazon.com/news/aws/aws-liquid-cooling-data...]

replies(5): >>45018536 #>>45018749 #>>45018898 #>>45019376 #>>45023339 #
nitwit005 ◴[] No.45018749[source]
This was before I was born, so I'm hardly an expert, but I've heard of feeding IBM mainframes chilled water. A quick check of Wikipedia found some mention of the idea: https://en.wikipedia.org/wiki/IBM_3090
replies(2): >>45018879 #>>45019802 #
jauntywundrkind ◴[] No.45019802[source]
Having to pre-chill water (via a refrigeration cycle) is radically less efficient than simply collecting the heat and then rejecting it. The refrigeration cycle burns considerable extra energy up front just to deliver the chilled water. Gathering the heat and sending it out, dealing with it after it is produced rather than in advance, should be much more energy efficient.
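
A rough way to see the gap (my own order-of-magnitude numbers): a refrigeration chiller spends about 1 W of compressor work for every COP watts of heat it removes, while a plain heat-rejection loop mostly just pays for fans and pumps.

    # Overhead energy to get rid of 1 MW of IT heat: chiller vs. heat rejection only.
    # All figures are assumed, typical-order values, not measurements.
    IT_HEAT_W = 1_000_000

    CHILLER_COP = 4.0             # assumed coefficient of performance
    chiller_power = IT_HEAT_W / CHILLER_COP      # compressor work

    REJECTION_OVERHEAD = 0.05     # assume fans + pumps draw ~5% of the heat load
    rejection_power = IT_HEAT_W * REJECTION_OVERHEAD

    print(f"chiller: {chiller_power/1e3:.0f} kW, rejection only: {rejection_power/1e3:.0f} kW")
    # -> roughly 250 kW vs 50 kW of overhead for the same 1 MW of heat.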

I don't know why it surprises me so much, but having these rack-sized CDU heat exchangers was quite a surprise, quite novel to me. Having a relatively small closed loop versus one big loop that has to go outside seems like a very big tradeoff, with a fairly material- and space-intensive demand (a rack with 6x CDUs), but the fine-grained control does seem obviously sweet to have. I wish there were a little more justification for the use of heat exchangers!
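
One way to frame the CDU tradeoff (assumed numbers): every liquid-to-liquid exchange costs an approach temperature, so the server loop has to run a few degrees warmer than the facility loop, and what you buy with that is an isolated secondary loop with its own chemistry, pressure, and flow control.

    # Temperature penalty of putting a CDU heat exchanger in the path (illustrative).
    FACILITY_SUPPLY_C = 25.0   # assumed facility-water supply temperature
    APPROACH_K = 3.0           # assumed CDU approach (secondary supply minus facility supply)
    SERVER_DELTA_T_K = 10.0    # assumed rise across the servers

    secondary_supply = FACILITY_SUPPLY_C + APPROACH_K
    secondary_return = secondary_supply + SERVER_DELTA_T_K
    print(f"secondary loop: {secondary_supply:.0f} C in, {secondary_return:.0f} C out")
    # A few kelvin of headroom traded for an isolated, tightly controlled loop.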

The way water is distributed within the server is also pretty amazing, with each server having its own "bus bar" of water, and each chip having its own active electro-mechanical valve to control its specific water flow. The TPUv3 design, where cooling happens serially and each chip in the sequence gets hotter and hotter water, seems common-ish, whereas with TPUv4 there's a fully parallel and controllable design.
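
To make the serial-vs-parallel point concrete (illustrative numbers, not TPU specs): in a series chain each chip's inlet water already carries the heat of everything upstream, while in a parallel arrangement every chip sees supply-temperature water.

    # Coolant inlet temperature per chip: serial chain vs. parallel branches.
    # Chip power, flow rate, and supply temperature are assumed for illustration.
    CHIP_POWER_W = 300
    CP_WATER = 4186          # J/(kg*K)
    FLOW_KG_S = 0.03         # assumed flow through the chain / through each branch
    SUPPLY_C = 30.0
    N_CHIPS = 4

    per_chip_rise = CHIP_POWER_W / (CP_WATER * FLOW_KG_S)   # ~2.4 K per chip

    serial_inlets = [SUPPLY_C + i * per_chip_rise for i in range(N_CHIPS)]
    parallel_inlets = [SUPPLY_C] * N_CHIPS

    print("serial:  ", [round(t, 1) for t in serial_inlets])    # 30.0, 32.4, 34.8, 37.2
    print("parallel:", [round(t, 1) for t in parallel_inlets])  # 30.0 across the board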

Also, the switch from lidded chips to bare dies, with a cold plate that comes down to just above the silicon and channels water across it, is one of those very detailed, fine-grained optimizations that is just so sweet.
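
A hedged sketch of why de-lidding helps (the resistance values are generic assumptions, not Google's measurements): pulling the lid out of the stack also removes the second layer of thermal interface material, so two terms drop out of the junction-to-coolant resistance.

    # Junction temperature with vs. without a lid (illustrative thermal resistances).
    POWER_W = 300
    COOLANT_C = 30.0

    # Assumed stack resistances in K/W
    R_TIM1 = 0.03       # die to lid, or die to cold plate when bare
    R_LID = 0.02        # heat spreader / lid
    R_TIM2 = 0.03       # lid to cold plate
    R_COLDPLATE = 0.04  # cold plate to coolant

    lidded = COOLANT_C + POWER_W * (R_TIM1 + R_LID + R_TIM2 + R_COLDPLATE)
    bare   = COOLANT_C + POWER_W * (R_TIM1 + R_COLDPLATE)

    print(f"lidded: {lidded:.0f} C, bare die: {bare:.0f} C")  # ~66 C vs ~51 C at the junction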