
Google's Liquid Cooling

(chipsandcheese.com)
399 points | giuliomagnifico | 3 comments
jonathaneunice ◴[] No.45017586[source]
It’s very odd when mainframes (S/3x0, Cray, yadda yadda) have been extensively water-cooled for over 50 years, and super-dense HPC data centers have used liquid cooling for at least 20, to hear Google-scale data center design compared to PC hobbyist rigs. Selective amnesia + laughably off-target point of comparison.
replies(6): >>45017651 #>>45017716 #>>45018092 #>>45018513 #>>45018785 #>>45021044 #
liquidgecka ◴[] No.45018513[source]
[bri3d pointed out that I missed an element of this. There is a transfer between rack level and machine level coolant which makes this far less novel than I had initially understood. See their direct comment to this]

I posted this further down in a reply-to-a-reply, but I should call it out a little closer to the top: The innovation here is not “we are using water for cooling”. The innovation here is that they are directly cooling the servers with chillers that are outside of the facility. Most mainframes use water cooling to get the heat from the core out to the edges, where it can be picked up by traditional heatsinks and cooling fans. Even home PCs do this by moving the heat to a reservoir that can be more effectively cooled.

What Google is doing is using the huge chillers that would normally be cooling the air in the facility to cool water which is pumped directly into every server. The return water is then cooled in the chiller tower. This eliminates ANY air-based transfer besides the chiller tower. This isn't being done on a server or a rack... it's being done on the whole data center all at once.
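To put rough numbers on why pumping water directly through the servers is attractive, here is a back-of-envelope sketch (my own assumptions and figures, not anything from the article) of the coolant flow a rack-scale heat load requires, using the basic heat-balance relation Q = ṁ·c_p·ΔT:

```python
# Back-of-envelope coolant flow for direct liquid cooling.
# Assumed numbers (mine, not Google's): a 100 kW load, water
# entering at 25 C and returning 10 C warmer.

def flow_rate_lps(heat_kw: float, delta_t_c: float) -> float:
    """Liters/second of water needed to carry heat_kw at a given temperature rise.

    Q = m_dot * c_p * dT  ->  m_dot = Q / (c_p * dT)
    c_p of water is ~4.186 kJ/(kg*K); 1 kg of water is ~1 liter.
    """
    c_p = 4.186  # kJ/(kg*K)
    return heat_kw / (c_p * delta_t_c)

print(f"{flow_rate_lps(100, 10):.2f} L/s")  # ~2.39 L/s for 100 kW at a 10 C rise
```

A couple of liters per second per 100 kW is modest plumbing compared to the air volume needed to move the same heat, which is the whole appeal.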

I am super curious how they handle things like chiller maintenance or pump failures. I am sure they have redundancy but the system for that has to be super impressive because it can’t be offline long before you experience hardware failure!
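The urgency is easy to sanity-check with a rough estimate (all numbers below are my assumptions, not Google's) of how fast a chip package heats up once coolant flow stops, using dT/dt = P/(m·c_p):

```python
# Rough time-to-overheat if coolant flow stops. Assumed numbers (mine):
# a 1 kW chip package with ~500 g of copper cold plate attached,
# and ~30 C of thermal headroom before throttling or damage.

def seconds_to_delta_t(power_w: float, mass_kg: float,
                       c_p_j: float, delta_t_c: float) -> float:
    """dT/dt = P / (m * c_p); time to rise delta_t_c with no heat removal."""
    return delta_t_c * mass_kg * c_p_j / power_w

# Copper c_p is ~385 J/(kg*K).
print(f"{seconds_to_delta_t(1000, 0.5, 385, 30):.1f} s")  # ~5.8 s
```

Seconds of headroom, not minutes, so any pump or chiller failover has to be essentially seamless.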

[Edit: It was pointed out in another comment that AWS is doing this as well and honestly their pictures make it way clearer what is happening: https://www.aboutamazon.com/news/aws/aws-liquid-cooling-data...]

replies(5): >>45018536 #>>45018749 #>>45018898 #>>45019376 #>>45023339 #
ambicapter ◴[] No.45018536[source]
So every time they plug in a server they also plug in water lines?
replies(5): >>45018606 #>>45018608 #>>45018698 #>>45019151 #>>45020013 #
ajb ◴[] No.45018698[source]
I remember reading somewhere that they don't operate at the level of servers; if one dies they leave it in place until they're ready to replace the whole rack. Don't know if that's true now, though.

It does sound like connections do involve water lines though. As they are isolating different water circuits, in theory they could have a dry connection between heat exchanger plates, or one made through thermal paste. It doesn't sound like they're doing that though.

replies(2): >>45019104 #>>45019128 #
1. liquidgecka ◴[] No.45019128{3}[source]
It has not been true for a LONG time. That was part of Google's early “compute unit” strategy that involved things like sealed containers and such. Turns out that's not super efficient or useful, because you leave large swaths of hardware idle.

In my day we had software that would “drain” a machine and release it to hardware ops to swap the hardware. This could be a drive, memory, CPU, or a motherboard. If it was even slightly complicated they would ship it to Mountain View for diagnosis and repair. But every machine was expected to be cycled back into service as fast as possible.
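A hypothetical sketch of that drain-then-release cycle; the state names and functions here are my invention for illustration, not Google's actual tooling:

```python
# Hypothetical sketch of the drain -> release-to-hardware-ops cycle described
# above. All names and states are invented for illustration.
from enum import Enum, auto

class MachineState(Enum):
    SERVING = auto()
    DRAINING = auto()  # no new work scheduled; running work migrates off
    RELEASED = auto()  # handed to hardware ops for the swap

def drain_and_release(machine: dict) -> dict:
    """Stop scheduling onto the machine, migrate its tasks, then release it."""
    machine["state"] = MachineState.DRAINING
    # In reality each task would be rescheduled elsewhere before removal.
    machine["tasks"] = []
    machine["state"] = MachineState.RELEASED
    return machine

m = drain_and_release({"name": "m-001", "state": MachineState.SERVING,
                       "tasks": ["t1", "t2"]})
print(m["state"].name)  # RELEASED
```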

We did a disk upgrade on a whole datacenter that involved switching from 1TB to 2TB disks or something like that (I am dating myself), and minimizing total downtime was so important they hired temporary workers to work nights to get the swap done as quickly as possible. If I remember correctly that was part of the “holy cow gmail is out of space!” chaos, though, so there was added urgency.

replies(2): >>45022375 #>>45025880 #
2. throwaway2037 ◴[] No.45022375[source]

    > part of the “holy cow gmail is out of space!” chaos
This sounds like an interesting story. Can you share more details?
3. Cthulhu_ ◴[] No.45025880[source]
I'd love to work in a datacenter at that scale sometime. It sounds like working in a warehouse where you get a list of orders: servers to remove and pick up. But at the scale of the Googles et al, that's hundreds of server replacements a day, plus production lines of new servers being built while existing ones are repaired or decommissioned.

It's a fascinating industry, but only in my head as the only info you get about it is carefully polished articles and the occasional anecdote on HN, which is also carefully polished due to NDAs.