
Google's Liquid Cooling

(chipsandcheese.com)
399 points by giuliomagnifico
jonathaneunice ◴[] No.45017586[source]
It’s very odd to hear Google-scale data center design compared to PC hobbyist rigs when mainframes (S/3x0, Cray, yadda yadda) have been extensively water-cooled for over 50 years, and super-dense HPC data centers have used liquid cooling for at least 20. Selective amnesia + a laughably off-target point of comparison.
replies(6): >>45017651 #>>45017716 #>>45018092 #>>45018513 #>>45018785 #>>45021044 #
spankalee ◴[] No.45017716[source]
From the article:

> Liquid cooling is a familiar concept to PC enthusiasts, and has a long history in enterprise compute as well.

And for a while the trend in data centers was toward more passive cooling at the individual servers and hotter operating temperatures. This is interesting because it reverses that trend substantially, possibly because of the per-row cooling.

replies(2): >>45018019 #>>45018087 #
dekhn ◴[] No.45018087[source]
We've basically been watching Google gradually re-discover all the tricks of supercomputing (and other high performance areas) over the past 10+ years. For a long time, websearch and ads were the two main drivers of Google's datacenter architecture, along with services like storage and jobs like mapreduce. I would describe the approach as "horizontal scaling with statistical multiplexing for load balancing".

That style of job worked well, but as Google has realized it has more high performance computing with unique, mission-critical workload characteristics (https://cloud.google.com/blog/topics/systems/the-fifth-epoch...), their infrastructure has had to undergo a lot of evolution to adapt.

Google PR has always been full of "look, we discovered something important and new and everybody should do it," often for things that were effectively solved using that approach a long time ago. MapReduce is a great example of that: Google certainly didn't invent the concepts of Map or Reduce, or even the idea of using them for high-throughput computing (and the shuffle phase of MapReduce is more "interesting" from a high performance computing perspective than mapping or reducing anyway).
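
To make that concrete, here's a toy word-count sketch (mine, not Google's implementation) built from nothing but the map and reduce primitives that long predate the paper:

    from functools import reduce

    docs = ["the quick brown fox", "the lazy dog"]

    # "map" phase: each document emits (word, 1) pairs
    mapped = map(lambda doc: [(w, 1) for w in doc.split()], docs)

    # "shuffle + reduce" phase: group by key and sum the counts
    def merge(acc, pairs):
        for word, count in pairs:
            acc[word] = acc.get(word, 0) + count
        return acc

    print(reduce(merge, mapped, {}))
    # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}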

replies(6): >>45018386 #>>45018588 #>>45018809 #>>45019953 #>>45020485 #>>45021776 #
liquidgecka ◴[] No.45018386[source]
As somebody who worked on Google data centers after coming from the high performance computing world, I can categorically say that Google is not “re-learning” old technology. In the early days (when I was there) they focused heavily on moving from thinking about computers to thinking about compute units. This is where containers and self-contained data centers came from. (This was actually a joke inside Google, because it failed but was copied by all the other vendors for years after Google had given up on it.) They then moved from thinking of cooling as something that happens within a server case to something that happens to a whole facility. This was the first major leap forward: they went from cooling the facility and pushing conditioned air in, to cooling the air immediately behind the server.

Liquid cooling at Google scale is different from mainframe cooling as well. Mainframes needed to move heat from the core out to the edges of the server, where traditional data center cooling would carry it away to be conditioned. Google's liquid cooling moves the heat completely outside the building while it's still in the liquid. As far as I am aware, that's never been done before, at least not at this scale.
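
As a rough back-of-the-envelope sketch of why carrying the heat out as liquid matters (illustrative numbers, not Google's): water holds roughly 3,500x more heat per unit volume than air, so the flow needed to move a given load is tiny by comparison.

    # Q = rho * V_dot * c_p * dT  ->  V_dot = Q / (rho * c_p * dT)
    def flow_m3_per_s(heat_w, delta_t_k, rho, cp):
        """Volumetric flow needed to absorb heat_w watts with a delta_t_k rise."""
        return heat_w / (rho * cp * delta_t_k)

    HEAT_W = 100_000   # hypothetical 100 kW row of racks
    DT_K = 10          # assumed 10 K coolant temperature rise

    water = flow_m3_per_s(HEAT_W, DT_K, rho=998, cp=4186)   # liquid water
    air   = flow_m3_per_s(HEAT_W, DT_K, rho=1.2, cp=1005)   # ambient air

    print(f"water: {water * 1000:.1f} L/s")  # ~2.4 L/s
    print(f"air:   {air:.1f} m^3/s")         # ~8.3 m^3/s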

replies(4): >>45018554 #>>45018655 #>>45018697 #>>45022759 #
mattofak ◴[] No.45018554[source]
It's possible it never made it into production, but when I was helping to commission a 4-rack "supercomputer" circa 2010 we used APC's in-row cooling (which did glycol exchange to the outside but still maintained the hot/cold aisle), and I distinctly remember reading a whitepaper about racks with built-in water cooling and the problems with pressure loss, dripless connectors, and corrosion. I no longer recall whether the direct cooling loop exited the building or just cycled within the rack to an adjacent secondary heat exchanger. (And I don't remember if it was an APC whitepaper or from some other integrator.)

There have also been all the fun experiments with dunking whole servers into oil, but I'll grant that there, again, I've only seen setups described with secondary cooling loops, probably because of corrosion and wanting to avoid contaminants.

replies(1): >>45019594 #
1. bri3d ◴[] No.45019594{3}[source]
The parent poster is either extremely confidently wrong or talking about a very different project from the one in the linked article. Here's an article from 2005 whose Figure 1 dates from (according to the article) 1965 (!!) and shows the same CDU architecture shown in the chipsandcheese article: https://www.electronics-cooling.com/2005/08/liquid-cooling-i...

I do think Google must be doing something right, as their quoted PUE numbers are very strong, but nothing in the linked chipsandcheese article seems architecturally groundbreaking at all; it's strong micro-optimization. The article talks a lot about good plate/thermal interface design, good water flow management, the use of active flow control valves, and a ton of iteration at scale to find the optimal CDU-to-hardware ratio, but at the end of the day it's exactly the same thing as the diagram from 1965.
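
Since PUE keeps coming up, here's a minimal sketch of what the metric actually measures (made-up figures for illustration, not Google's):

    # PUE = total facility power / IT equipment power; 1.0 means zero overhead,
    # and cooling is usually the biggest contributor above that.
    def pue(it_kw, cooling_kw, other_overhead_kw):
        return (it_kw + cooling_kw + other_overhead_kw) / it_kw

    # Hypothetical air-cooled vs. liquid-cooled halls with the same IT load.
    print(f"air-cooled:    {pue(1000, 350, 50):.2f}")  # 1.40
    print(f"liquid-cooled: {pue(1000, 60, 50):.2f}")   # 1.11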

replies(1): >>45028577 #
2. liquidgecka ◴[] No.45028577[source]
Yeah, I totally missed the CDU. I thought this was a project a hardware person had talked to me about a few years ago, one with no intermediate transfer loop, and when I read the article I completely missed the section between the images. Rack-level water cooling is interesting, and I'm sure they are doing some really cool bits with it, but it's not as revolutionary as the zero-transfer system I thought they were describing. I updated the comment to call out my error and dial back my excitement. =/

[I am still annoyed at how many people are dismissive of Google's datacenter work simply because "servers have been water cooled before," which completely misses the point of datacenter-level cooling. I also learned that AWS is doing this already, along with some elements of OVH.] =)