GlenTheMachine:
Space roboticist here.

As with a lot of things, it isn't the initial outlay, it's the maintenance costs. Terrestrial datacenters have parts fail and get replaced all the time. The mass analysis given here -- which appears quite good, at first glance -- doesn't include any mass, energy, or thermal system numbers for the infrastructure you would need in order to replace failed components.

As a first cut, this would require:

- an autonomous rendezvous and docking system

- a fully railed robotic system, e.g. some sort of robotic manipulator that can move along rails and reach every card in every server in the system, which usually means a system of relatively stiff rails running throughout the interior of the plant

- CPU, power, comms, and cooling to support the above

- importantly, the ability of the robotic servicing system to replace itself. In other words, it would need to be at least two-fault tolerant -- which usually means dual-wound motors, redundant gears, redundant harnesses, redundant power, comms, and compute. Alternatively, two or more independent robotic systems that are capable not only of replacing cards but also of replacing each other.

- regular launches containing replacement hardware

- ongoing ground support staff to deal with failures

The mass analysis also doesn't appear to include the massive number of heat pipes you would need to transfer the heat from the chips to the radiators. For an orbiting datacenter, that would probably be the single biggest mass allocation.
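
To put a very rough number on that last point, here's a back-of-the-envelope radiator sizing sketch in Python. Every input below (10 MW of server heat, radiator and sink temperatures, panel areal density) is an assumed illustrative value, not a figure from the analysis under discussion:

    # Back-of-the-envelope radiator sizing for an orbital datacenter.
    # Every number here is an assumption for illustration only.

    SIGMA = 5.670e-8           # Stefan-Boltzmann constant, W/m^2/K^4
    IT_LOAD_W = 10e6           # assumed 10 MW of server heat to reject
    T_RADIATOR_K = 330.0       # assumed radiator surface temperature (~57 C)
    T_SINK_K = 250.0           # assumed effective sink temperature in orbit
    EMISSIVITY = 0.90          # typical high-emissivity radiator coating
    AREAL_DENSITY_KG_M2 = 8.0  # assumed panel + heat-pipe mass per m^2

    # Net radiated flux per square metre of single-sided radiator.
    q_net = EMISSIVITY * SIGMA * (T_RADIATOR_K**4 - T_SINK_K**4)

    area_m2 = IT_LOAD_W / q_net
    mass_kg = area_m2 * AREAL_DENSITY_KG_M2

    print(f"net flux:      {q_net:8.1f} W/m^2")
    print(f"radiator area: {area_m2:8.0f} m^2")
    print(f"radiator mass: {mass_kg / 1000:8.1f} t")

On those assumptions the radiator alone comes out on the order of a couple hundred tonnes, before structure, pumps, or margin, which is consistent with heat rejection dominating the mass budget.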

vidarh:
I've had actual, real-life deployments in datacentres where we just left dead hardware in the racks until we needed the space, and we rarely did. Typically we'd visit a couple of times a year, because it was cheap to do so, but it would have been totally viable to let failures accumulate over a much longer time horizon.

Failure rates tend to follow a bathtub curve, so if you burn in the hardware before launch, you'd expect low failure rates for a long period. It's quite likely it'd be cheaper not to replace components at all, and instead ensure enough redundancy in the key systems (power, cooling, networking) that you could simply shut down and disable any dead servers, then replace the whole unit once enough parts have failed.
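
For anyone unfamiliar with it: the bathtub curve is usually modelled as the sum of a decreasing early-failure hazard, a roughly constant random-failure hazard, and an increasing wear-out hazard, with Weibull hazards as the standard building blocks. A toy Python sketch with made-up parameters, showing how a pre-launch burn-in moves you past the steep left wall of the curve:

    # Toy bathtub-curve model: total hazard = infant-mortality term (decreasing
    # Weibull hazard) + constant random-failure term + wear-out term (increasing
    # Weibull hazard). All parameters are made up for illustration.

    def weibull_hazard(t, shape, scale):
        """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1)."""
        return (shape / scale) * (t / scale) ** (shape - 1)

    def bathtub_hazard(t_hours):
        infant = weibull_hazard(t_hours, shape=0.5, scale=5_000)    # early failures
        random = 1.0e-6                                             # constant background rate per hour
        wearout = weibull_hazard(t_hours, shape=5.0, scale=80_000)  # late-life failures
        return infant + random + wearout

    # Burn-in before launch effectively starts the clock at t = burn_in hours,
    # skipping the steep left-hand wall of the bathtub.
    for burn_in in (0, 500, 2_000):
        t = burn_in + 1  # avoid the singularity at t = 0
        print(f"burn-in {burn_in:>5} h -> hazard {bathtub_hazard(t):.2e} failures/h")

With these made-up numbers, a 500-2,000 hour burn-in cuts the per-unit hazard by more than an order of magnitude, which is the whole argument for burning in before launch.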

rajnathani:
Exactly what I was thinking when the OP comment brought up "regular launches containing replacement hardware". This is easily solvable by "treating servers as cattle, not pets": over-provision servers and then replace the faulty ones around once per year.

Side note: thanks for sharing the "bathtub curve" -- TIL, and I'm surprised I hadn't heard of it before, especially since it's central to reliability engineering (searching HN via Algolia, no post about the bathtub curve has crossed 9 points).
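
For a sense of scale, here's a quick sketch of how you might size that over-provisioning, assuming independent failures, a hypothetical 3% annual failure rate after burn-in, and a hypothetical requirement of 1,000 active servers:

    # Rough over-provisioning estimate: how many servers to launch so the
    # cluster still meets capacity after a year with no servicing at all.
    # Fleet size and failure rate are hypothetical illustrative values.
    from math import comb

    def prob_capacity_met(n_total, n_required, p_fail_year):
        """P(at least n_required of n_total servers survive one year),
        treating failures as independent Bernoulli events."""
        p_survive = 1.0 - p_fail_year
        return sum(
            comb(n_total, k) * p_survive**k * p_fail_year**(n_total - k)
            for k in range(n_required, n_total + 1)
        )

    N_REQUIRED = 1_000   # servers the workload actually needs
    P_FAIL_YEAR = 0.03   # assumed 3% annual failure rate after burn-in

    n = N_REQUIRED
    while prob_capacity_met(n, N_REQUIRED, P_FAIL_YEAR) < 0.999:
        n += 1
    print(f"provision {n} servers ({n - N_REQUIRED} spares) for 99.9% confidence")

On those assumed numbers it comes out to roughly 5% spare capacity for a year of hands-off operation.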

dapperdrake:
Ah, the good old beta distribution.

Programming and CS people somehow rarely look at that.
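
The connection, as I read it: a Beta(a, b) density with both shape parameters below 1 is U-shaped, i.e. bathtub-shaped over a normalized lifetime. A tiny Python check with arbitrarily chosen parameters:

    # Beta(0.5, 0.5) density is high at both ends and low in the middle,
    # the same qualitative shape as a bathtub hazard over a normalized
    # lifetime t/T in (0, 1). Parameters chosen arbitrarily for illustration.
    from math import gamma

    def beta_pdf(x, a, b):
        """Density of the Beta(a, b) distribution on (0, 1)."""
        norm = gamma(a + b) / (gamma(a) * gamma(b))
        return norm * x ** (a - 1) * (1 - x) ** (b - 1)

    for x in (0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99):
        print(f"t/T = {x:4.2f}  density = {beta_pdf(x, 0.5, 0.5):6.2f}")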