199 points angadh | 33 comments
GlenTheMachine ◴[] No.44393313[source]
Space roboticist here.

As with a lot of things, it isn't the initial outlay, it's the maintenance costs. Terrestrial datacenters have parts fail and get replaced all the time. The mass analysis given here -- which appears quite good, at first glance -- doesn't include any mass, energy, or thermal system numbers for the infrastructure you would need in order to replace failed components.

As a first cut, this would require:

- an autonomous rendezvous and docking system

- a fully railed robotic system, e.g. some sort of robotic manipulator that can move along rails and reach every card in every server in the system, which usually means a system of relatively stiff rails running throughout the interior of the plant

- CPU, power, comms, and cooling to support the above

- importantly, the ability of the robotic servicing system to replace itself. In other words, it would need to be at least two-fault tolerant -- which usually means dual-wound motors, redundant gears, redundant harness, redundant power, comms, and compute. Alternatively, two or more independent robotic systems that are capable of not only replacing cards but also of replacing each other.

- regular launches containing replacement hardware

- ongoing ground support staff to deal with failures

The mass analysis also doesn't appear to include the massive number of heat pipes you would need to transfer the heat from the chips to the radiators. For an orbiting datacenter, that would probably be the single biggest mass allocation.
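For a rough sense of the radiator problem, here's a back-of-the-envelope Stefan-Boltzmann sketch; the waste-heat figure, radiator temperature, sink temperature, and emissivity are illustrative assumptions, not numbers from the article:

    # Radiated power per unit area: q = emissivity * sigma * (T_rad^4 - T_sink^4)
    SIGMA = 5.670e-8  # Stefan-Boltzmann constant, W / (m^2 K^4)

    def radiator_area_m2(waste_heat_w, t_radiator_k=330.0, t_sink_k=250.0, emissivity=0.9):
        """Area of an ideal double-sided radiator needed to reject waste_heat_w."""
        q_per_m2 = emissivity * SIGMA * (t_radiator_k**4 - t_sink_k**4)
        return waste_heat_w / (2 * q_per_m2)  # factor of 2: both faces radiate

    # A hypothetical 1 MW facility with ~330 K radiators and a 250 K effective sink:
    print(f"{radiator_area_m2(1e6):,.0f} m^2 of radiator")  # ~1,200 m^2

Multiply that area by the areal density of the panels, plus the heat pipes or pumped-fluid loops feeding them, and the mass adds up quickly.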

replies(17): >>44393394 #>>44393436 #>>44393528 #>>44393553 #>>44393882 #>>44394969 #>>44395311 #>>44395355 #>>44396009 #>>44396843 #>>44397057 #>>44397975 #>>44398392 #>>44398563 #>>44406204 #>>44410213 #>>44414799 #
1. vidarh ◴[] No.44395355[source]
I've had actual, real-life deployments in datacentres where we just left dead hardware in the racks until we needed the space, and we rarely did. Typically we'd visit a couple of times a year, because it was cheap to do so, but it'd have been totally viable to let failures accumulate over a much longer time horizon.

Failure rates tend to follow a bathtub curve, so if you burn in the hardware before launch, you'd expect low failure rates for a long period. It's quite likely it'd be cheaper not to replace components at all: ensure enough redundancy for the key systems (power, cooling, networking), just shut down and disable any dead servers, and replace the whole unit once enough parts have failed.
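To illustrate the shape of that argument, a toy bathtub hazard model (all parameters are invented purely for illustration):

    import math

    # Toy bathtub hazard: infant mortality that burn-in retires, a flat
    # random-failure floor, and late-life wear-out. Parameters are made up.
    def hazard_per_year(t_years,
                        infant_peak=2.0, infant_tau=0.1,   # early failures fade over weeks
                        random_rate=0.03,                   # flat bottom of the bathtub
                        wear_scale=8.0, wear_shape=4.0):    # Weibull wear-out, ~8 years
        infant = infant_peak * math.exp(-t_years / infant_tau)
        wear = (wear_shape / wear_scale) * (t_years / wear_scale) ** (wear_shape - 1)
        return infant + random_rate + wear

    # Burning in on the ground means spending the steep left side of the
    # curve somewhere you can still swap parts cheaply:
    for t in (0.0, 0.25, 1, 4, 8, 10):
        print(f"year {t:>4}: ~{hazard_per_year(t):.3f} failures/server/year")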

replies(8): >>44395420 #>>44395725 #>>44396217 #>>44397041 #>>44397169 #>>44398004 #>>44398178 #>>44398724 #
2. rajnathani ◴[] No.44395420[source]
Exactly what I was thinking when the OP comment brought up "regular launches containing replacement hardware": this is easily solvable by "treating servers as cattle and not pets" -- over-provision servers and then replace the faulty ones roughly once a year.
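A rough sketch of the spare-pool arithmetic; the fleet size, failure rate, and confidence level are hypothetical:

    import math

    def spares_needed(n_servers, annual_failure_rate, confidence=0.99):
        """Smallest spare count s with P(failures in a year <= s) >= confidence,
        modelling failures as Poisson with mean n_servers * annual_failure_rate."""
        mean = n_servers * annual_failure_rate
        s, term = 0, math.exp(-mean)
        cdf = term
        while cdf < confidence:
            s += 1
            term *= mean / s
            cdf += term
        return s

    # 1000 servers at a 3% annualised failure rate, topped up once a year:
    print(spares_needed(1000, 0.03))  # low 40s, i.e. roughly 4% over-provisioning

This only covers server-level attrition; shared power and cooling still need their own redundancy.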

Side note: thanks for sharing the "bathtub curve" -- TIL, and I'm surprised I hadn't heard of it before, especially since it's so closely tied to reliability engineering (searching HN via Algolia, no post about the bathtub curve has crossed 9 points).

replies(2): >>44396389 #>>44397559 #
3. Coffeewine ◴[] No.44395725[source]
It would be interesting to see if the failure rate across time holds true after a rocket launch and time spent in space. My guess is that it wouldn’t, but that’s just a guess.
replies(1): >>44396097 #
4. vidarh ◴[] No.44396097[source]
I think it's likely the overall rate would be higher, and you might find you need more aggressive burn-in, but even then you'd need an extremely high failure rate before it's more efficient to replace components than to write them off.
replies(1): >>44397349 #
5. asah ◴[] No.44396217[source]
serious q: how much extra failure rate would you expect from the physical transition to space?

on one hand, I imagine you'd rack things up so the whole rack etc. moves into space as one unit; OTOH there's still movement and things "shaking loose", plus the vibration and acceleration of the flight and the loss of gravity...

replies(2): >>44396509 #>>44396866 #
6. btown ◴[] No.44396389[source]
https://accendoreliability.com/the-bath-tub-curve-explained/ is an interesting breakdown of bath tub curve dynamics for those curious!
replies(1): >>44396903 #
7. schmidtleonard ◴[] No.44396509[source]
Yes, an orbital launch probably resets the bathtub to some degree.
8. lumost ◴[] No.44396866[source]
I suspect the thermal system would look very different from a terrestrial one. Fans and connectors can shake loose - and fans do nothing in space anyway.

Perhaps the server would be immersed in a thermally conductive resin to stop parts shaking loose? If the thermals are taken care of by fixed heat pipes and external radiators, non-thermally-conductive resins could be used.

replies(3): >>44400012 #>>44400133 #>>44426364 #
9. rtkwe ◴[] No.44396903{3}[source]
Wonder if you could game that by burning in the components on the ground before launch, or whether the launch itself would cause a big enough spike of vibration damage that it's not worth it.
replies(2): >>44397554 #>>44397555 #
10. VectorLock ◴[] No.44397041[source]
The original article even addresses this directly. Plus, hardware turns over fast enough that you'll simply be replacing modules containing a smattering of dead servers with entirely new generations anyway.
replies(1): >>44397598 #
11. TheOtherHobbes ◴[] No.44397169[source]
The analysis has zero redundancy for either servers or support systems.

Redundancy is a small issue on Earth, but completely changes the calculations for space because you need more of everything, which makes the already-unfavourable space and mass requirements even less plausible.

Without backup cooling and power one small failure could take the entire facility offline.

And active cooling - which is a given at these power densities - requires complex pumps and plumbing which have to survive a launch.

The whole idea is bonkers.

IMO you'd be better off thinking about a swarm of cheaper, simpler, individual serversats or racksats connected by a radio or microwave comms mesh.

I have no idea if that's any more economic, but at least it solves the most obvious redundancy and deployment issues.

replies(4): >>44397239 #>>44397634 #>>44397680 #>>44399942 #
12. conradev ◴[] No.44397239[source]
Many small satellites also increase the surface area available for cooling
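A purely geometric sketch of that effect (it ignores deployable radiator panels, which change the picture in practice):

    # Splitting one body of fixed total volume into n equal pieces multiplies
    # the total surface area by n^(1/3) (compare n equal spheres to one sphere
    # of the same combined volume).
    def area_scaling(n_pieces: int) -> float:
        return n_pieces ** (1.0 / 3.0)

    for n in (1, 8, 64, 1000):
        print(f"{n:>4} satellites -> {area_scaling(n):.1f}x the body surface area")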
replies(1): >>44401864 #
13. MobiusHorizons ◴[] No.44397349{3}[source]
The bathtub curve isn't the same for all components of a server, though. Writing off the entire server because a single RAM chip, SSD, or network card failed would limit the entire server to the lifetime of its weakest part. I think you would want redundant hot spares of the components with the lowest mean time between failures.
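A toy series-system model of that point; the per-component MTBF figures are invented for illustration:

    # A server is (roughly) a series system: any critical part failing takes it
    # down, so failure rates add and the weakest part dominates (exponential
    # lifetimes assumed; MTBF figures below are made up).
    mtbf_hours = {"ssd": 1_500_000, "dimm": 4_000_000, "nic": 6_000_000, "board": 8_000_000}

    server_rate = sum(1.0 / m for m in mtbf_hours.values())
    print(f"server MTBF ~ {1.0 / server_rate:,.0f} h")

    # With a redundant SSD pair (no in-orbit repair), the pair lasts until both
    # members fail; for independent exponentials, E[max of two] = 1.5x one MTBF.
    # Treating the pair as a single slower-failing part is a crude approximation:
    paired_rate = server_rate - 1.0 / mtbf_hours["ssd"] + 1.0 / (1.5 * mtbf_hours["ssd"])
    print(f"with a mirrored SSD ~ {1.0 / paired_rate:,.0f} h")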
replies(1): >>44397526 #
14. vidarh ◴[] No.44397526{4}[source]
We do often write off an entire server when a single component fails, because the lifetime of even the shortest-lived components is usually long enough that, even on Earth with easy access, it's often not worth the cost of attempting a repair. In an easy-to-access data centre, the parts most likely to get replaced are hot-swappable drives or power supplies, but it's been about two decades since I last worked anywhere that anyone bothered to check for failed RAM or failed CPUs to salvage a server. And a lot of servers don't have network devices you can replace without soldering, and haven't for a long time outside of really high-end networking.

And at sufficient scale, once you plan for that, you can massively simplify the servers. A server case built for hot-swapping drives adds a massive amount of waste if you're never actually going to use that capability.

15. vidarh ◴[] No.44397554{4}[source]
I suspect you'd absolutely want to burn in before launch, maybe even simulating some mechanical stress to "shake out" more issues, but it's a valid question how much burn-in is worth doing before launch versus after.
replies(1): >>44398453 #
16. dapperdrake ◴[] No.44397555{4}[source]
Maybe they are different types of failure modes. Solar panel semiconductors hate vibration.

And then, there is of course radiation trouble.

So those two kinds of burn-in require a launch to space anyway.

17. dapperdrake ◴[] No.44397559[source]
Ah, the good old beta distribution.

Programming and CS people somehow rarely look at that.

18. dapperdrake ◴[] No.44397598[source]
Really? Even radiation hardened hardware? Aren’t there way higher size floors on the transistors?
19. tessierashpool ◴[] No.44397634[source]
Even a swarm of satellites has risk factors. We treat space as if it were empty (it's in the name), but there's debris left over from previous missions. This stuff orbits at very high velocity, so if an object greater than 10 cm is projected to pass within a couple of kilometers of the ISS, they move the ISS out of the way. They did this in April, and it happens about once a year.

The more satellites you put up there, the more often that happens, and the greater the risk that the immediate orbital zone around Earth devolves into an impenetrable whirlwind of space trash, aka Kessler syndrome.

20. vidarh ◴[] No.44397680[source]
> The analysis has zero redundancy for either servers or support systems.

The analysis is a third party analysis that among other things presumes they'll launch unmodified Nvidia racks, which would make no sense. It might be this means Starcloud are bonkers, but it might also mean the analysis is based on flawed assumptions about what they're planning to do. Or a bit of both.

> IMO you'd be better off thinking about a swarm of cheaper, simpler, individual serversats or racksats connected by a radio or microwave comms mesh.

Other than against physical strikes, this would get you significantly less redundancy than building the same redundancy into a single unit and controlling what feeds what, the same way we have smart, redundant power supplies and cooling in every data center (and in the racks they're talking about using as the basis).

If power and cooling die faster than the servers, you'd either need to overprovision or shut down servers to compensate, but it's certainly not all or nothing.
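A minimal sketch of that trade-off, with made-up loop counts and sizing:

    # Shared cooling/power provisioned as k loops: losing loops forces shedding
    # servers rather than losing the whole facility. Sizing here is illustrative.
    def supportable_load(n_loops: int, loops_failed: int, per_loop_fraction: float) -> float:
        """Fraction of full server load the surviving loops can carry (capped at 1)."""
        return min(1.0, (n_loops - loops_failed) * per_loop_fraction)

    # Four loops, each sized for a third of the load (i.e. N+1): the first
    # failure costs nothing, the second means shutting down a third of the servers.
    for failed in range(3):
        print(f"{failed} loop(s) down -> {supportable_load(4, failed, 1/3):.0%} of load")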

21. 4ndrewl ◴[] No.44398004[source]
A new meaning to the term "space junk"
22. drewg123 ◴[] No.44398178[source]
I'd naively assume that the stress of launch (vibration, G-forces) would trigger failures in hardware that had been working on the ground. So I'd expect to see a large-ish number of failures on initial bringup in space.
replies(2): >>44398286 #>>44398477 #
23. Retric ◴[] No.44398286[source]
Electronics can be extremely resilient to vibration and g forces. Self guided artillery shells such as the M982 Excalibur include fairly normal electronics for GPS guidance. https://en.wikipedia.org/wiki/M982_Excalibur
24. 0xffff2 ◴[] No.44398453{5}[source]
Vibration testing is a completely standard part of space payload pre-flight testing. You would absolutely want to vibe-test (no, not that kind) at both a component level and fully integrated before launch.
replies(1): >>44400645 #
25. 0xffff2 ◴[] No.44398477[source]
Vibration testing on the ground is a standard part of pre-launch spacecraft testing. This would trigger most (though not all) vibration/G-force-related failures on the ground rather than during the actual launch.
replies(1): >>44399208 #
26. geon ◴[] No.44398724[source]
Yes. I think I read a blog post from Backblaze about running their red Storage Pod rack-mounted chassis some 10 years ago.

They would just keep the failed drives in the chassis. Maybe swap out the entire chassis if enough drives died.

27. rtkwe ◴[] No.44399208{3}[source]
The big question mark is how many failures you cause and catch in that first cycle, versus how much extra wear you're putting on the components that pass the test the first time and don't get replaced.
28. aerophilic ◴[] No.44399942[source]
There is a neat solution to the thermal problem that York Space Systems has been advocating (based on Russian tech): put everything in an enclosure.

https://www.yorkspacesystems.com/

Short version: make a giant pressure vessel and keep things at 1 atm. Circulate air like you would on Earth. Yes, there is still plenty of excess heat you need to radiate, but it dramatically simplifies things.

29. kevin_thibedeau ◴[] No.44400012{3}[source]
Connectors have to survive the extreme vibration of a rocket launch. Parts routinely shake off boards in testing, even when using non-COTS, space-rated packaging designed for extreme environments. That amplifies the cost of everything.

The Russians are the only ones who package their unmanned platform electronics in pressure vessels. Everyone else operates in vacuum, so no fans.

30. 83 ◴[] No.44400133{3}[source]
>>immersed in a thermally conductive resin

sounds heavy

31. btown ◴[] No.44400645{6}[source]
PSA: do not vibe-code the hardware controller for your vibration testing rig. This does not pass the vibe test.
32. AtlasBarfed ◴[] No.44401864{3}[source]
Like a neo-fractal surface? There's no atmosphere to wear it down.
33. vidarh ◴[] No.44426364{3}[source]
The racks mentioned in the analysis are liquid-cooled.