
195 points by tosh | 30 comments
1. shivak ◴[] No.42208324[source]
> > The power shelf distributes DC power up and down the rack via a bus bar. This eliminates the 70 total AC power supplies found in an equivalent legacy server rack within 32 servers, two top-of-rack switches, and one out-of-band switch, each with two AC power supplies

This creates a single point of failure, trading robustness for efficiency. There's nothing wrong with that, but software/ops might have to accommodate by making the opposite tradeoff. In general, the cost savings advertised by cloud infrastructure should be assessed more holistically.

replies(6): >>42208347 #>>42208722 #>>42208748 #>>42208751 #>>42208787 #>>42208961 #
2. walrus01 ◴[] No.42208347[source]
The whole thing with eliminating 70 discrete 1U server size AC-to-DC power supplies is nothing new. It's the same general concept as the power distribution unit in the center of an open compute platform rack design from 10+ years ago.

Everyone who's doing serious datacenter stuff at scale knows that one of the absolute least efficient, labor intensive and cabling intensive/annoying ways of powering stuff is to have something like a 42U cabinet with 36 servers in it, each of them with dual power supplies, with power leads going to a pair of 208V 30A vertical PDUs in the rear of the cabinet. It gets ugly fast in terms of efficiency.

The single point of failure isn't really a problem as long as the software is architected to be tolerant of the disappearance of an entire node (mapping to a single motherboard that is a single or dual cpu socket config with a ton of DDR4 on it).

replies(2): >>42208552 #>>42209470 #
3. formerly_proven ◴[] No.42208552[source]
That’s one reason why 2U4N systems are kinda popular. 1/4 the cabling in legacy infrastructure.
4. dralley ◴[] No.42208722[source]
>This creates a single point of failure, trading robustness for efficiency. There's nothing wrong with that, but software/ops might have to accommodate by making the opposite tradeoff.

I'll happily take a single high-quality power supply (which may have internal redundancy FWIW) over 70 much more cheaply made power supplies that stress other parts of my datacenter via sheer inefficiency, and also cost more in aggregate. Nobody drives down the highway with 10 spare tires for their SUV.

replies(3): >>42208819 #>>42208878 #>>42209079 #
5. jsolson ◴[] No.42208748[source]
The bus bar itself is an SPoF, but it's also just dumb copper. That doesn't mean that nothing can go wrong, but it's pretty far into the tail of the failure distribution.

The power shelf that keeps the busbar fed will have multiple rectifiers, often with at least N+1 redundancy so that you can have a rectifier fail and swap it without the rack itself failing. Similar things apply to the battery shelves.

replies(1): >>42208826 #
6. sidewndr46 ◴[] No.42208751[source]
This isn't even remotely close. Unless all 32 servers have redundant AC power feeds present, you've traded one single point of failure for another single point of failure.

In the event that all 32 servers had redundant AC power feeds, you could just install a pair of redundant DC power feeds.

replies(1): >>42209061 #
7. MisterTea ◴[] No.42208787[source]
> This creates a single point of failure,

Who told you there is only one PSU in the power shelf?

8. hn-throw ◴[] No.42208819[source]
Let's say your high-quality supply's yearly failure rate is 100 times lower than the cheap ones'.

With 70 cheap supplies, the probability of at least one failure in a year is 1 - (1 - r)^70, which is quite high. The probability of all 70 going down at once is r^70, which is absurdly low.

Let's say r = 0.05, i.e. one failed supply in every twenty per year. Then:

1 - (1 - r)^70 ≈ 97%, while r^70 < 1e-91.

The high-quality supply has r = 0.0005, which sits between those two extremes. If your code can handle node failure, very many cheaper supplies appear to be the more robust option.

(Assuming uncorrelated events. YMMV)
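
For concreteness, here's that arithmetic as a quick Python sketch (same assumed rates as above, failures treated as independent):

    # Back-of-the-envelope comparison, assuming independent failures and the
    # example rates above (r = 0.05/year per cheap PSU, 100x better for the
    # single high-quality PSU).
    r_cheap = 0.05
    r_good = r_cheap / 100
    n = 70

    p_any_fail = 1 - (1 - r_cheap) ** n   # at least one of the 70 cheap PSUs fails
    p_all_fail = r_cheap ** n              # every one of the 70 cheap PSUs fails

    print(f"P(>=1 of {n} cheap PSUs fails) = {p_any_fail:.1%}")   # ~97.2%
    print(f"P(all {n} cheap PSUs fail)     = {p_all_fail:.1e}")   # ~8.5e-92
    print(f"P(single good PSU fails)       = {r_good:.2%}")       # 0.05%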

replies(1): >>42208883 #
9. immibis ◴[] No.42208826[source]
It's also plausible to have multiple power supplies feeding the same bus bar in parallel (if they're designed to support this) e.g. one at each end of a row.
replies(1): >>42209260 #
10. fracus ◴[] No.42208878[source]
No one drives down the highway with one tire either.
replies(1): >>42208899 #
11. carlhjerpe ◴[] No.42208883{3}[source]
Yeah, but the failure rate of an analog piece of copper is pretty low; it'll keep being copper unless you do something stupid. You'll have multiple power supplies providing power on the same piece of copper.
replies(1): >>42209249 #
12. AcerbicZero ◴[] No.42208899{3}[source]
Careful, unicyclists are an unforgiving bunch.
13. sunshowers ◴[] No.42208961[source]
Look very carefully at the picture of the rack at https://oxide.computer/ :) there are two power shelves in the middle, not one.

We're absolutely aware of the tradeoffs here and have made quite considered decisions!

14. gruez ◴[] No.42209061[source]
>Unless all 32 servers have redundant AC power feeds present, you've traded one single point of failure for another single point of failure.

Is this not standard? I vaguely remember that rack servers typically have two PSUs for this reason.

replies(3): >>42209141 #>>42209235 #>>42209457 #
15. shivak ◴[] No.42209079[source]
A DC busbar can propagate a short circuit across the rack, and DC circuit protection is harder than AC. So of course each server now needs its own current limiter, or a cheap fuse.

But I’m not debating the merits of this engineering tradeoff - which seems fine, and pretty widely adopted - just its advertisement. The healthcare industry understands the importance of assessing clinical endpoints (like mortality) rather than surrogate measures (like lab results). Whenever we replace “legacy” with “cloud”, it’d be nice to estimate the change in TCO.

replies(1): >>42209567 #
16. glitchcrab ◴[] No.42209141{3}[source]
It's highly dependent on the individual server model and quite often how you spec it too. Most 1U Dell machines I worked with in the past only had a single slot for a PSU, whereas the beefier 2U (and above) machines generally came with 2 PSUs.
replies(1): >>42209325 #
17. sidewndr46 ◴[] No.42209235{3}[source]
You could have 15 PSUs in a server. It doesn't mean they have redundant power feeds.
18. hn-throw ◴[] No.42209249{4}[source]
TL;DR: isn't there a single, shared DC supply feeding said piece of copper? Presumably connected to mains?

Or are they running on SOFCs?

replies(1): >>42209634 #
19. eaasen ◴[] No.42209260{3}[source]
This is how our rack works (Oxide employee). In each power shelf, there are 6 power supplies and only 5 need to be functional to run at full load. If you want even more redundancy, you can use both power shelves with independent power feeds to each so even if you lose a feed, the rack still has 5+1 redundant power supplies.
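For a rough sense of what 5+1 redundancy buys, here's a small sketch with a made-up per-rectifier failure probability (not an Oxide figure):

    # Minimal sketch: chance that a 5+1 shelf drops below full load, i.e. that
    # 2 or more of its 6 rectifiers are down at once. Assumes independent
    # failures and a hypothetical per-rectifier failure probability p.
    from math import comb

    def p_below_full_load(p, n=6, need=5):
        # shelf is short when fewer than `need` rectifiers are still working
        return sum(comb(n, k) * (1 - p) ** k * p ** (n - k) for k in range(need))

    p = 0.01  # hypothetical failure probability of one rectifier
    print(f"one rectifier down:        {p:.2%}")                     # 1.00%
    print(f"5+1 shelf below full load: {p_below_full_load(p):.3%}")  # ~0.146%
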
20. thfuran ◴[] No.42209325{4}[source]
But 2 PSUs plugged into the same AC supply still have a single point of failure.
replies(1): >>42211740 #
21. jeffbee ◴[] No.42209457{3}[source]
Rack servers have two PSUs because enterprise buyers are gullible and will buy anything. Generally what happens in case of a single PSU failure is that the other PSU also fails, or it asserts PROCHOT, which means that instead of a cleanly hard-down server you have a slow server derping along at 400MHz, which is worse in every possible way.
22. jeffbee ◴[] No.42209470[source]
PDUs are also very failure-prone and not worth the trouble.
23. malfist ◴[] No.42209567{3}[source]
DC circuit protection is absolutely not harder than AC. DC has the advantage that current flows in only one direction, not two.
replies(1): >>42209948 #
24. mycoliza ◴[] No.42209634{5}[source]
The big piece of copper is fed by redundant rectifiers. Each power shelf has six independent rectifiers which are 5+1 redundant if the rack is fully loaded with compute sleds, or 3+3 redundant if the rack is half-populated. Customers who want more redundancy can also have a second power shelf with six more rectifiers.
replies(1): >>42210664 #
25. paddy_m ◴[] No.42209948{4}[source]
Which makes it much harder to break the circuit vs AC
replies(1): >>42210571 #
26. wbl ◴[] No.42210571{5}[source]
At 48 volts arcing shorts aren't the concern.
27. hn-throw ◴[] No.42210664{6}[source]
I'm going to assume this is on 3 phase power, but how is the ripple filtered?
replies(1): >>42211584 #
28. applied_heat ◴[] No.42211584{7}[source]
Inductors and capacitors usually
29. glitchcrab ◴[] No.42211740{5}[source]
Which is why you have two separate PDUs in the rack, fed by different power feeds, and connect the server's two PSUs to opposing PDUs.
replies(1): >>42217051 #
30. growse ◴[] No.42217051{6}[source]
This works brilliantly, right up to the point where your A side fails and every single server suddenly doubles its demand on B.

Better have good capacity management so you don't go over 100% on B when that happens! (I've seen it happen and take a DC out).
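
As a toy illustration with made-up numbers (hypothetical feed capacity and load, not from the thread):

    # Each dual-PSU server normally splits its draw across feeds A and B, so
    # losing feed A roughly doubles the draw on feed B. If feeds run above
    # 50% of capacity, a single feed failure cascades into an overload.
    feed_capacity_kw = 100.0        # hypothetical capacity of each feed
    steady_load_per_feed_kw = 60.0  # hypothetical steady-state draw per feed

    load_on_b = 2 * steady_load_per_feed_kw  # draw on feed B after A fails
    print(f"Feed B after A fails: {load_on_b / feed_capacity_kw:.0%} of capacity")  # 120%
    # To ride through a feed loss, steady-state draw per feed has to stay
    # below 50% of that feed's capacity.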