Even for their own internal use in their data centers they'd have to save an absolute boat load on power and cooling given their performance per watt compared to legacy stuff.
Maybe it becomes a big enough profit center to matter. Maybe. But it risks taking focus away, splitting attention from the mission they're on today: building end-user systems.
Maybe they build them for themselves. For what upside? Somewhat better compute efficiency, maybe, but I think if you have big workloads the massive AMD Turin super-chips are going to be incredibly hard to beat.
It's hard to overstate just how efficient AMD is, with 192 very high performance cores on a 350-500W chip.
But, yeah, my three 18MW/y racks agree that more power efficiency would be nice, it's just that Rewrite It In (Safe) Rust is unlikely to help with that...
In particular, while I'd enjoy such a device, Apple's whole thing is whole-system integration and charging a premium for it. I'm not sure the markets that want to sell people access to Apple CPUs will pay a premium for a 1U over shoving multiple Mac Minis into the same 1U footprint, especially if they've already been doing that for years at this point...
...I might also speculate that if they did this, they'd have a serious problem: if they're buying exclusive access to all of TSMC's newest fab capacity for extended intervals to meet demand for their existing products, they'd have trouble finding supply to meet a potentially substantial demand from people wanting their machines for dense compute. (They could always opt to lag the server platforms a generation behind on an older fab that's less contested, of course, but that feels like self-sabotage if they're already competing with people shoving Mac Minis in a rack, and now the Mac Minis get to be a generation ahead, too?)
Lenovo's DLC systems use 45 degrees C water to directly cool the power supplies and the servers themselves (water goes through them) for > 97% heat transfer to water. In cooler climates, you can just pump this to your drycoolers, and in winter you can freecool them with just air convection.
Yes, the TDP doesn't go down, but cooling efficiency shoots up considerably, cutting cooling costs and bringing PUE down to around 1.03. You can put a tremendous amount of compute or GPU power in one rack and cool it efficiently.
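To put that PUE figure in perspective, here's a rough sketch of the overhead arithmetic (the 1 MW IT load and the 1.5 comparison PUE are assumptions for illustration, not figures from the comment above):

    // Overhead power = IT load * (PUE - 1). Both the IT load and the
    // 1.5 "conventional" PUE are illustrative assumptions.
    fn main() {
        let it_load_kw = 1000.0; // assume a 1 MW IT load
        for pue in [1.5_f64, 1.03] {
            let overhead_kw = it_load_kw * (pue - 1.0);
            println!("PUE {:.2}: ~{:.0} kW of cooling/overhead on top of {:.0} kW of IT load",
                     pue, overhead_kw, it_load_kw);
        }
    }

At a PUE of 1.03 the non-IT overhead is roughly 30 kW per MW of IT load, versus roughly 500 kW at a PUE of 1.5.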
Every chassis handles its own power, but IIRC all the in-chassis electricity is DC, and the PSUs are extremely efficient.
I didn't see any mention of Rust in the article?
At one point, for many years, it would just sometimes fail to `exec()` a process. This showed up as a random failure on our build farm about once or twice a month, surfacing as "/bin/sh: fail to exec binary file": the error returned by the kernel made libc fall back to trying to run the binary as a script, as is normal for a Unix, but it isn't a script.
This likely stems from their exiting the server business years ago and focusing on consumer appeal more than robustness (see various terrible releases, security- and stability-wise).
(I'll grant that macOS has many features that would make it a great server OS, but it's just not polished enough in that direction)
Just going up to 60mm or 80mm standard-size DC fans can be a huge efficiency increase in watt-hours spent per cubic meter of air moved per hour.
I am extremely skeptical of the "12x" but using larger fans is more efficient.
From the linked URL:
> Bigger fans = bigger efficiency gains Oxide server sleds are designed to a custom form factor to accommodate larger fans than legacy servers typically use. These fans can move more air more efficiently, cooling the systems using 12x less energy than legacy servers, which each contain as many as 7 fans, which must work much harder to move air over system components.
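For what it's worth, the classic fan affinity laws give a back-of-envelope way to see where a number in that ballpark could come from. This is a scaling sketch only: it assumes geometrically similar fans moving the same airflow and ignores static pressure and real duct design, so it doesn't validate the 12x figure, just shows it isn't crazy:

    // Fan affinity laws for geometrically similar fans:
    //   airflow Q ~ N * D^3,  power W ~ N^3 * D^5
    // Holding airflow constant, a bigger fan spins slower:
    //   N2/N1 = (D1/D2)^3  =>  W2/W1 = (D1/D2)^4
    // Ignores delivered static pressure; order-of-magnitude only.
    fn main() {
        let d1_mm = 40.0_f64; // assumed small "legacy" fan diameter
        let d2_mm = 80.0_f64; // assumed larger fan diameter
        let power_ratio = (d1_mm / d2_mm).powi(4);
        println!("Same airflow, {}mm vs {}mm: ~{:.0}x less fan power under ideal scaling",
                 d2_mm, d1_mm, 1.0 / power_ratio);
    }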
It seems that Oxide was founded in 2019, and the Open Compute Project had been specifying DC bus bars for 6 years at that point. People could purchase racks if they wanted, but it seems like, by and large, people didn't care enough to go whole hog on it.
Wonder if the economics have changed or if it's still just neat but won't move the needle.
Veering off-topic: did you know macOS is a certified Unix?
https://www.opengroup.org/openbrand/register/brand3581.htm
As I recall, Apple advertised macOS as a Unix without such certification, got sued, and then scrambled to implement the required features to get certification as a result. Here's the story as told by the lead engineer of the project:
https://www.quora.com/What-goes-into-making-an-OS-to-be-Unix...
This creates a single point of failure, trading robustness for efficiency. There's nothing wrong with that, but software/ops might have to accommodate by making the opposite tradeoff. In general, the cost savings advertised by cloud infrastructure should be more holistic.
Everyone who's doing serious datacenter stuff at scale knows that one of the absolute least efficient, labor intensive and cabling intensive/annoying ways of powering stuff is to have something like a 42U cabinet with 36 servers in it, each of them with dual power supplies, with power leads going to a pair of 208V 30A vertical PDUs in the rear of the cabinet. It gets ugly fast in terms of efficiency.
The single point of failure isn't really a problem as long as the software is architected to be tolerant of the disappearance of an entire node (mapping to a single motherboard that is a single or dual cpu socket config with a ton of DDR4 on it).
In general, the telco world concept hasn't changed much. You have AC grid power coming from your local utility into some BIG ASS RECTIFIERS which create -48VDC (and are responsible for charging your BIG ASS BATTERY BANK to float voltage), then various DC fuses/breakers going to distribution of -48VDC bus bars powering the equipment in a CO.
Re: Open Compute, the general concept of what they did was go to a bunch of 1U/2U server power supply manufacturers and get them to make a series of 48VDC-to-12VDC power supplies (which can be 92%+ efficient), and cut out the need for legacy 5VDC feed from power supply into ATX-derived-design x86-64 motherboards.
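As a rough illustration of why cutting conversion stages matters: chained efficiencies multiply, so every stage you remove (or improve) compounds. The stage figures below are assumptions for illustration, not vendor numbers:

    // Overall efficiency of a power chain is the product of its stages.
    // All stage efficiencies here are illustrative assumptions.
    fn overall(stages: &[f64]) -> f64 {
        stages.iter().product()
    }

    fn main() {
        // e.g. double-conversion UPS -> per-server AC PSU
        let legacy = overall(&[0.94, 0.90]);
        // e.g. rectifier to -48VDC -> 48V-to-12V shelf supply
        let dc_bus = overall(&[0.96, 0.92]);
        println!("legacy chain: ~{:.1}% end-to-end", legacy * 100.0);
        println!("DC bus chain: ~{:.1}% end-to-end", dc_bus * 100.0);
    }

A few percent of difference is real money and real heat at rack and datacenter scale.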
* big ass solid copper busbars
* huge gauge copper cables going around a central office (google "telcoflex IV")
* big DC breaker/fuse panels
* specialized DC fuse panels for power distribution at the top of racks, using little tiny fuses
* 100% overhead steel ladder rack type cable trays, since your typical telco CO was never a raised floor type environment (UNLIKE legacy 1960s/1970s mainframe computer rooms), so all the power was kept accessible by a team of people working on stepladders.
The same general thing continues today in serious telco/ISP operations, with tech features to bring it into the modern era. The rectifiers are modular now, and there's also rectiverters. Monitoring is much better. People are moving rapidly away from wet cell 2V lead acid battery banks and AGM sealed lead acid stuff to LiFePo4 battery systems.
DC fuse panels can come with network-based monitoring and the ability to turn devices on/off remotely.
Equipment is a whole lot less power hungry now; a telco CO that has decommed a 5ESS will find itself with a ton of spare thermal and power budget.
When I say serious telco stuff is a lot less power hungry, it's by huge margins. A randomly chosen example from radio transport equipment: back in the day, a powerful, very expensive point-to-point microwave radio system might be a full 42U rack, 800W in load, with waveguide going out to antennas on a roof. It would carry one, two or three DS3s' worth of capacity (45 Mbps each).
Now, that same telco might have a radio on its CO roof in the same microwave bands with 1.3 Gbps FDD capacity, pure Ethernet with an SFP+ fiber interface built in, and the whole radio is a 40W electrical load. The radio is mounted directly on the antenna, with UV/IR-resistant weatherproof 16-gauge DC power cable running down into the CO and plugged into a fuse panel.
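Working the margin out from those numbers (capacities and loads as stated above; treat it as a rough comparison):

    // Throughput per watt, old vs new microwave transport, using the
    // figures from the comment: 3 x DS3 (45 Mbps each) on ~800 W,
    // vs 1.3 Gbps on ~40 W.
    fn main() {
        let old_mbps_per_w = (3.0 * 45.0) / 800.0; // ~0.17 Mbps/W
        let new_mbps_per_w = 1300.0 / 40.0;        // ~32.5 Mbps/W
        println!("old: {:.2} Mbps/W, new: {:.1} Mbps/W, ~{:.0}x improvement",
                 old_mbps_per_w, new_mbps_per_w, new_mbps_per_w / old_mbps_per_w);
    }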
It's stupid, but that's why we all have jobs.
Stop running so much useless stuff.
Also maybe ARM over x86_64 and similar power-efficiency-oriented hardware.
Rack-level system design, or at least power & cooling design, is certainly also a reasonable thing to do. But standardization is probably important here, rather than some bespoke solution which only one provider/supplier offers.
> How can organizations keep pace with AI innovation as existing data centers run out of available power?
Waste less energy on LLM chatbots?
If you're Amazon or Google, you can do this stuff yourself. If you're a normal company, you probably won't have the in-house expertise.
On the other hand, Oxide sells a turnkey IaaS platform that you can just roll off the pallet, plug in and start using immediately. You only need to pay one company, and you have one company to yell at if something goes wrong.
You can buy a rack of 1-2U machines from Dell, HPE or Cisco with VMware or some other HCI platform, but you don't get that power efficiency or the really nice control plane Oxide have on their platform.
The goal is that you can email Oxide and they'll be able to fix it regardless of where it is in the stack, even down to the processor ROM.
I'll happily take a single high-quality power supply (which may have internal redundancy, FWIW) over 70 much more cheaply made power supplies that stress other parts of my datacenter via sheer inefficiency, and also cost more in aggregate. Nobody drives down the highway with 10 spare tires for their SUV.
The power shelf that keeps the busbar fed will have multiple rectifiers, often with at least N+1 redundancy so that you can have a rectifier fail and swap it without the rack itself failing. Similar things apply to the battery shelves.
Where you would otherwise give all 32 servers redundant AC power feeds, you can instead just install a pair of redundant DC power feeds.
(no affiliation, just a fan)
The probability of at least a single failure is 1-(1-r)^70. This is quite high even without considering the higher quality of the one supply. The probability of all 70 going down is r^70, which is absurdly low.

Let's say r = 0.05, i.e. one failed supply out of every 20 in a year:

  1-(1-r)^70 = 97%, while r^70 < 1E-91

The high-quality supply has r = 0.0005, in between "no failure among 70" and "all 70 failing". If your code can handle node failure, very many cheaper supplies appear to be more robust.

(Assuming uncorrelated events. YMMV)
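A quick sketch of that arithmetic (same illustrative r values as above, independence assumed):

    // P(at least one of n cheap supplies fails) = 1 - (1 - r)^n
    // P(all n cheap supplies fail at once)      = r^n
    fn main() {
        let n = 70;
        let r_cheap = 0.05_f64;  // assumed yearly failure probability per cheap supply
        let r_good = 0.0005_f64; // assumed yearly failure probability of the good supply

        let p_any = 1.0 - (1.0 - r_cheap).powi(n);
        let p_all = r_cheap.powi(n);

        println!("P(>=1 of 70 cheap supplies fails) = {:.3}", p_any);  // ~0.972
        println!("P(all 70 cheap supplies fail)     = {:.1e}", p_all); // ~8e-92
        println!("P(the single good supply fails)   = {:.4}", r_good);
    }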
There are third party rack mounts available for the Mac Mini and Mac Studio also.
After engineers have the power of implementation and de-implementation, they need to step into dirty politics and bend other people's views.
It's either theirs or ours. Win-win is a fallacy.
I've read and understood from Joyent and SmartOS that they believe fault-tolerant block devices/filesystems are the wrong abstraction: your software should handle losing storage.
We're absolutely aware of the tradeoffs here and have made quite considered decisions!
We need either Apple to get into the general-market server business, or someone else to start designing CPUs as well as Apple does (based on the comparison between different ARM cores, I'm not sure it really matters which specific architecture they use).
Is this not standard? I vaguely remember that rack servers typically have two PSUs for this reason.
But I’m not debating the merits of this engineering tradeoff - which seems fine, and pretty widely adopted - just its advertisement. The healthcare industry understands the importance of assessing clinical endpoints (like mortality) rather than surrogate measures (like lab results). Whenever we replace “legacy” with “cloud”, it’d be nice to estimate the change in TCO.
What would you classify Shopify as?
> One existing Oxide user is e-commerce giant Shopify, which indicates the growth potential for the systems available.
* https://blocksandfiles.com/2024/07/04/oxide-ships-first-clou...
Their CEO has tweeted about it:
* https://twitter.com/tobi/status/1793798092212367669
> Who writes the security mitigations for speculative execution bugs? Who patches CVEs in the shipped software which doesn't use Rust?
Oxide.
This is all a pre-canned solution: just use the API like you would an off-prem cloud. Do you worry about AWS patching stuff? And how many people purchasing 'traditional' servers from Dell/HPE/Lenovo worry about patching things like the LOM?
Further, all of Oxide's stuff is on GitHub, so you're in better shape for old stuff, whereas if the traditional server vendors EO(S)L something firmware-wise, you have no recourse.
A big part of what we're offering our customers is the promise that there's one vendor who's responsible for everything in the rack. We want to be the responsible party for all the software we ship, whether it's firmware, the host operating system, the hypervisor, and everything else. Arguably, the promise that there's one vendor you can yell at for everything is a more important differentiator for us than any particular technical aspect of our hardware or software.
One of the Bluesky team members posted about their requirements earlier this month, and why Oxide isn't a great fit for them at the moment:
They do build it for themselves. From their security blog:
"The root of trust for Private Cloud Compute is our compute node: custom-built server hardware that brings the power and security of Apple silicon to the data center, with the same hardware security technologies used in iPhone, including the Secure Enclave and Secure Boot. We paired this hardware with a new operating system: a hardened subset of the foundations of iOS and macOS tailored to support Large Language Model (LLM) inference workloads while presenting an extremely narrow attack surface. This allows us to take advantage of iOS security technologies such as Code Signing and sandboxing."
I'd rather they used more standardized open-source software like Linux, Talos, k8s, Ceph, and KubeVirt, instead of rolling it all themselves on an OS with a very small niche ecosystem.
But they target large orgs; I wish a solution like this were accessible to smaller companies.
I wish I could throw their stack on my second-hand COTS hardware, rent a few U's in two colos for geo-redundancy, and cry tears of happiness each month realizing how much money we save on public cloud costs while still having cloud capabilities/benefits.
But as I’ve grown in my career I’ve actually found that line of thinking refreshing. Can you quantify benefit? If it requires too many assumptions it’s probably not worth it.
But then again, there’s always the VP or the SVP who wants to “showcase his towers’ innovative spirit”, and then there goes money that could be used for better things. The innovative spirit of the day is random LLM apps.
Windows is also abysmal but it hasn't stopped people from using it.
But yes, it is too much of a desktop OS.
Our early customers include government, finance, and places like Shopify.
You’re not wrong that some places may prefer older companies but that doesn’t mean they all do.
Illumos isn't really directly relevant to the customer; it's a non-user-facing implementation detail.
We provide security updates.
The real showstopper for years has been that ARM servers are just not prepared to be a proper platform. U-Boot with a grudgingly included FDT (after getting kicked out of the Linux kernel) does not make a proper platform, and often there's also no BMC, plus unique approaches to various parts that make the server that one annoying weirdo in the data center, etc.
Cloud providers can spend the effort to backfill the necessary features with custom parts, but doing so on your own on-prem is hard.
Oh and we write a lot of Typescript too.
But Oxide's reason to exist is to keep the memory of cool Sun racks running Solaris alive forever.
This is what I meant by "don't disclose". I didn't mean that Oxide was in any way secretive, but that usually this stuff doesn't get agreed, and that it would make more sense to ask the customer rather than the company doing the selling, as Oxide won't want to disclose unless there's already an agreement in place (formal or otherwise).
The value they're offering is that the rack-level consumption and management is improved over the competition, but you should be able to run whatever you want on the actual compute, k8s or whatnot.
This also means you'd not be forever reliant on Oxide.
Oof.
I could not be less convinced by this information that this is a useful indicator for the other 99.999999999% of computing needs.
>We learned that Oxide has so far shipped “under 20 racks,” which illustrates the selective markets its powerful systems are aimed at.
>B&F understands most of those systems were deployed as single units at customer sites. Therefore, Oxide hopes these and new customers will scale up their operations in response to positive outcomes.
Yikes. If they sold 20 racks in July, how many are they up to now?
https://www.supermicro.com/solutions/Solution-Brief-Supermic...
I recall them offering older versions of the specs but can't easily find a reference, so I might be wrong about how accessible they were.
Basically an "opinionated" combination of Dell, Arista, and Pure storage with a special Azure AKS running on top and a metric ton of management and orchestration smarts. The target customer base was telcos who needed local capabilities in their data centers and who might otherwise have gone to OCP.
As far as I can surmise, it's dead, but not EOLed. Microsoft nuked the operator business unit earlier in the year, and judging by recent job postings from contract shops, AT&T might be the only customer.
I'm not hands-on familiar with other serious ARM server market players but for several years now Ampere ARM server CPUs at least are nothing like you describe. Phoronix says it best in https://www.phoronix.com/review/linux-os-ampereone
> All the Linux distributions I attempted worked out effortlessly on this Supermicro AmpereOne server. Like with Ampere Altra and Ampere eMAG before that, it's a seamless AArch64 Linux experience. Thanks to supporting open standards like UEFI, Arm SBSA/SBBR and ACPI and not having to rely on DeviceTrees or other nuisances, installing an AArch64 Linux distribution on Ampere hardware is as easy as in the x86_64 space.
(And for that matter, Oracle's proprietary Solaris seems better maintained than I ever expected, though in this context I think the open source fork is the relevant thing to look at.)
Not necessarily a bad choice; after all, for what shall it profit a man, if he shall gain the whole world, and lose his own soul?
[0] https://www.theregister.com/2024/11/18/llnl_oxide_compute/
You may very well not need the system that we have built, but lots of people do -- and the price point versus the alternatives (public cloud or on-prem commodity HW + pretty pricey SW) has proven to be pretty compelling. I don't know if we'll ever have a product that hits your price point (which sounds like... the cost of Gigabyte plus a few thousand bucks?), but at least the software is all open source!
As for Apple's "uniqueness", I've met a lot of people who think that Apple "just" has a much better design team, when it's similar to what you say: the unique part is them being able to properly narrow their design space instead of chasing cost-conscious manufacturers.
So I totally agree with your go-to-market comment, because it’s also a bet against cloud.
I wish them luck though.
I don't get this marketing angle. I've made arguments here before that the cost of compute from an energy perspective is often negligible. If Google Maps, for example, can save you 1 mile due to better routing, then that is several orders of magnitude more efficient [1].
If it uses less resources, it uses less resources. Everybody (businesses and individuals) loves that.
[1]: https://news.ycombinator.com/threads?id=huijzer&next=4206549...
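Roughly the arithmetic behind that comparison; both the per-mile and per-request figures below are order-of-magnitude assumptions, not measurements:

    // Energy to drive one mile vs energy for one routing request,
    // both rough assumptions for an order-of-magnitude comparison.
    fn main() {
        let kwh_per_mile_driven = 1.0; // assume ~1 kWh per mile for a gasoline car
        let wh_per_request = 1.0;      // assume ~1 Wh of server energy per request
        let ratio = kwh_per_mile_driven * 1000.0 / wh_per_request;
        println!("Saving one driven mile ~ {:.0}x the energy of one request", ratio);
    }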