
Use One Big Server (2022)

(specbranch.com)
343 points by antov825 | 28 comments
talles ◴[] No.45085392[source]
Don't forget the cost of managing your one big server and the risk of having such single point of failure.
replies(8): >>45085441 #>>45085488 #>>45085534 #>>45085637 #>>45086579 #>>45088964 #>>45090596 #>>45091993 #
1. Puts ◴[] No.45085534[source]
My experience after 20 years in the hosting industry is that customers in general have more downtime due to self-inflicted, over-engineered replication or split-brain errors than actual hardware failures. One server is the simplest and most reliable setup, and if you have backups and automated provisioning, you can just re-deploy your entire environment in less than the time it takes to debug a complex multi-server setup.

I'm not saying everybody should do this. There are of course a lot of services that can't afford even a minute of downtime. But there are also a lot of companies that would benefit from a simpler setup.

replies(7): >>45085607 #>>45085628 #>>45085635 #>>45086355 #>>45088375 #>>45088512 #>>45091645 #
2. ocdtrekkie ◴[] No.45085607[source]
My single on-premise Exchange server is drastically more reliable than Microsoft's massive globally resilient whatever Exchange Online, and it costs me a couple hours of work on occasion. I probably have half their downtime, and most of mine is scheduled when nobody needs the server anyhow.

I'm not a better engineer, I just have drastically fewer failure modes.

replies(1): >>45085642 #
3. motorest ◴[] No.45085628[source]
> My experience after 20 years in the hosting industry is that customers in general have more downtime due to self-inflicted over-engineered replication, or split brain errors than actual hardware failures.

I think you misread OP. "Single point of failure" doesn't mean the only failure modes are hardware failures. It means that if something happens to your node, whether it's a hardware failure, a power outage, someone stumbling over your power/network cable, or even a single service crashing, you have a major outage on your hands.

These types of outages are trivially avoided with a basic understanding of well-architected frameworks, which explicitly address the risk represented by single points of failure.

replies(1): >>45086005 #
4. talles ◴[] No.45085635[source]
I have also seen the opposite somewhat frequently: some team screws up the server, and unrelated stable services that have been running forever (on the same server) are now affected by the messed-up environment.
5. talles ◴[] No.45085642[source]
Do you develop and manage the server alone? It's quite a different reality when you have a big team.
replies(1): >>45086229 #
6. fogx ◴[] No.45086005[source]
Don't you think it's highly unlikely that someone will stumble over the power cable in a hosted datacenter like Hetzner? And even if they did, you could just run a provisioned secondary server that jumps in if the first becomes unavailable, and still be much cheaper.
replies(3): >>45086298 #>>45086456 #>>45089501 #
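The "provisioned secondary server that jumps in" pattern mentioned above is often done with plain VRRP failover rather than anything elaborate. A minimal keepalived sketch, where the interface name, router ID, and floating IP are all example values, not anything from this thread:

```conf
# /etc/keepalived/keepalived.conf on the primary (sketch; values are examples)
vrrp_instance VI_1 {
    state MASTER          # the secondary runs the same config with "state BACKUP"
    interface eth0        # NIC carrying the floating IP
    virtual_router_id 51  # must match on both machines
    priority 100          # the secondary uses a lower value, e.g. 90
    advert_int 1          # heartbeat interval in seconds
    virtual_ipaddress {
        203.0.113.10/24   # the floating IP that clients connect to
    }
}
```

If the primary stops sending VRRP advertisements, the secondary claims the floating IP within a few seconds; no load balancer or orchestration layer is involved.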
7. ocdtrekkie ◴[] No.45086229{3}[source]
Mostly myself but I am able to grab a few additional resources when needed. (Server migration is still, in fact, not fun!)
8. icedchai ◴[] No.45086298{3}[source]
It's unlikely, but it happens. In the mid-2000s I had some servers at a colo. They were doing electrical work and took out power to a bunch of racks, including ours. Those environments are not static.
9. jeffrallen ◴[] No.45086355[source]
Not to mention the other leading cause of outages: UPSes.

Sigh.

replies(1): >>45087404 #
10. toast0 ◴[] No.45086456{3}[source]
I don't know about Hetzner, but the failure case isn't usually tripping over power plugs. It's putting a longer server in the rack above/below yours and pushing the power plug out of the back of your server.

Either way, stuff happens. Figuring out your actual requirements around uptime, time to response, and time to resolution is important before you build a nine-nines solution when eight eights is sufficient. :p

replies(1): >>45090840 #
11. icedchai ◴[] No.45087404[source]
UPSes always seem to have strange failure modes. I've had a couple fail after a power failure. The batteries died and they wouldn't come back up automatically when the power came back. They didn't warn me about the dead battery until after...
replies(1): >>45088380 #
12. sgarland ◴[] No.45088375[source]
Yep. I know people will say, “it’s just a homelab,” but hear me out: I’ve run positively ancient Dell R620s in a Proxmox cluster for years. At least five. Other than moving them from TX to NC, the cluster has had 100% uptime. When I’ve needed to do maintenance, I drop one node at a time, and the cluster maintains quorum, as expected. I’ll reiterate that this is on circa-2012 hardware.

In all those years, I’ve had precisely one actual hardware failure: a PSU went out. They’re redundant, so nothing happened, and I replaced it.

Servers are remarkably resilient.

EDIT: 100% uptime modulo power failure. I have a rack UPS and a generator, but I once discovered the hard way that the UPS batteries couldn’t hold a charge long enough to keep the rack up while I brought the generator online.

replies(1): >>45088814 #
13. sgarland ◴[] No.45088380{3}[source]
That’s why they have self-tests. Learned that one the hard way myself.
replies(1): >>45092446 #
14. api ◴[] No.45088512[source]
A lot of this attitude comes from the bad old days of 1990s and early-2000s spinning disks. Those things failed a lot. It made everyone think you are going to have constant outages if you don’t cluster everything.

Today’s systems don’t fail nearly as often if you use high-quality hardware and don’t beat the absolute hell out of your SSDs. Another trick is to overprovision the SSD, which gives wear leveling more room to work and reduces write amplification.

Do that and a typical box will run years and years with no issues.
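As a back-of-the-envelope illustration of the overprovisioning point above (the drive sizes here are hypothetical, not from the thread): leaving extra space unwritten increases the spare area the controller can rotate writes through.

```python
def effective_spare_fraction(advertised_gb: float, factory_spare_gb: float,
                             user_reserved_gb: float) -> float:
    """Fraction of total NAND usable as spare area for wear leveling,
    assuming the user-reserved space is never written (e.g. left unpartitioned)."""
    total_nand = advertised_gb + factory_spare_gb
    spare = factory_spare_gb + user_reserved_gb
    return spare / total_nand

# Hypothetical 1 TB drive with 70 GB of factory overprovisioning:
print(round(effective_spare_fraction(1000, 70, 0), 3))    # factory only -> 0.065
print(round(effective_spare_fraction(1000, 70, 150), 3))  # +150 GB reserved -> 0.206
```

Roughly tripling the spare fraction this way spreads the same write load over more erase blocks, which is why the "years and years with no issues" outcome is plausible on overprovisioned drives.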

15. whartung ◴[] No.45088814[source]
Seeing as I love minor disaster anecdotes where doing all the "right things" seems to make no difference :).

We had a rack in data center, and we wanted to put local UPS on critical machines in the rack.

But the data center went on and on about their awesome power grid (shared with a fire station, so no administrative power loss), on site generators, etc., and wouldn't let us.

Sure enough, one day the entire rack went dark.

It was the power strip on the data center's rack that failed. All the backup grids in the world can't get through a dead power strip.

(FYI, a family member lost their home due to a power strip, so, again anecdotally, if you have any older power strips (5-7+ years) sitting under your desk at home, you may want to consider swapping them out for new ones.)

replies(1): >>45092579 #
16. motorest ◴[] No.45089501{3}[source]
> Don't you think it's highly unlikely that someone will stumble over the power cable in a hosted datacenter like Hetzner?

You're not getting the point. The point is that if you use a single node to host your whole web app, you are creating a system where many failure modes, which otherwise wouldn't even be an issue, can easily trigger high-severity outages.

> and even if, you could just run a provisioned secondary server (...)

Congratulations, you are no longer using "one big server", thus defeating the whole purpose behind this approach and learning the lesson that everyone doing cloud engineering work is already well aware of.

replies(1): >>45090616 #
17. juped ◴[] No.45090616{4}[source]
Do you actually think dead simple failover is comparable to elastic kubernetes whatever?
replies(1): >>45091327 #
18. kapone ◴[] No.45090840{4}[source]
> It's putting a longer server in the rack above/below yours and pushing the power plug out of the back of your server

Are you serious? Have you ever built/operated/wired rack scale equipment? You think the power cables for your "short" server (vs the longer one being put in) are just hanging out in the back of the rack?

Rack wiring has been done and done correctly for ages. Power cables on one side (if possible), data and other cables on the other side. These are all routed vertically and horizontally, so they land only on YOUR server.

You could put a Mercedes Maybach above/below your server and nothing would happen.

replies(1): >>45094297 #
19. motorest ◴[] No.45091327{5}[source]
> Do you actually think dead simple failover is comparable to elastic kubernetes whatever?

The reference to "elastic Kubernetes whatever" is a red herring. You can have a dead simple load balancer spreading traffic across multiple bare metal nodes.

replies(1): >>45092123 #
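For reference, the "dead simple load balancer" described above can be as little as an nginx upstream block; a minimal sketch, where the addresses and ports are placeholders:

```nginx
# nginx load balancing across bare metal nodes (sketch; addresses are examples)
upstream app_servers {
    server 10.0.0.11:8080;         # bare metal node 1
    server 10.0.0.12:8080;         # bare metal node 2
    server 10.0.0.13:8080 backup;  # only receives traffic if the others are down
}

server {
    listen 80;
    location / {
        proxy_pass http://app_servers;
    }
}
```

By default nginx round-robins across the listed servers and temporarily stops sending traffic to a node that fails to respond, so this sits well short of "elastic Kubernetes whatever" in complexity.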
20. Aeolun ◴[] No.45091645[source]
In my experience, my personal services have gone down exactly zero times. Actually, that's not entirely true: every time they stopped working, the servers had simply run out of disk space.

The number of production incidents on our corporate mishmash of lambda, ecs, rds, fargate, ec2, eks etc? It’s a good week when something doesn’t go wrong. Somehow the logging setup is better on the personal stuff too.

21. juped ◴[] No.45092123{6}[source]
Thanks for switching sides to oppose yourself, I guess?
replies(1): >>45099782 #
22. icedchai ◴[] No.45092446{4}[source]
My UPS was supposedly "self testing" itself periodically and it still happened!
replies(1): >>45092679 #
23. sgarland ◴[] No.45092579{3}[source]
For sure, things can and will go wrong. For critical services, I’d want to split them up into separate racks for precisely that reason.

Re: power strips, thanks for the reminder. I’m usually diligent about that, but forgot about one my wife uses. Replacement coming today.

24. sgarland ◴[] No.45092679{5}[source]
Oof, sorry.
25. toast0 ◴[] No.45094297{5}[source]
Yes I'm serious. My managed host took several of our machines offline when racking machines under/over ours. And they said it was because the new machines were longer and knocked out the power cables on ours.

We were their largest customer and they seemed honest even when they made mistakes that seemed silly, so we rolled our eyes and moved on with life.

Managed hosting means accepting that you can't inspect the racks and chide people for not cabling to your satisfaction. And mistakes by the managed host will impact your availability.

replies(1): >>45101233 #
26. motorest ◴[] No.45099782{7}[source]
> Thanks for switching sides to oppose yourself, I guess?

I'm baffled by your comment. Are you sure you read what I wrote?

27. kapone ◴[] No.45101233{6}[source]
I hope that "managed host" got fired in a heartbeat and you moved elsewhere. Because they don't know WTF they're doing. As simple as that.
replies(1): >>45105768 #
28. toast0 ◴[] No.45105768{7}[source]
We did eventually move elsewhere because of acquisition. Of course those guys didn't even bother to run LACP and so our systems would regularly go offline for a bit whenever someone wanted to update a switch. I was a lot happier at the host that sometimes bumped the power cables.

Firing a host where you've got thousands of servers is easier said than done. We did do a quote exercise with another provider that could have supported us, and it didn't end up very competitive ... and it wouldn't have been worth the transition. Overall, there were some derpy moments, but I don't think we would have been happier anywhere else, and we didn't want to rent cages and run our own servers.