
Use One Big Server (2022)

(specbranch.com)
343 points by antov825 | 2 comments
runako ◴[] No.45085915[source]
One of the more detrimental aspects of the Cloud Tax is that it constrains the types of solutions engineers even consider.

Picking an arbitrary price point of $200/mo, you can get 4(!) vCPUs and 16GB of RAM at AWS. Architectures differ, etc., but that is roughly a mid-spec dev laptop from 5 or so years ago.

At Hetzner, you can rent a machine with 48 cores and 128GB of RAM for the same money. It's hard to overstate how far apart these machines are in raw computational capacity.
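
To make the gap concrete, a rough back-of-envelope comparison using the figures above (the $200/4-vCPU AWS shape is an assumption about on-demand pricing, not a quote from either provider):

    # Rough per-unit cost comparison from the numbers quoted above.
    aws = {"monthly_usd": 200, "vcpus": 4, "ram_gb": 16}        # assumed on-demand shape
    hetzner = {"monthly_usd": 200, "vcpus": 48, "ram_gb": 128}  # dedicated box per the comment

    for name, box in (("AWS", aws), ("Hetzner", hetzner)):
        print(f"{name}: ${box['monthly_usd'] / box['vcpus']:.2f}/vCPU-month, "
              f"${box['monthly_usd'] / box['ram_gb']:.2f}/GB-month")

That works out to roughly $50 vs $4 per vCPU-month, an order-of-magnitude gap before any performance differences are considered.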

There are approaches to problems that make sense with 10x the capacity that don't make sense on the much smaller node. Critically, those approaches can sometimes save engineering time that would otherwise go into building a more complex system to manage around artificial constraints.

Yes, there are other factors like durability etc. that need to be designed for. But going the other way, dedicated boxes can deliver more consistent performance without worries of noisy neighbors.

replies(11): >>45086252 #>>45086272 #>>45086760 #>>45087388 #>>45088476 #>>45089414 #>>45091154 #>>45091413 #>>45092146 #>>45092305 #>>45095302 #
benreesman ◴[] No.45092305[source]
In 2025, if you need convenience and no red tape, you've got fly.io in the general case, and maybe Vercel or something similar if you're on a particular framework (there are some good options for specific stacks).

If your needs go beyond that? Then you need real computers with real configuration and you have OVH/Hetzner/Latitude who will rent you MONSTER machines for the cost of some cheap-ass surplus 2017 Intel on The Cloud.

And if you just want a blog or whatever? Zillion VPS options.

The traditional cloud in 2025 is for regulatory/process/corruption-capture extraction: from what I've seen, its case on machine economics and developer productivity is fucking zero. Maybe there's some edge case where a completely unencumbered team is better off with DMV-trip permissions theatre, remnant Intel racked with noisy neighbors at a massive markup, and no support recourse.

replies(1): >>45097639 #
1. nine_k ◴[] No.45097639[source]
(1) How does fly.io reliability compare to AWS, GCP, or maybe Linode or DO?

(2) What do you do if your large Hetzner server starts to show signs of malfunction? How soon would you be able to replace it, and how easily?

(2a) What do you do when your large Hetzner server just dies? I see that this happens rarely, but what's your contingency plan, if any?

(3) What do you do when your load is highly spiky? Do you reserve bare metal capacity for the biggest peak you expect to serve, because it's so much cheaper than running an elastic serverless architecture of the same capacity anyway?

(4) Considering that your stack still includes many components, how do you manage them, and how expensive is the management overhead? Do you need an extra SRE?

These are not rhetorical questions; I'd love to hear from real practitioners! (E.g. Stack Overflow used to do deep dives into their few-big-servers architecture.)

replies(1): >>45098647 #
2. runako ◴[] No.45098647[source]
These are great questions.

A key factor underlying all of this is understanding, from a business/organizational perspective, your actual uptime requirements. Google may aim at 5 nines with the budget to achieve it, but many banks have routine planned downtime. If you don't know your objectives, you will have trouble making the tradeoffs necessary to get there. As a hypothetical, would your business choose 99.999% uptime (26 seconds of downtime on average per month) over 99.99% (4.3 minutes) if that caused infra costs to rise by 50% or more? If you said you could cut infra costs by 50% by planning a short weekly maintenance window, how would that resonate?
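
For reference, those downtime numbers are just arithmetic on the availability target; a quick sketch:

    # Downtime allowed in a 30-day month at a given availability target.
    SECONDS_PER_MONTH = 30 * 24 * 3600

    for target in (0.999, 0.9999, 0.99999):
        downtime_s = SECONDS_PER_MONTH * (1 - target)
        print(f"{target:.3%} uptime -> {downtime_s / 60:.1f} min ({downtime_s:.0f} s) down per month")

This prints about 43 minutes for three nines, 4.3 minutes for four nines, and 26 seconds for five nines per month.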

Speaking to a few, in my experience:

2) (not at Hetzner specifically, but at a dedicated host). You have backups & recovery plans, and redundancy where it makes sense. You might run your database with a replica. If you are serving Web traffic, maybe you keep a hot spare. Also, you are still allowed to use cloud services where they make sense, so you can back up to S3 and use things like SQS or KMS if you don't want to run them yourself. It's worth noting that you may not get advance notice; I recall our service being impacted by a fire at a datacenter that IIRC was caused by a traffic accident on a nearby highway. The point is you have to design resilience into the system. Fortunately, this is well-trod ground.
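
As one minimal sketch of the "back up to S3 from the dedicated box" idea (the database name, bucket, and paths below are placeholders, not a hardened backup pipeline):

    # Minimal sketch: dump a Postgres database on the dedicated box and ship it
    # to S3 as an off-site backup. "appdb" and the bucket name are placeholders.
    import datetime
    import subprocess

    import boto3

    BUCKET = "example-offsite-backups"  # placeholder bucket name
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    dump_path = f"/tmp/appdb-{stamp}.sql.gz"

    # Dump and compress locally, then upload; run from cron or a systemd timer.
    subprocess.run(f"pg_dump appdb | gzip > {dump_path}", shell=True, check=True)
    boto3.client("s3").upload_file(dump_path, BUCKET, f"postgres/appdb-{stamp}.sql.gz")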

It would not be a terrible failover option to have something like an autoscaling group at AWS ready to step in if the dedicated cluster goes offline. Keep that group scaled to zero until it's needed. Put the cloud behind your cheap dedicated capacity.
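
A minimal sketch of that watchdog, assuming a placeholder Auto Scaling group name and health-check URL; the scaling logic and instance count are illustrative only:

    # Sketch: keep an AWS Auto Scaling group at zero instances and scale it up
    # only when the dedicated cluster stops answering its health check. The
    # group name, URL, and instance count are placeholders, not a tested design.
    import boto3
    import requests

    ASG_NAME = "failover-web"                   # placeholder
    HEALTH_URL = "https://example.com/healthz"  # health endpoint on the dedicated cluster

    def dedicated_cluster_healthy() -> bool:
        try:
            return requests.get(HEALTH_URL, timeout=5).status_code == 200
        except requests.RequestException:
            return False

    def ensure_failover(desired_when_down: int = 4) -> None:
        desired = 0 if dedicated_cluster_healthy() else desired_when_down
        boto3.client("autoscaling").set_desired_capacity(
            AutoScalingGroupName=ASG_NAME,
            DesiredCapacity=desired,
            HonorCooldown=False,
        )

    if __name__ == "__main__":
        ensure_failover()  # run periodically from a watchdog on a separate machine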

3) See above. In my case, we over-provisioned because it was cheap to do so. I did not do this at the time, but today I would probably look at running a replicated database with a hot standby on another server.
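
A small sketch of monitoring such a standby, assuming psycopg2 on the monitoring host and placeholder connection details:

    # Sketch: check how far a Postgres hot standby has fallen behind by asking
    # the standby when it last replayed a transaction. Connection details are
    # placeholders; note the value also grows when the primary is simply idle.
    import psycopg2

    with psycopg2.connect("host=standby.internal dbname=appdb user=monitor") as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT now() - pg_last_xact_replay_timestamp()")
            print(f"standby replay lag: {cur.fetchone()[0]}")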

4) It has not been my experience that "modern" cloud deployments require fewer SRE resources. Like water running downhill, cloud projects seek complexity.