So you want to build your own data center

(blog.railway.com)

596 points | dban | 4 comments | 17 Jan 25 20:41 UTC

dban ◴[17 Jan 25 20:47 UTC] No.42743076[source]▶

>>42743019 (OP) #

This is our first post about building out data centers. If you have any questions, we're happy to answer them here :)

replies(3): >>42743553 #>>42744958 #>>42752255 #

1. gschier ◴[17 Jan 25 21:38 UTC] No.42743553[source]▶

>>42743076 #

How do you deal with drive failures? How often does a Railway team member need to visit a DC? What's it like inside?

replies(1): >>42743681 #

2. justjake ◴[17 Jan 25 21:55 UTC] No.42743681[source]▶

>>42743553 (TP) #

Everything is dual-redundant. We run RAID, so a single drive failure is fine: alerting pages on-call, who dispatch remote hands onsite, and we keep spares for everything in each datacenter.

replies(1): >>42743725 #

3. gschier ◴[17 Jan 25 22:00 UTC] No.42743725[source]▶

>>42743681 #

How much additional overhead is there in managing bare metal vs. cloud? Is it mostly fine after the big initial setup effort?

replies(1): >>42743901 #

4. ca508 ◴[17 Jan 25 22:22 UTC] No.42743901{3}[source]▶

>>42743725 #

We built some internal tooling to help manage the hosts. Once a host is onboarded onto it, it's a few button clicks on an internal dashboard to provision a QEMU VM. We made a custom Ansible inventory plugin so we can manage these VMs the same way we manage machines on GCP.
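For readers unfamiliar with dynamic inventories, here is a rough sketch of the idea, not the plugin described above. It uses the simpler JSON format that Ansible accepts from inventory scripts rather than a real inventory plugin class, and the VM records (`name`, `group`, `address`) are hypothetical stand-ins for whatever the internal tooling actually stores.

```python
import json


def build_inventory(vms):
    """Turn a list of VM records into Ansible's script-inventory JSON shape.

    Each record is a dict with hypothetical keys: name, group, address.
    """
    inventory = {"_meta": {"hostvars": {}}}
    for vm in vms:
        # Create the group on first sight, then append the host to it.
        group = inventory.setdefault(vm["group"], {"hosts": []})
        group["hosts"].append(vm["name"])
        # ansible_host tells Ansible which address to connect to.
        inventory["_meta"]["hostvars"][vm["name"]] = {
            "ansible_host": vm["address"],
        }
    return inventory


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    example_vms = [
        {"name": "vm-a", "group": "metal_vms", "address": "10.0.0.5"},
        {"name": "vm-b", "group": "metal_vms", "address": "10.0.0.6"},
    ]
    print(json.dumps(build_inventory(example_vms), indent=2))
```

Because the output has the same shape regardless of where a machine lives, the same playbooks can target bare-metal VMs and cloud instances alike.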

The host runs a custom daemon that programs FRR (an OSS routing stack) so that it advertises the addresses assigned to a VM to the rest of the cluster via BGP. As a result, no configuration of network switches etc. is required after initial setup.
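As a minimal sketch of what "programming FRR" could look like (the comment does not say how the daemon actually talks to FRR), the snippet below renders the configuration commands that would announce each VM address as a host route and wraps them into a `vtysh -c ...` invocation. The ASN `65001`, the addresses, and the function names are all hypothetical. Depending on FRR settings, a `network` statement may only be announced if a matching route exists in the routing table.

```python
import ipaddress


def advertise_commands(asn, addresses):
    """Render FRR config commands announcing each address as a host route."""
    by_family = {4: [], 6: []}
    for addr in addresses:
        # A bare address is parsed as a /32 (IPv4) or /128 (IPv6) network.
        net = ipaddress.ip_network(addr)
        by_family[net.version].append(str(net))

    cmds = ["configure terminal", f"router bgp {asn}"]
    for version, prefixes in by_family.items():
        if not prefixes:
            continue
        cmds.append(f"address-family ipv{version} unicast")
        cmds.extend(f"network {prefix}" for prefix in prefixes)
        cmds.append("exit-address-family")
    return cmds


def vtysh_argv(cmds):
    """Build an argv list passing each command to vtysh with its own -c flag."""
    argv = ["vtysh"]
    for cmd in cmds:
        argv += ["-c", cmd]
    return argv


if __name__ == "__main__":
    # Hypothetical ASN and VM addresses; nothing is executed here.
    for line in advertise_commands(65001, ["10.0.0.5", "fd00::5"]):
        print(line)
```

The sketch only builds the command list; a real daemon would also need to withdraw routes when a VM is removed and reconcile against FRR's running configuration.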

We'll blog about this system at some point in the coming months.