stapedium ◴[] No.43717547[source]
I’m just a small business & homelab guy, so I’ll probably never use one of these big distributed file systems. But when people start talking petabytes, I always wonder if these things are actually backed up and what you use for backup and recovery?
replies(5): >>43717690 #>>43718697 #>>43720813 #>>43724292 #>>43726423 #
shermantanktop ◴[] No.43718697[source]
Backup and recovery is a process with a non-zero failure rate. The more you test it, the lower the rate, but there is always a failure mode.

With these systems, the runtime guarantees of data integrity are very high and the rate of actual data loss is very low. And best of all, component failure is constantly happening as a normal activity in the system, so the recovery machinery gets exercised all the time rather than only during a crisis.

So once the data integrity guarantees of your runtime system are better than those of your backup process, why back up?

There are still reasons, but they become more specific to the data being stored and less important as a general datastore feature.
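
As a rough, back-of-the-envelope illustration of that trade-off (the numbers below are made up, and real failures are correlated rather than independent): losing data at runtime requires every replica to fail, while a restore only has to fail once.

    # Toy comparison, with made-up numbers: chance of losing a piece of data
    # in a 3-way replicated store vs. chance that a rarely-tested restore fails.

    p_replica_loss = 1e-4     # assumed probability that any single replica is lost
    replicas = 3
    p_restore_failure = 1e-2  # assumed probability that a backup restore fails

    # Data is gone only if all replicas fail; this treats failures as independent,
    # which correlated disasters (fire, bad config push) deliberately violate.
    p_runtime_loss = p_replica_loss ** replicas

    print(f"runtime data loss: {p_runtime_loss:.0e}")     # 1e-12
    print(f"failed restore:    {p_restore_failure:.0e}")  # 1e-02

The comparison only holds as long as the runtime failures really are independent, which is exactly the caveat the replies below raise.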

replies(1): >>43719218 #
Eikon ◴[] No.43719218[source]
> why backup?

Because of mistakes and malicious actors...

replies(1): >>43719485 #
overfeed ◴[] No.43719485[source]
...and the "Disaster" in "Disaster recovery" may have been both localized and extensive (fire, flooding, a major earthquake, brownouts due to a faulty transformer, building collapse, a solvent tanker driving through the wall into the server room, a massive sinkhole, etc.)
replies(1): >>43720567 #
shermantanktop ◴[] No.43720567[source]
Yes, the dreaded fiber vs. backhoe. But if your distributed file system is geographically redundant, you're not exposed to that, at least from an integrity POV. It sucks that 1/3 or 1/5 or whatever of your serving fleet just disappeared, but backup won't help with that.
replies(1): >>43722349 #
overfeed ◴[] No.43722349[source]
> But if your distributed file system is geographically redundant

Redundancy and backups are not the same thing! There's some overlap, but treating them as interchangeable will occasionally result in terrible outcomes, like a config change that leaves all 5/5 datacenters fragmented and unable to form a quorum, at which point you discover your services have circular dependencies while you're trying to bootstrap the foundational ones. Local backups would solve this: each DC could load its last known good config. Rebuilding the consensus that redundancy depends on, by contrast, requires coordination with hosts that are now unreachable.
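
A rough sketch of that distinction, where everything (datacenter names, snapshot file, helper functions) is hypothetical: the redundancy path needs a quorum of peers that no longer answer, while the backup path only needs the local disk.

    # Illustrative sketch of why a local backup breaks the circular dependency
    # described above. All names (PEERS, last_known_good.json) are hypothetical.

    import json, os

    PEERS = ["dc1", "dc2", "dc3", "dc4", "dc5"]   # 5/5 datacenters, now fragmented
    QUORUM = len(PEERS) // 2 + 1                  # need 3 of 5 to rebuild consensus
    SNAPSHOT = "last_known_good.json"             # hypothetical local backup file

    def reachable(peers):
        # During the partition nothing answers, so consensus can never form.
        return []

    def bootstrap():
        if len(reachable(PEERS)) + 1 >= QUORUM:
            return "config rebuilt via quorum"    # redundancy path (unavailable here)
        if os.path.exists(SNAPSHOT):
            with open(SNAPSHOT) as f:
                return json.load(f)               # backup path: needs no coordination
        raise RuntimeError("no quorum and no local backup: bootstrap is stuck")

    # Simulate having taken a backup before the bad config push:
    with open(SNAPSHOT, "w") as f:
        json.dump({"version": "known-good"}, f)

    print(bootstrap())                            # {'version': 'known-good'}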