Bcachefs may be headed out of the kernel

btrfs is OK for a single disk. All the raid modes are not good, not just the parity modes.

The biggest reason raid btrfs is not trustable is that it has no mechanism for correctly handling a temporary device loss. It will happily rejoin an array where one of the devices didn’t see all the writes. This gives a 1/N chance of returning corrupt data for nodatacow (due to read-balancing), and for all other data it will return corrupt data according to the probability of collision of the checksum. (The default is still crc32c, so high probability for many workloads.) It apparently has no problem even with joining together a split-brained filesystem (where the two halves got distinct writes) which will happily eat itself.

One of the shittier aspects of this is that it is not clearly communicated to application developers that btrfs with nodatacow offers less data integrity than ext4 with raid, so several vendors (systemd, postgres, libvirt) turn on nodatacow by default for their data, which then gets corrupted when this problem occurs, and users won’t even know until it is too late because they didn’t enable nodatacow.

The main dev knows this is a problem but they do seem quite committed to not taking any of it seriously, given that they were arguing about it at least seven years ago[0], it’s still not fixed, and now the attitude seems to just ignore anyone who brings it up again (it comes up probably once or twice a year on the ML). Just getting them to accept documentation changes to increase awareness of the risk was like pulling teeth. It is perhaps illustrative that when Synology decided to commit to btrfs they apparently created some abomination that threads btrfs csums through md raid for error correction instead of using btrfs raid.

It is very frustrating for me because a trivial stale device bitmap written to each device would fix it totally, and more intelligently using a write intent bitmap like md, but I had to be deliberately antagonistic on the ML for the main developer to even reply at all after yet another user was caught out losing data because of this. Even then, they just said I should not talk about things I don’t understand. As far as I can tell, this is because they thought “write intent bitmap” meant a specific implementation that does not work with zone append, and I was an unserious person for not saying “write intent log” or something more generic. (This is speculation, though—they refused to engage any more when I asked for clarification, and I am not a filesystem designer, so I might actually be wrong, though I’m not sure why everyone has to suffer because a rarefied few are using zoned storage.)

A less serious but still unreasonable behaviour is that btrfs is designed to immediately go read-only if redundancy is lost, so even if you could write to the remaining good device(s), it will force you to lose anything still in transit/memory if you lose redundancy. (Except that it also doesn’t detect when a device drops through e.g. a dm layer, so you can actually ‘only’ have to deal with the much bigger first problem if you are using FDE or similar.) You could always mount with `-o degraded` to avoid this but then you are opening yourself up to inadvertently destroying your array due to the first problem if you have some thing like a backplane power issue.

Finally, unlike traditional raid, btrfs tools don’t make it possible to handle an online removal of an unhealthy device without risking data loss because in order to remove an unhealthy but extant device you must first reduce the redundancy of the array—but doing that will just cause btrfs to rebalance across all the devices, including the unhealthy one, and potentially taking corrupt data from the bad device and overwriting on the good device, or just losing the whole array if the unhealthy device fails totally during the two required rebalances.

There are some other issues where it becomes basically impossible to recover a filesystem that is very full because you cannot even delete files any more but I think this is similar on all CoW filesystems. This at least won’t eat data directly, but will cause downtime and expense to rebuild the filesystem.

The last time I was paying attention a few months ago, most of the work going into btrfs seemed to be all about improving performance and zoned devices. They won’t reply to any questions or offers for funding or personnel to complete work. It’s all very weird and unfortunate.

[0] https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg...