144 points ksec | 12 comments
anonfordays ◴[] No.44468023[source]
Linux needs a true answer to ZFS that's not btrfs. Sadly, the ship has sailed for btrfs: after 15+ years it's still not something trustable.

Apparently bcachefs won't be the successor. Filesystem development for Linux needs a big shakeup.

replies(3): >>44468201 #>>44468282 #>>44477142 #
1. em-bee ◴[] No.44468282[source]
several people i know have been using btrfs without problems for years now. i use it on half a dozen devices. what's your evidence that it is not trustable?
replies(6): >>44468404 #>>44469005 #>>44469921 #>>44470426 #>>44470963 #>>44474083 #
2. rcxdude ◴[] No.44469005[source]
Many reports of data loss or even complete filesystem loss, often in very straightforward scenarios.
3. yjftsjthsd-h ◴[] No.44469921[source]
In this case, some people using it and not having problems is much less interesting than some people who are having problems. As a former user who lost two root filesystems to BTRFS, I'm not touching it for a long time.
4. csnover ◴[] No.44470426[source]
btrfs is OK for a single disk. None of the raid modes are good, not just the parity ones.

The biggest reason raid btrfs is not trustable is that it has no mechanism for correctly handling a temporary device loss. It will happily rejoin an array where one of the devices didn’t see all the writes. This gives a 1/N chance of returning corrupt data for nodatacow (due to read-balancing), and for all other data it will return corrupt data according to the probability of collision of the checksum. (The default is still crc32c, so high probability for many workloads.) It apparently has no problem even with joining together a split-brained filesystem (where the two halves got distinct writes) which will happily eat itself.

One of the shittier aspects of this is that it is not clearly communicated to application developers that btrfs with nodatacow offers less data integrity than ext4 with raid, so several vendors (systemd, postgres, libvirt) turn on nodatacow by default for their data. That data then gets corrupted when this problem occurs, and users won’t even know until it is too late, because they never enabled nodatacow themselves and don’t realize it is in effect.
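
Since vendors flip this on silently, it is worth knowing how to check. A minimal C sketch (the path is only a hypothetical example) that asks via the FS_IOC_GETFLAGS ioctl whether a file or directory already carries the No_COW attribute:

    /* Sketch: report whether a path has the No_COW attribute (FS_NOCOW_FL),
       which on btrfs also means its data is not checksummed.
       The path below is only an example. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    int main(void)
    {
        const char *path = "/var/lib/libvirt/images";  /* hypothetical example */
        int fd = open(path, O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        int flags = 0;
        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
            perror("FS_IOC_GETFLAGS");
            close(fd);
            return 1;
        }

        printf("%s: No_COW is %s\n", path,
               (flags & FS_NOCOW_FL) ? "set (no data checksums on btrfs)"
                                     : "not set");
        close(fd);
        return 0;
    }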

The main dev knows this is a problem but they do seem quite committed to not taking any of it seriously, given that they were arguing about it at least seven years ago[0], it’s still not fixed, and now the attitude seems to be to just ignore anyone who brings it up again (it comes up probably once or twice a year on the ML). Just getting them to accept documentation changes to increase awareness of the risk was like pulling teeth. It is perhaps illustrative that when Synology decided to commit to btrfs, they apparently created some abomination that threads btrfs csums through md raid for error correction instead of using btrfs raid.

It is very frustrating for me because a trivial stale-device bitmap written to each device would fix it completely (or, more intelligently, a write intent bitmap like md’s), but I had to be deliberately antagonistic on the ML for the main developer to even reply at all after yet another user was caught out losing data because of this. Even then, they just said I should not talk about things I don’t understand. As far as I can tell, this is because they thought “write intent bitmap” meant a specific implementation that does not work with zone append, and that I was an unserious person for not saying “write intent log” or something more generic. (This is speculation, though; they refused to engage any more when I asked for clarification, and I am not a filesystem designer, so I might actually be wrong, though I’m not sure why everyone has to suffer because a rarefied few are using zoned storage.)
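
Roughly, the bookkeeping could look like this toy sketch (invented names and layout, nothing like real btrfs or md on-disk structures): each device's superblock records which peers missed writes, and a reappearing device is only trusted again after a resync.

    /* Toy illustration of a stale-device bitmap - all names and the
       layout are invented, not a real btrfs/md format. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct dev_sb {
        bool     present;     /* device currently part of the array  */
        uint64_t stale_mask;  /* bit i set => device i missed writes */
    };

    /* A commit went ahead while device 'missing' was absent: record that
       fact on every surviving device, persisted with the same commit. */
    static void mark_stale(struct dev_sb sb[], int ndevs, int missing)
    {
        for (int i = 0; i < ndevs; i++)
            if (i != missing && sb[i].present)
                sb[i].stale_mask |= UINT64_C(1) << missing;
    }

    /* Device 'dev' reappeared: may we serve reads from it, or must it be
       resynced first? */
    static bool needs_resync(const struct dev_sb sb[], int ndevs, int dev)
    {
        for (int i = 0; i < ndevs; i++)
            if (i != dev && sb[i].present &&
                (sb[i].stale_mask & (UINT64_C(1) << dev)))
                return true;
        return false;
    }

    int main(void)
    {
        struct dev_sb sb[2] = { { .present = true }, { .present = true } };

        sb[1].present = false;   /* device 1 drops out...        */
        mark_stale(sb, 2, 1);    /* ...while writes keep landing */
        sb[1].present = true;    /* device 1 comes back          */

        printf("read-balance from device 1? %s\n",
               needs_resync(sb, 2, 1) ? "no, resync it first" : "yes");
        return 0;
    }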

A less serious but still unreasonable behaviour is that btrfs is designed to immediately go read-only as soon as redundancy is lost, so even if you could still write to the remaining good device(s), it forces you to lose anything still in transit or in memory. (Except that it also doesn’t detect when a device drops through e.g. a dm layer, so if you are using FDE or similar you may ‘only’ have to deal with the much bigger first problem.) You could always mount with `-o degraded` to avoid this, but then you are opening yourself up to inadvertently destroying your array via the first problem if you have something like a backplane power issue.

Finally, unlike traditional raid, the btrfs tools don’t make it possible to remove an unhealthy device online without risking data loss. To remove an unhealthy but still-present device you must first reduce the redundancy of the array, but doing that causes btrfs to rebalance across all the devices, including the unhealthy one, potentially copying corrupt data from the bad device over good data on the good device, or losing the whole array outright if the unhealthy device fails completely during the two required rebalances.

There are some other issues where it becomes basically impossible to recover a filesystem that is very full, because you cannot even delete files any more, but I think this is similar on all CoW filesystems. This at least won’t eat data directly, but it will cause downtime and the expense of rebuilding the filesystem.

The last time I was paying attention a few months ago, most of the work going into btrfs seemed to be all about improving performance and zoned devices. They won’t reply to any questions or offers for funding or personnel to complete work. It’s all very weird and unfortunate.

[0] https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg...

replies(3): >>44472749 #>>44472782 #>>44472881 #
5. em-bee ◴[] No.44470963[source]
i know it's not appropriate to complain about downvotes, but anonfordays responded to my question with an actual answer ( https://news.ycombinator.com/item?id=44468404 ) and, more importantly, with a link to the btrfs status page ( https://btrfs.readthedocs.io/en/latest/Status.html ) that i was not aware of (but as a btrfs user should have been), and you all downvoted that to death. why? what possible disagreement could you have with that?
replies(1): >>44475373 #
6. koverstreet ◴[] No.44472749[source]
> The biggest reason raid btrfs is not trustable is that it has no mechanism for correctly handling a temporary device loss. It will happily rejoin an array where one of the devices didn’t see all the writes. This gives a 1/N chance of returning corrupt data for nodatacow (due to read-balancing), and for all other data it will return corrupt data according to the probability of collision of the checksum. (The default is still crc32c, so high probability for many workloads.) It apparently has no problem even with joining together a split-brained filesystem (where the two halves got distinct writes) which will happily eat itself.

That is just mind-bogglingly inept. (And thanks, I hadn't heard THIS one before.)

For nocow mode, there is a bloody simple solution: you just fall back to a cow write if you can't write to every replica. And considering you have to have the cow fallback anyway - maybe the data is compressed, or you just took a snapshot, or the replication level is different - you have to work really hard or be really inept to screw this one up.
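
A rough sketch of that decision in C (invented names, not actual bcachefs or btrfs code): overwrite in place only when every replica is reachable and nothing forces a rewrite, otherwise take the cow path that has to exist anyway.

    /* Sketch of the nocow write decision described above - invented names,
       not actual bcachefs or btrfs code. */
    #include <stdbool.h>
    #include <stdio.h>

    struct extent {
        bool nocow;          /* extent is marked nodatacow              */
        bool snapshotted;    /* extent is shared with a snapshot        */
        bool compressed;     /* on-disk bytes differ from the buffer    */
        int  replicas_ok;    /* replicas currently reachable            */
        int  replicas_want;  /* replicas the extent is supposed to have */
    };

    static void overwrite_in_place(struct extent *e) { (void)e; puts("in-place overwrite"); }
    static void cow_write(struct extent *e)          { (void)e; puts("cow write"); }

    static void write_extent(struct extent *e)
    {
        if (e->nocow &&
            !e->snapshotted &&
            !e->compressed &&
            e->replicas_ok == e->replicas_want)
            overwrite_in_place(e);  /* safe: every replica sees the write */
        else
            cow_write(e);           /* degraded, shared, or re-encoded:
                                       fall back to the cow path */
    }

    int main(void)
    {
        struct extent healthy  = { .nocow = true, .replicas_ok = 2, .replicas_want = 2 };
        struct extent degraded = { .nocow = true, .replicas_ok = 1, .replicas_want = 2 };

        write_extent(&healthy);   /* prints "in-place overwrite" */
        write_extent(&degraded);  /* prints "cow write"          */
        return 0;
    }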

I honestly have no idea how you'd get this wrong in cow mode. The whole point of a cow filesystem is that it makes these sorts of problems go away.

I'm not even going to go through the rest of the list, but suffice it to say: every single broken thing I've ever seen mentioned about btrfs multi-device mode is fixed in bcachefs.

Every. Single. One. And it's not like I ever looked at btrfs for a list of things to make sure I got right, but every time someone mentions one of these things I'll check the code if I don't remember (some of this code I wrote 10 years ago), and I have yet to see someone mention something broken about btrfs multi-device mode that bcachefs doesn't get right.

It's honestly mind-boggling.

7. koverstreet ◴[] No.44472782[source]
> The last time I was paying attention a few months ago, most of the work going into btrfs seemed to be all about improving performance and zoned devices. They won’t reply to any questions or offers for funding or personnel to complete work. It’s all very weird and unfortunate.

By the way, if that was serious, bcachefs would love the help, and more people are joining the party.

I would love to find someone to take over erasure coding and finish it off.

replies(1): >>44477832 #
8. tobias3 ◴[] No.44472881[source]
The btrfs devs are mainly employed by Meta and SuSE, and they only support single-device filesystems (I haven't checked recently whether SuSE supports multi-device setups). Meta probably uses zoned storage devices, which is why they are focusing on that.

Unfortunately I don't think Patreon can fund the kind of talent you need to sustainably develop a file system.

That btrfs contains broken features is IMO 50/50 the fault of upstream and the distributions. Distributions should patch out features that are broken (like btrfs multi-device support and direct IO) or clearly put them behind experimental flags. Upstream is unfortunately incentivised not to do this, in order to get testers.

replies(1): >>44473300 #
9. koverstreet ◴[] No.44473300{3}[source]
Patreon has never been my main source of funding. (It has been a very helpful backstop though!)

But I do badly need more funding; this would go better with a real team behind it. Right now I'm trying to find the money to bring Alan Huang on full time: he's fresh out of school but very sharp and motivated, and he's already been doing excellent work.

If anyone can help with that, hit me up :)

10. tandr ◴[] No.44474083[source]
I tried it as the FS for a data volume (200GB) on Linux a year ago, after reading how stable it is “now”. The first hard crash made it unrecoverable no matter what I tried. Never again.
11. anonfordays ◴[] No.44475373[source]
I did not downvote you, and my post was flagged or dead:

https://btrfs.readthedocs.io/en/latest/Status.html

The amount of “mostly OK” on that page says it all, and the RAID6 implementation is still “unstable”. I'm not going to trust a file system with “mostly OK” device replace. Anecdotally, you can search the LKML and here for tons of data-loss stories.

12. csnover ◴[] No.44477832{3}[source]
In my case it was a last-ditch effort to get them to explain what was keeping them from making raid actually safe. Others have offered more concrete support more recently[0]; I guess you could try reaching out to them, though I suppose they are interested in funding btrfs because they are using btrfs.

I share the sentiment of others in this discussion: I hope you are able to resolve the process issues so that bcachefs does become a viable long-term filesystem. There likely won’t be any funding from anyone ever if it looks like it’s going to get the boot. btrfs also has substantial project-management issues (take a look at the graveyard of untriaged bug reports on kernel.org as one more example[1]); they just manage to keep theirs under the radar.

[0] https://lore.kernel.org/linux-btrfs/CAEFpDz+R3rLW8iujSd2m4jH...

[1] https://bugzilla.kernel.org/buglist.cgi?bug_status=__open__&...