Bcachefs may be headed out of the kernel

(lwn.net)

Show context

msgodel ◴[04 Jul 25 17:53 UTC] No.44466535[source]▶

The older I get the more I feel like anything other than the ExtantFS family is just silly.

The filesystem should do files, if you want something more complex do it in userspace. We even have FUSE if you want to use the Filesystem API with your crazy network database thing.

replies(3): >>44466685 #>>44466895 #>>44467306 #

anonnon ◴[04 Jul 25 18:10 UTC] No.44466685[source]▶

>>44466535 #

> The older I get the more I feel like anything other than the ExtantFS family is just silly.

The extended (not extant) family (including ext4) don't support copy-on-write. Using them as your primary FS after 2020 (or even 2010) is like using a non-journaling file system after 2010 (or even 2001)--it's a non-negotiable feature at this point. Btrfs has been stable for a decade, and if you don't like or trust it, there's always ZFS, which has been stable 20 years now. Apple now has AppFS, with CoW, on all their devices, while MSFT still treats ReFS as unstable, and Windows servers still rely heavily on NTFS.

replies(7): >>44466706 #>>44466709 #>>44466817 #>>44467125 #>>44467236 #>>44467926 #>>44468462 #

1. NewJazz ◴[04 Jul 25 19:33 UTC] No.44467236[source]▶

>>44466685 #

CoW is an efficiency gain. Does it do anything to ensure data integrity, like journaling does? I think it is an unreasonable comparison you are making.

replies(4): >>44467447 #>>44467606 #>>44467639 #>>44474557 #

2. webstrand ◴[04 Jul 25 20:04 UTC] No.44467447[source]▶

>>44467236 (TP) #

I use CoW a lot just managing files. It's only an efficiency gain if you have enough space to do the data-copying operation. And that's not necessarily true in all cases.

Being able to quickly take a "backup" copy of some multi-gb directory tree before performing some potentially destructive operation on it is such a nice safety net to have.

It's also a handy way to backup file metadata, like mtime, without having to design a file format for mapping saved mtimes back to their host files.

3. anonnon ◴[04 Jul 25 20:25 UTC] No.44467606[source]▶

>>44467236 (TP) #

> CoW is an efficiency gain.

You're thinking of the optimization technique of CoW, as in what Linux does when spawning a new thread or forking a process. I'm talking about it in the context of only ever modifying copies of file system data and metadata blocks, for the purpose of ensuring file system integrity, even in the context of sudden power loss (EDIT: wrong link): https://www.qnx.com/developers/docs/8.0/com.qnx.doc.neutrino...

If anything, ordinary file IO is likely to be slightly slower on a CoW file system, due to it always having to copy a block before said block can be modified and updating block pointers.

4. throw0101d ◴[04 Jul 25 20:29 UTC] No.44467639[source]▶

>>44467236 (TP) #

> Does it do anything to ensure data integrity, like journaling does?

What kind of journaling though? By default ext4 only uses journaling for metadata updates, not data updates (see "ordered" mode in ext4(5)).

So if you have a (e.g.) 1000MB file, and you update 200MB in the middle of it, you can have a situation where the first 100MB is written out and the system dies with the other 100MB vanishing.

With a CoW, if the second 100MB is not written out and the file sync'd, then on system recovery you're back to the original file being completely intact. With ext4 in the default configuration you have a file that has both new-100MB and stale-100MB in the middle of it.

The updating of the file data and the metadata are two separate steps (by default) in ext4:

* https://www.baeldung.com/linux/ext-journal-modes

* https://michael.kjorling.se/blog/2024/ext4-defaulting-to-dat...

* https://fy.blackhats.net.au/blog/2024-08-13-linux-filesystem...

Whereas with a proper CoW (like ZFS), updates are ACID.

replies(1): >>44474604 #

5. ryao ◴[05 Jul 25 18:26 UTC] No.44474557[source]▶

>>44467236 (TP) #

In what way do you consider CoW to be an efficiency gain? Traditionally, it is considered more expensive due to write amplification. In place filesystems such as XFS tend to be more efficient in terms of IOPs and CoW filesystems need to do many tricks to be close to them.

As for ensuring data integrity, I cannot talk about other CoW filesystems, but ZFS has an atomic transaction commit that relies on CoW. In ZFS, your changes either happened or they did not happen. The entire file system is a giant merkle tree and every change requires that all nodes of the tree up to the root be rewritten. To minimize the penalty of CoW, these changes are aggregated into transaction groups that are then committed atomically. Thus, you simultaneously have both the old and new versions available, plus possible more than just 1 old version. ZFS will start recycling space after a couple transaction group commits, but often, you can go further back in its history if needed after some catastrophic event, although ZFS makes no solid guarantee of this (until you fiddle with module parameter settings to prevent reclaim from being so aggressive).

If it counts for anything, I have hundreds of commits in OpenZFS, so I am fairly familiar with how ZFS works internally.

6. ryao ◴[05 Jul 25 18:33 UTC] No.44474604[source]▶

>>44467639 #

Large file writes are an exception in ZFS. They are broken into multiple transactions, which can go into multiple transaction groups, such that the updates are not ACID. You can see this in the code here:

https://github.com/openzfs/zfs/blob/6af8db61b1ea489ade2d5344...

Small writes on ZFS are ACID. If ZFS made large writes ACID, large writes could block the transaction group commit for arbitrarily long periods, which is why it does not. Just imagine writing a 1PB file. It would likely take a long time (days?) and it is just not reasonable to block the transaction group commit until it finishes.

That said, for your example, you will often have all of the writes go into the same transaction group commit, such that it becomes ACID, but this is not a strict guarantee. The maximum atomic write size on ZFS is 32MB, assuming alignment. If the write is not aligned to the record size, it will be smaller, as per:

https://github.com/openzfs/zfs/blob/6af8db61b1ea489ade2d5344...

↑