The filesystem should do files, if you want something more complex do it in userspace. We even have FUSE if you want to use the Filesystem API with your crazy network database thing.
The extended (not extant) family (including ext4) doesn't support copy-on-write. Using them as your primary FS after 2020 (or even 2010) is like using a non-journaling file system after 2010 (or even 2001): CoW is a non-negotiable feature at this point. Btrfs has been stable for a decade, and if you don't like or trust it, there's always ZFS, which has been stable for 20 years now. Apple now has APFS, with CoW, on all their devices, while MSFT still treats ReFS as unstable, and Windows servers still rely heavily on NTFS.
They seem to be slowly introducing it to the masses: Dev Drives you set up on Windows automatically use ReFS.
Being able to quickly take a "backup" copy of some multi-GB directory tree before performing some potentially destructive operation on it is such a nice safety net to have.
It's also a handy way to back up file metadata, like mtime, without having to design a file format for mapping saved mtimes back to their host files.
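For anyone who wants to script that kind of quick "backup" copy, here is a rough sketch, assuming Linux and Python. It tries the FICLONE ioctl (the same mechanism behind `cp --reflink`) so that the copy shares blocks with the original on file systems that support it, and falls back to a plain byte copy elsewhere. The helper names `clone_file` and `snapshot_tree` are made up for this example.

```python
import fcntl
import shutil

# FICLONE ioctl request number from <linux/fs.h>: _IOW(0x94, 9, int).
FICLONE = 0x40049409


def clone_file(src, dst):
    """Copy src to dst, sharing blocks via reflink when the file system allows it.

    Returns "reflink" or "copy" so the caller can see which path was taken.
    """
    method = "copy"
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        try:
            # Ask the kernel to make dst share src's data blocks.
            fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
            method = "reflink"
        except OSError:
            # No reflink support here (e.g. ext4, tmpfs): do a byte copy.
            shutil.copyfileobj(fsrc, fdst)
    # Carry over mtime, atime, and permission bits.
    shutil.copystat(src, dst)
    return method


def snapshot_tree(src_dir, dst_dir):
    """Clone a whole directory tree; dst_dir must not exist yet."""
    return shutil.copytree(src_dir, dst_dir, copy_function=clone_file)
```

On Btrfs, or XFS formatted with reflink support, the clone path should finish almost immediately regardless of file size. On anything else this degrades to an ordinary copy, so it is a convenience rather than a guarantee.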
You're thinking of the optimization technique of CoW, as in what Linux does when spawning a new thread or forking a process. I'm talking about it in the context of only ever modifying copies of file system data and metadata blocks, for the purpose of ensuring file system integrity, even in the context of sudden power loss (EDIT: wrong link): https://www.qnx.com/developers/docs/8.0/com.qnx.doc.neutrino...
If anything, ordinary file IO is likely to be slightly slower on a CoW file system, due to it always having to copy a block before said block can be modified and updating block pointers.
What kind of journaling though? By default ext4 only uses journaling for metadata updates, not data updates (see "ordered" mode in ext4(5)).
So if you have a (e.g.) 1000MB file, and you update 200MB in the middle of it, you can have a situation where the first 100MB is written out and the system dies with the other 100MB vanishing.
With a CoW file system, if the second 100MB is not written out and the file sync'd, then on system recovery you're back to the original file being completely intact. With ext4 in the default configuration you have a file that has both new-100MB and stale-100MB in the middle of it.
The updating of the file data and the metadata are two separate steps (by default) in ext4:
* https://www.baeldung.com/linux/ext-journal-modes
* https://michael.kjorling.se/blog/2024/ext4-defaulting-to-dat...
* https://fy.blackhats.net.au/blog/2024-08-13-linux-filesystem...
Whereas with a proper CoW (like ZFS), updates are ACID.
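If you are stuck on a file system that can tear a large update like this, the usual userspace workaround follows the same idea as CoW: never touch the original, write a complete new copy, sync it, then swap it in with an atomic rename. A minimal sketch in Python, assuming a POSIX system; `atomic_update` is a hypothetical helper name, and this is not what ZFS does internally, only an application-level analogue that costs a full copy of the file.

```python
import os
import shutil
import tempfile


def atomic_update(path, offset, data):
    """Overwrite `data` at `offset` so that a crash leaves either the old
    file or the fully updated file, never a mix of the two."""
    dir_name = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same file system for rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "wb") as tmp, open(path, "rb") as original:
            # Work on a full copy; the original is never modified in place.
            shutil.copyfileobj(original, tmp)
            tmp.seek(offset)
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # new data is durable before the swap
        os.replace(tmp_path, path)  # atomic swap of the directory entry
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory so the rename itself survives a power loss.
    dir_fd = os.open(dir_name, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

For the 1000MB example this means rewriting the whole file to change 200MB of it, which is exactly the cost a CoW file system avoids by doing the same trick per block instead of per file.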
Speed is sometimes more important than absolute reliability, but it’s still an undesirable tradeoff.
No, it doesn't. Maybe you're thinking of shadow volume copies or something else. CoW file systems never modify data or metadata blocks directly, only modifying copies, with the root of the updated block pointer graph only updated after all other changes have been synced. Read this: https://www.qnx.com/developers/docs/8.0/com.qnx.doc.neutrino...
> while MSFT still treats ReFS as unstable, and Windows servers still rely heavily on NTFS.
By this I implied it's an embarrassment to MSFT that iOS devices have a better, more reliable file system (APFS) than even Windows servers now (having to rely on NTFS until ReFS is ready for prime time). If HN users and mods didn't tone-police so heavily, I could state things more frankly.
However, XFS has also supported snapshots for a long time.
See for example:
https://thelinuxcode.com/xfs-snapshot/
I am not sure what you mean by "whole-volume" snapshots, but I have not noticed any restrictions in the use of the XFS snapshots. As expected, they store a snapshot of the entire file system, which can be restored later.
In many decades of managing computers with all kinds of operating systems and file systems, on a variety of servers and personal computers, I have never had the need to shrink a file system. I cannot imagine how such a need can arise, except perhaps as a consequence of bad planning. It has also been many decades since I stopped using multiple partitions on a storage device, with the exception of bootable devices, which must have a dedicated partition for booting, conforming to the BIOS or UEFI expectations. For anything that was done in the ancient times with multiple partitions there are better alternatives now. With the exception of bootable USB sticks with live Linux or FreeBSD partitions, I use XFS on whole SSDs or HDDs (i.e. unpartitioned), regardless of whether they are internal or external, so there is never any need for changing the size of the file system.
Even so, copying a file system to an external device, reformatting the device and copying the file system back is not likely to be significantly slower than shrinking in place. In fact sometimes it can be faster and it has the additional benefit that the new copy of the file system will be defragmented.
Much more significant than the lack of shrinking ability, which may slightly slow down something that occurs very seldom, is that both EXT4 and XFS are much faster for most applications than the other file systems available for Linux, so they are fast for the frequent operations. You may choose another file system for other reasons, but choosing it to speed up a very rare operation like shrinking is a very weak reason.
E.g. back in ~2013-2014, while moving a bare-metal Windows server into VMware, shrinking and then optimizing the MFT helped save, AFAIR, 2 hours of the downtime window.
> except perhaps as a consequence of bad planning
Assuming people go to clouds instead of physical servers because they may need to add 100 more nodes "suddenly" (the selling point of clouds is "avoid planning"), one may expect the need for shrinking to be rising, not falling. It may be mitigated by different approaches of course, e.g. it's often easier to set the VM up again, but still.
In migrations you normally copy the file system elsewhere, to the cloud or to different computers; you do not shrink it in place, which is what XFS cannot do. Unlike with Windows, copying Linux file systems, including XFS, during migrations to different hardware is trivial and fast. The same is true for replicating a file system to a big set of computers.
Shrinking in place is normally needed only when you share a physical device between 2 different operating systems, which use incompatible file systems, e.g. Windows and Linux, and you discover that you did not partition the physical device well and you want to shrink the partition allocated for one of the operating systems, in order to be able to expand the partition allocated for the other operating system.
Sharing physical devices between Windows and any other operating systems comes with a lot of risks and disadvantages, so I strongly recommend against it. I stopped sharing Windows disks decades ago. Now, if I want to use the same computer in Windows and in another operating system, e.g. Linux or FreeBSD, I install Windows on the internal SSD, and, when desired, I boot Linux or FreeBSD from an external SSD. Thus the problem of reallocating a shared SSD/HDD by shrinking a partition never arises.
As for ensuring data integrity, I cannot talk about other CoW filesystems, but ZFS has an atomic transaction commit that relies on CoW. In ZFS, your changes either happened or they did not happen. The entire file system is a giant Merkle tree and every change requires that all nodes of the tree up to the root be rewritten. To minimize the penalty of CoW, these changes are aggregated into transaction groups that are then committed atomically. Thus, you simultaneously have both the old and new versions available, plus possibly more than just 1 old version. ZFS will start recycling space after a couple of transaction group commits, but often, you can go further back in its history if needed after some catastrophic event, although ZFS makes no solid guarantee of this (unless you fiddle with module parameter settings to prevent reclaim from being so aggressive).
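To make the "rewrite every node up to the root" part concrete, here is a toy in-memory model in Python. It is only a sketch of the general idea (path copying in a hash tree); it is not ZFS's on-disk format, and `Node`, `build`, `update`, and `read` are names invented for this example. It assumes a power-of-two number of blocks to keep the code short.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: nodes can never be modified in place
class Node:
    digest: bytes
    data: bytes = b""
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaf(data):
    return Node(hashlib.sha256(data).digest(), data=data)


def branch(left, right):
    # A parent's digest covers both children, so any change below alters it.
    digest = hashlib.sha256(left.digest + right.digest).digest()
    return Node(digest, left=left, right=right)


def build(blocks):
    """Build a tree over a power-of-two number of data blocks."""
    nodes = [leaf(b) for b in blocks]
    while len(nodes) > 1:
        nodes = [branch(nodes[i], nodes[i + 1]) for i in range(0, len(nodes), 2)]
    return nodes[0]


def update(root, index, size, data):
    """Return a NEW root with block `index` replaced.

    Only the nodes on the path from the leaf to the root are recreated;
    every other subtree is shared with the old tree.
    """
    if size == 1:
        return leaf(data)
    half = size // 2
    if index < half:
        return branch(update(root.left, index, half, data), root.right)
    return branch(root.left, update(root.right, index - half, half, data))


def read(root, index, size):
    """Walk from the root down to block `index` and return its data."""
    while size > 1:
        half = size // 2
        if index < half:
            root = root.left
        else:
            root, index = root.right, index - half
        size = half
    return root.data
```

Holding on to the old root after an `update` keeps the old version fully readable, which is the in-memory equivalent of having both the old and new versions available. The "commit" is just the moment you start treating the new root as current.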
If it counts for anything, I have hundreds of commits in OpenZFS, so I am fairly familiar with how ZFS works internally.
https://github.com/openzfs/zfs/blob/6af8db61b1ea489ade2d5344...
Small writes on ZFS are ACID. If ZFS made large writes ACID, large writes could block the transaction group commit for arbitrarily long periods, which is why it does not. Just imagine writing a 1PB file. It would likely take a long time (days?) and it is just not reasonable to block the transaction group commit until it finishes.
That said, for your example, you will often have all of the writes go into the same transaction group commit, such that it becomes ACID, but this is not a strict guarantee. The maximum atomic write size on ZFS is 32MB, assuming alignment. If the write is not aligned to the record size, it will be smaller, as per:
https://github.com/openzfs/zfs/blob/6af8db61b1ea489ade2d5344...