Most active commenters
  • anonnon(4)
  • yjftsjthsd-h(4)
  • timewizard(4)
  • throw0101d(4)
  • msgodel(3)
  • ryao(3)

144 points ksec | 56 comments
1. msgodel ◴[] No.44466535[source]
The older I get the more I feel like anything other than the ExtantFS family is just silly.

The filesystem should do files; if you want something more complex, do it in userspace. We even have FUSE if you want to use the Filesystem API with your crazy network database thing.

replies(3): >>44466685 #>>44466895 #>>44467306 #
2. anonnon ◴[] No.44466685[source]
> The older I get the more I feel like anything other than the ExtantFS family is just silly.

The extended (not extant) family (including ext4) doesn't support copy-on-write. Using them as your primary FS after 2020 (or even 2010) is like using a non-journaling file system after 2010 (or even 2001)--it's a non-negotiable feature at this point. Btrfs has been stable for a decade, and if you don't like or trust it, there's always ZFS, which has been stable 20 years now. Apple now has APFS, with CoW, on all their devices, while MSFT still treats ReFS as unstable, and Windows servers still rely heavily on NTFS.

replies(7): >>44466706 #>>44466709 #>>44466817 #>>44467125 #>>44467236 #>>44467926 #>>44468462 #
3. msgodel ◴[] No.44466706[source]
Again, I don't really want the kernel managing a database for me like that; the few applications that need that can do it themselves just fine. (IME mostly just RDBMSs and QEMU.)
4. robotnikman ◴[] No.44466709[source]
>Windows will at some point have ReFS

They seem to be slowly introducing it to the masses; Dev Drives you set up on Windows automatically use ReFS

5. milkey_mouse ◴[] No.44466817[source]
Hell, there's XFS if you love stability but want CoW.
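
(A minimal sketch of the XFS route; /dev/sdb1 and the paths are placeholders, and reflink support has to be enabled at mkfs time, which recent xfsprogs does by default:)

    mkfs.xfs -m reflink=1 /dev/sdb1      # explicit; recent versions default to reflink=1
    mount /dev/sdb1 /mnt/data
    xfs_info /mnt/data | grep reflink    # confirm reflink=1
    cp --reflink=always /mnt/data/big.img /mnt/data/big.clone   # CoW copy: shares blocks until modified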
replies(1): >>44466882 #
6. josephcsible ◴[] No.44466882{3}[source]
XFS doesn't support whole-volume snapshots, which is the main reason I want CoW filesystems. And it also stands out as being basically the only filesystem that you can't arbitrarily shrink without needing to wipe and reformat.
replies(4): >>44467133 #>>44467505 #>>44468243 #>>44470078 #
7. yjftsjthsd-h ◴[] No.44466895[source]
I mean, I'd really like some sort of data error detection (and ideally correction). If a disk bitflips one of my files, ext* won't do anything about it.
replies(3): >>44467338 #>>44468600 #>>44469211 #
8. leogao ◴[] No.44467125[source]
btrfs has eaten my data within the last decade. (not even because of the broken erasure coding, which I was careful to avoid!) not sure I'm willing to give it another chance. I'd much rather use zfs.
replies(1): >>44468255 #
9. leogao ◴[] No.44467133{4}[source]
you can always have an LVM layer for atomic snapshots
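
(A rough sketch, assuming a volume group "vg0" with a logical volume "data" and enough free extents left in the VG for the snapshot's CoW area:)

    lvcreate --size 10G --snapshot --name data-snap /dev/vg0/data   # point-in-time snapshot
    # ...do the risky work on vg0/data...
    lvconvert --merge /dev/vg0/data-snap   # roll the origin back to the snapshot, discarding later changes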
replies(2): >>44467244 #>>44476293 #
10. NewJazz ◴[] No.44467236[source]
CoW is an efficiency gain. Does it do anything to ensure data integrity, like journaling does? I think it is an unreasonable comparison you are making.
replies(4): >>44467447 #>>44467606 #>>44467639 #>>44474557 #
11. josephcsible ◴[] No.44467244{5}[source]
There are advantages to having the filesystem do the snapshots itself. For example, if you have a really big file that you keep deleting and restoring from a snapshot, you'll only pay the cost of the space once with Btrfs, but will pay it every time over with LVM.
12. heavyset_go ◴[] No.44467306[source]
Transparent compression, checksumming, copy-on-write, snapshots and virtual subvolumes should be considered the minimum default feature set for new OS installations in TYOOL 2025.

You get that with APFS by default on macOS these days and those features come for free in btrfs, some in XFS, etc on Linux.
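
(For reference, a sketch of what that looks like on btrfs; the device and mount point are placeholders:)

    mount -o compress=zstd /dev/sdb1 /data                            # transparent compression
    btrfs subvolume create /data/projects                             # subvolume
    btrfs subvolume snapshot -r /data/projects /data/projects@snap    # read-only snapshot
    btrfs scrub start /data                                           # verify data/metadata checksums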

replies(1): >>44467710 #
13. timewizard ◴[] No.44467338[source]
> some sort of data error detection (and ideally correction).

That's pretty much built into most mass storage devices already.

> If a disk bitflips one of my files

The likelihood and consequence of this occurring is in many situations not worth the overhead of adding additional ECC on top of what the drive does.

> ext* won't do anything about it.

What should it do? Blindly hand you the data without any indication that there's a problem with the underlying block? Without an fsck what mechanism do you suppose would manage these errors as they're discovered?

replies(3): >>44467434 #>>44467818 #>>44468075 #
14. throw0101d ◴[] No.44467434{3}[source]
>> > some sort of data error detection (and ideally correction).

> That's pretty much built into most mass storage devices already.

And ZFS has shown that it is not sufficient (at least for some use-cases, perhaps less of a big deal for 'residential' users).

> The likelihood and consequence of this occurring is in many situations not worth the overhead of adding additional ECC on top of what the drive does.

Not worth it to whom? Not having the option available at all is the problem. I can do a zfs set checksum=off pool_name/dataset_name if I really want that extra couple percentage points of performance.

> Without an fsck what mechanism do you suppose would manage these errors as they're discovered?

Depends on the data involved: if it's part of the file system tree metadata there are often multiple copies even for a single disk on ZFS. So instead of the kernel consuming corrupted data and potentially panicking (or going off into the weeds) it can find a correct copy elsewhere.

If you're in a fancier configuration with some level of RAID, then there could be other copies of the data, or it could be rebuilt through ECC.

With ext*, LVM, and mdadm no such possibility exists because there are no checksums at any of those layers (perhaps if you glom on dm-integrity?).

And with ZFS one can set copies=2 on a per-dataset basis (perhaps just for /home?), and get multiple copies strewn across the disk: won't save you from a drive dying, but could save you from corruption.
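
(For example, a minimal sketch assuming a pool named "tank":)

    zfs set copies=2 tank/home         # keep two copies of every block in this dataset
    zfs set checksum=off tank/scratch  # the opt-out mentioned above, if you really want it
    zpool scrub tank                   # walk the pool, verifying (and where possible repairing) blocks
    zpool status -v tank               # report any files with unrecoverable errors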

replies(2): >>44468039 #>>44469707 #
15. webstrand ◴[] No.44467447{3}[source]
I use CoW a lot just managing files. Calling it only an efficiency gain assumes you'd have enough space to make the full copy in the first place, and that's not necessarily true in all cases.

Being able to quickly take a "backup" copy of some multi-gb directory tree before performing some potentially destructive operation on it is such a nice safety net to have.

It's also a handy way to backup file metadata, like mtime, without having to design a file format for mapping saved mtimes back to their host files.
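
(Concretely, something like the following; "bigdir" is a placeholder and the filesystem has to support reflinks, e.g. btrfs or XFS:)

    cp -a --reflink=always bigdir bigdir.backup   # near-instant: shares data blocks, -a preserves mtimes/ownership
    # ...run the potentially destructive operation on bigdir...
    # if it goes wrong: rm -rf bigdir && mv bigdir.backup bigdir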

16. kzrdude ◴[] No.44467505{4}[source]
there was the "old dog new tricks" xfs talk long time ago, but I suppose it was for fun and exploration and not really a sneak peek into snapshots
17. anonnon ◴[] No.44467606{3}[source]
> CoW is an efficiency gain.

You're thinking of the optimization technique of CoW, as in what Linux does when spawning a new thread or forking a process. I'm talking about it in the context of only ever modifying copies of file system data and metadata blocks, for the purpose of ensuring file system integrity, even in the context of sudden power loss (EDIT: wrong link): https://www.qnx.com/developers/docs/8.0/com.qnx.doc.neutrino...

If anything, ordinary file IO is likely to be slightly slower on a CoW file system, since it always has to copy a block before the block can be modified, and then update the block pointers.

18. throw0101d ◴[] No.44467639{3}[source]
> Does it do anything to ensure data integrity, like journaling does?

What kind of journaling though? By default ext4 only uses journaling for metadata updates, not data updates (see "ordered" mode in ext4(5)).

So if you have a (e.g.) 1000MB file, and you update 200MB in the middle of it, you can have a situation where the first 100MB is written out and the system dies with the other 100MB vanishing.

With a CoW, if the second 100MB is not written out and the file sync'd, then on system recovery you're back to the original file being completely intact. With ext4 in the default configuration you have a file that has both new-100MB and stale-100MB in the middle of it.

The updating of the file data and the metadata are two separate steps (by default) in ext4:

* https://www.baeldung.com/linux/ext-journal-modes

* https://michael.kjorling.se/blog/2024/ext4-defaulting-to-dat...

* https://fy.blackhats.net.au/blog/2024-08-13-linux-filesystem...

Whereas with a proper CoW (like ZFS), updates are ACID.
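
(For anyone who wants the stronger behaviour on ext4, a sketch; /dev/sda2 is a placeholder, and full data journaling costs write performance:)

    tune2fs -l /dev/sda2 | grep -i 'mount options'   # show the default mount options
    tune2fs -o journal_data /dev/sda2                # make data=journal the default for this filesystem
    # or per mount, e.g. in /etc/fstab:
    #   /dev/sda2  /home  ext4  data=journal  0  2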

replies(1): >>44474604 #
19. riobard ◴[] No.44467710[source]
APFS checksums only fs metadata, not user data, which is a PITA. Presumably because APFS is used on single-drive systems and there's no redundancy to recover from anyway. Still, not ideal.
replies(1): >>44467765 #
20. vbezhenar ◴[] No.44467765{3}[source]
Apple trusts their hardware to do its own checksums properly. Modern SSDs use checksums and parity codes for blocks. SATA/NVMe include checksums for protocol frames. The only unreliable component is RAM, but FS checksums can't help there, because a RAM bit will likely be flipped before the checksum is calculated or after it is verified.
replies(3): >>44468024 #>>44468035 #>>44468119 #
21. ars ◴[] No.44467818{3}[source]
> The likelihood .. of this occurring

That's one unrecoverable read error per 10^14 bits for a consumer drive, which is just ~12.5TB. A heavy user (lots of videos or games) would see a bit flip a couple of times a year.

replies(3): >>44468204 #>>44469358 #>>44469681 #
22. tbrownaw ◴[] No.44467926[source]
> The extended (not extant) family (including ext4)

I read that more as "we have filesystems at home, and also get off my lawn".

23. riobard ◴[] No.44468024{4}[source]
If they really trusted their hardware, APFS wouldn't need to checksum fs metadata either, so I guess they don't trust it that much? Also, I have external drives that are not Apple-sanctioned to store files and I don't trust them enough either, and there's no option for user data checksumming at all.
replies(1): >>44470056 #
24. ◴[] No.44468035{4}[source]
25. yjftsjthsd-h ◴[] No.44468039{4}[source]
> (perhaps if you glom on dm-integrity?).

I looked at that, in hopes of being able to protect my data. Unfortunately, I considered this something of a fatal flaw:

> It uses journaling for guaranteeing write atomicity by default, which effectively halves the write speed.

- https://wiki.archlinux.org/title/Dm-integrity
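
(For anyone curious, the basic standalone setup is roughly this; /dev/sdb1 is a placeholder, and the journal responsible for that write-speed hit is on by default:)

    integritysetup format /dev/sdb1     # writes the integrity superblock/metadata (destroys existing data)
    integritysetup open /dev/sdb1 int0  # exposes /dev/mapper/int0
    mkfs.ext4 /dev/mapper/int0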

26. yjftsjthsd-h ◴[] No.44468075{3}[source]
To your first couple points: I trust hardware less than you.

> What should it do? Blindly hand you the data without any indication that there's a problem with the underlying block?

Well, that's what it does now, and I think that's a problem.

> Without an fsck what mechanism do you suppose would manage these errors as they're discovered?

Linux can fail a read, and IMHO should do so if it cannot return correct data. (I support the ability to override this and tell it to give you the corrupted data, but certainly not by default.) On ZFS, if a read fails its checksum, the OS will first try to get a valid copy (ex. from a mirror or if you've set copies=2), and then if the error can't be recovered then the file read fails and the system reports/records the failure, at which point the user should probably go do a full scrub (which for our purposes should probably count as fsck) and restore the affected file(s) from backup. (Or possibly go buy a new hard drive, depending on the extent of the problem.) I would consider that ideal.

27. londons_explore ◴[] No.44468119{4}[source]
Most SSDs can't be trusted to maintain proper data ordering in the case of a sudden power off.

That makes checksums and journals of only marginal usefulness.

I wish some review website would have a robot plug and unplug the power cable in a test rig for a few weeks and rate which SSD manufacturers are robust to this stuff.

replies(1): >>44473240 #
28. magicalhippo ◴[] No.44468204{4}[source]
I do monthly scrubs on my NAS; I have eight 14-20TB drives that are quite full.

According to that 10^14 metric I should see read errors just about every month. Except I have just about zero.

Current disks are ~4 years old, run 24/7, and excluding a bad cable incident I've had a single case of a read error (recoverable, thanks ZFS).

I suspect those URE numbers are made by the manufacturers figuring out they can be sure the disk will do 10^14, but they don't actually try to find the real number because 10^14 is good enough.

replies(2): >>44469199 #>>44474491 #
29. MertsA ◴[] No.44468243{4}[source]
You can shrink XFS, but only the realtime volume. All you need is xfs_db and a steady hand. I once had to pull this off for a shortened test program for a new server platform at Meta. Works great except some of those filesystems did somehow get this weird corruption around used space tracking that xfs_repair couldn't detect... It was mostly fine.
30. bombcar ◴[] No.44468255{3}[source]
I used reiserfs for a while after I noticed it eating data (tail packing for the power loss) but quickly switched to xfs when it became available.

Speed is sometimes more important than absolute reliability, but it’s still an undesirable tradeoff.

31. zahlman ◴[] No.44468462[source]
... NTFS does copy-on-write?

... It does hard links? After checking: It does hard links.

... Why didn't any programs I had noticeably take advantage of that?

replies(1): >>44468793 #
32. eptcyka ◴[] No.44468600[source]
Bitflips in my files? Well, there’s a high likelihood that the corruption won’t be too bad. Bit flips in the filesystem metadata? There’s a significant chance all of the data is lost.
33. anonnon ◴[] No.44468793{3}[source]
> NTFS does copy-on-write?

No, it doesn't. Maybe you're thinking of shadow volume copies or something else. CoW file systems never modify data or metadata blocks directly, only modifying copies, with the root of the updated block pointer graph only updated after all other changes have been synced. Read this: https://www.qnx.com/developers/docs/8.0/com.qnx.doc.neutrino...

replies(1): >>44469318 #
34. ars ◴[] No.44469199{5}[source]
If you are using enterprise drives those are 10^16, so that might explain it.
replies(1): >>44469334 #
35. msgodel ◴[] No.44469211[source]
Anything important should really be stored in some sort of distributed system that uses e.g. merkle trees. If the file system also did that you'd be doing it twice, which would be annoying.

Anything unimportant is really just being cached and it's probably fine if it gets corrupted.

36. zahlman ◴[] No.44469318{4}[source]
>No, it doesn't. Maybe you're thinking of shadow volume copies or something else.

I was asking because I didn't know, and I thought the other person was implying that it did.

I know what copy-on-write is.

replies(1): >>44469587 #
37. magicalhippo ◴[] No.44469334{6}[source]
Fair, the newest ones are, but two of my older current drives are 16TB IronWolfs, which are rated at 10^15 in the specs[1], and they've been running for 5.4 years. Again without any read errors, monthly scrubs, and of course daily use.

And before that I used 8x 3TB WD Reds for 6-7 years, which have 10^14 in the specs[2], and I had the same experience with those.

Yes smaller size, but I ran scrubbing on those biweekly, and over so many years?

[1]: https://www.seagate.com/files/www-content/datasheets/pdfs/ir...

[2]: https://documents.westerndigital.com/content/dam/doc-library...

38. Dylan16807 ◴[] No.44469358{4}[source]
I'm not really sure how you're supposed to interpret those error rates. The average read error probably has a lot more than 1 flipped bit, right? And if the average error affects 50 bits, then you'd expect 50x fewer errors? But I have no idea what the actual histogram looks like.
39. anonnon ◴[] No.44469587{5}[source]
The "other person" (only mention of NTFS) is me, here:

> while MSFT still treats ReFS as unstable, and Windows servers still rely heavily on NTFS.

By this I implied it's an embarrassment to MSFT that iOS devices have a better, more reliable file system (APFS) than even Windows servers now (having to rely on NTFS until ReFS is ready for prime time). If HN users and mods didn't tone-police so heavily, I could state things more frankly.

40. timewizard ◴[] No.44469681{4}[source]
Is that raw error rate or uncorrected error rate?
41. timewizard ◴[] No.44469707{4}[source]
> it can find a correct copy elsewhere.

Which implies you can already correct errors through a simple majority mechanism.

> or it could be rebuilt through ECC.

So just by having the appropriate level of RAID you automatically solve the problem. Why is this in the fs layer then?

replies(3): >>44469858 #>>44476096 #>>44476875 #
42. yjftsjthsd-h ◴[] No.44469858{5}[source]
> Which implies you can already correct errors through a simple majority mechanism.

I don't think so? You set copies=2, and the disk says that your file starts with 01010101, except that the second copy says your file starts with 01010100. How do you tell which one is right? For that matter, even with only one copy ex. ZFS can tell that what it has is wrong even if it can't fix it, and flagging the error is still useful.

> So just by having the appropriate level of RAID you automatically solve the problem. Why is this in the fs layer then?

Similarly, you shouldn't need RAID to catch problems, only (potentially) to correct them. I do agree that it doesn't necessarily have to be in the FS layer, but AFAIK Linux doesn't have any other layers that do a good job of it (as mentioned above, dm-integrity exists but halving the write speed is a pretty big problem).

replies(1): >>44470732 #
43. 1over137 ◴[] No.44470056{5}[source]
Apple does not care about your external non-Apple drives. In the slightest.
44. adrian_b ◴[] No.44470078{4}[source]
Many years ago, XFS did not support snapshots.

However, XFS has also supported snapshots for a long time now.

See for example:

https://thelinuxcode.com/xfs-snapshot/

I am not sure what you mean by "whole-volume" snapshots, but I have not noticed any restrictions in the use of the XFS snapshots. As expected, they store a snapshot of the entire file system, which can be restored later.

In many decades of managing computers with all kinds of operating systems and file systems, on a variety of servers and personal computers, I have never had the need to shrink a file system. I cannot imagine how such a need can arise, except perhaps as a consequence of bad planning.

It has also been many decades since I deprecated the use of multiple partitions on a storage device, with the exception of bootable devices, which must have a dedicated partition for booting, conforming to BIOS or UEFI expectations. For anything that was done in the ancient times with multiple partitions there are better alternatives now. With the exception of bootable USB sticks with live Linux or FreeBSD partitions, I use XFS on whole SSDs or HDDs (i.e. unpartitioned), regardless of whether they are internal or external, so there is never any need to change the size of the file system.

Even so, copying a file system to an external device, reformatting the device and copying the file system back is not likely to be significantly slower than shrinking in place. In fact sometimes it can be faster and it has the additional benefit that the new copy of the file system will be defragmented.

Much more significant than the lack of shrinking ability, which only slightly slows down something that occurs very seldom, is that both EXT4 and XFS are much faster for most applications than the other file systems available for Linux, so they are fast for the frequent operations. You may choose another file system for other reasons, but choosing it to make a very rare operation like shrinking faster is a very weak reason.

replies(1): >>44470789 #
45. timewizard ◴[] No.44470732{6}[source]
> I don't think so?

The disk is going to report an uncorrected error for one of them.

replies(1): >>44476968 #
46. CoolCold ◴[] No.44470789{5}[source]
I have definitely met several cases where support for shrinking would be beneficial - usually something around migrations and the like - but I agree it's quite a rare operation. The benefits are a smaller downtime window and/or lower expense in time and in duplicating systems.

E.g. back in ~2013-2014, while moving a bare-metal Windows server into VMware, shrinking and then optimizing the MFT helped save AFAIR 2 hours of downtime window.

> except perhaps as a consequence of bad planning

Assuming people go to clouds instead of physical servers because they may need to add 100 more nodes "suddenly" - a selling point of clouds is "avoid planning" - one may expect the cases where shrinking is needed to be rising, not falling. It can be mitigated by different approaches, of course - e.g. it's often easier to re-set up the VM - but still.

replies(1): >>44472188 #
47. adrian_b ◴[] No.44472188{6}[source]
I do not see the connection between shrinking and migrations.

In migrations you normally copy the file system elsewhere, to the cloud or to different computers; you do not shrink it in place, which is what XFS cannot do. Unlike with Windows, copying Linux file systems, including XFS, during migrations to different hardware is trivial and fast. The same is true for replicating a file system to a big set of computers.

Shrinking in place is normally needed only when you share a physical device between 2 different operating systems, which use incompatible file systems, e.g. Windows and Linux, and you discover that you did not partition the physical device well and you want to shrink the partition allocated to one of the operating systems, in order to be able to expand the partition allocated to the other operating system.

Sharing physical devices between Windows and any other operating systems comes with a lot of risks and disadvantages, so I strongly recommend against it. I have stopped sharing Windows disks decades ago. Now, if I want to use the same computer in Windows and in another operating system, e.g. Linux or FreeBSD, I install Windows on the internal SSD, and, when desired, I boot Linux or FreeBSD from an external SSD. Thus the problem of reallocating a shared SSD/HDD by shrinking a partition never arises.

48. Quekid5 ◴[] No.44473240{5}[source]
I'd say it makes checksums even more important so that you know whether something got corrupted immediately and not after a year (or whatever) has gone by and you actually need it.
replies(1): >>44476116 #
49. ryao ◴[] No.44474491{5}[source]
> I suspect those URE numbers are made by the manufacturers figuring out they can be sure the disk will do 10^14, but they don't actually try to find the real number because 10^14 is good enough.

I am inclined to agree. However, I have one thought to the contrary. When a mechanical drive is failing, you tend to have debris inside the drive hitting the platters, causing damage that creates more debris, accelerating the drive’s eventual death, with read errors becoming increasingly common while it happens. When those are included in averages, the 10^14 might very well be accurate. I have not done any rigorous analysis to justify this thought and I do not have the data to be able to do that analysis. It is just something that occurs to me that might justify the 10^14 figure.

50. ryao ◴[] No.44474557{3}[source]
In what way do you consider CoW to be an efficiency gain? Traditionally, it is considered more expensive due to write amplification. In place filesystems such as XFS tend to be more efficient in terms of IOPs and CoW filesystems need to do many tricks to be close to them.

As for ensuring data integrity, I cannot talk about other CoW filesystems, but ZFS has an atomic transaction commit that relies on CoW. In ZFS, your changes either happened or they did not happen. The entire file system is a giant merkle tree and every change requires that all nodes of the tree up to the root be rewritten. To minimize the penalty of CoW, these changes are aggregated into transaction groups that are then committed atomically. Thus, you simultaneously have both the old and new versions available, plus possibly more than just 1 old version. ZFS will start recycling space after a couple transaction group commits, but often, you can go further back in its history if needed after some catastrophic event, although ZFS makes no solid guarantee of this (until you fiddle with module parameter settings to prevent reclaim from being so aggressive).

If it counts for anything, I have hundreds of commits in OpenZFS, so I am fairly familiar with how ZFS works internally.

51. ryao ◴[] No.44474604{4}[source]
Large file writes are an exception in ZFS. They are broken into multiple transactions, which can go into multiple transaction groups, such that the updates are not ACID. You can see this in the code here:

https://github.com/openzfs/zfs/blob/6af8db61b1ea489ade2d5344...

Small writes on ZFS are ACID. If ZFS made large writes ACID, large writes could block the transaction group commit for arbitrarily long periods, which is why it does not. Just imagine writing a 1PB file. It would likely take a long time (days?) and it is just not reasonable to block the transaction group commit until it finishes.

That said, for your example, you will often have all of the writes go into the same transaction group commit, such that it becomes ACID, but this is not a strict guarantee. The maximum atomic write size on ZFS is 32MB, assuming alignment. If the write is not aligned to the record size, it will be smaller, as per:

https://github.com/openzfs/zfs/blob/6af8db61b1ea489ade2d5344...

52. shtripok ◴[] No.44476096{5}[source]
Let's invert your question: why should RAID be a separate layer at all?
53. londons_explore ◴[] No.44476116{6}[source]
The problem is that if the SSD suffers a power failure and reverts a 1 megabyte block of metadata to the way it was yesterday, the filesystem won't see that as corruption - since all the checksums will match.

Yet all the pointers in that metadata will point to data which no longer exists, and your filesystem will be destroyed.

54. shtripok ◴[] No.44476293{5}[source]
On some of my zfs servers, the number of snapshots (mostly periodic, rotated — hour, day, month, updates, data maintenance work) is 10-12 thousand. LVM can't do that.
55. throw0101d ◴[] No.44476875{5}[source]
> Why is this in the fs layer then?

Define "fs layer". ZFS has multiple layers with-in it:

The "file system" that most people interact with (for things like homedirs) is actually a layer with-in ZFS' architecture, and is called the ZFS POSIX layer (ZPL). It exposes a POSIX file system, and take the 'tradition' Unix calls and creates objects. Those objects are passed to the Data Management Unit (DMU), which then passed them down to Storage Pool Allocator (SPA) layer which actually manages the striping, redundancy, etc.

* https://ibug.io/blog/2023/10/zfs-block-size/

There was a bit of a 'joke' back in the day about ZFS being a "layering violation" because it subsumed into itself RAID, volume management, and a file system, instead of having each in a separate software package:

* https://web.archive.org/web/20070508214221/https://blogs.sun...

* https://lildude.co.uk/zfs-rampant-layering-violation

The ZPL is not used all the time: one can create a block device ("zvol") and put swap or iSCSI on it. The Lustre folks have their own layer that hooks into the DMU and doesn't bother with POSIX semantics:

* https://wiki.lustre.org/ZFS_OSD_Hardware_Considerations

* https://www.eofs.eu/wp-content/uploads/2024/02/21_andreas_di...
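
As a concrete example of the zvol path mentioned above (a sketch assuming a pool named "tank"):

    zfs create -V 8G tank/swapvol      # block device; the ZPL/POSIX layer is not involved
    mkswap /dev/zvol/tank/swapvol
    swapon /dev/zvol/tank/swapvol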

56. throw0101d ◴[] No.44476968{7}[source]
> The disk is going to report an uncorrected error for one of them.

Empirical evidence has shown otherwise: I have regularly gotten checksum error reports that ZFS has complained about during a scrub.

The ZFS developers have said in interviews that disks, when asked for LBA 123, have returned the contents of LBA 234 (due to disk firmware bugs): the on-disk checksum for 234 is correct, and so the bits were passed up the stack, but that's not the data that the kernel/ZFS asked for. It is only by verifying at the file system layer that the problem was caught (because at the disk layer things were "fine").

A famous paper that used Google's large quantity of drives as a 'sample population' mentions file system-level checks:

* https://www.cs.toronto.edu/~bianca/papers/fast08.pdf

See also the Google File System paper (§5.2 Data Integrity):

* https://research.google/pubs/the-google-file-system/

Trusting drives is not wise.