144 points ksec | 17 comments
msgodel ◴[] No.44466535[source]
The older I get the more I feel like anything other than the ExtantFS family is just silly.

The filesystem should do files; if you want something more complex, do it in userspace. We even have FUSE if you want to use the Filesystem API with your crazy network database thing.

replies(3): >>44466685 #>>44466895 #>>44467306 #
yjftsjthsd-h ◴[] No.44466895[source]
I mean, I'd really like some sort of data error detection (and ideally correction). If a disk bitflips one of my files, ext* won't do anything about it.
replies(3): >>44467338 #>>44468600 #>>44469211 #
1. timewizard ◴[] No.44467338[source]
> some sort of data error detection (and ideally correction).

That's pretty much built into most mass storage devices already.

> If a disk bitflips one of my files

The likelihood and consequence of this occurring are in many situations not worth the overhead of adding additional ECC on top of what the drive does.

> ext* won't do anything about it.

What should it do? Blindly hand you the data without any indication that there's a problem with the underlying block? Without an fsck what mechanism do you suppose would manage these errors as they're discovered?

replies(3): >>44467434 #>>44467818 #>>44468075 #
2. throw0101d ◴[] No.44467434[source]
>> > some sort of data error detection (and ideally correction).

> That's pretty much built into most mass storage devices already.

And ZFS has shown that it is not sufficient (at least for some use-cases, perhaps less of a big deal for 'residential' users).

> The likelihood and consequence of this occurring are in many situations not worth the overhead of adding additional ECC on top of what the drive does.

Not worth it to whom? Not having the option available at all is the problem. I can do zfs set checksum=off pool_name/dataset_name if I really want that extra couple of percentage points of performance.

> Without an fsck what mechanism do you suppose would manage these errors as they're discovered?

Depends on the data involved: if it's part of the file system tree metadata, there are often multiple copies even for a single disk on ZFS. So instead of the kernel consuming corrupted data and potentially panicking (or going off into the weeds) it can find a correct copy elsewhere.

If you're in a fancier configuration with some level of RAID, then there could be other copies of the data, or it could be rebuilt through ECC.

With ext*, LVM, and mdadm no such possibility exists because there are no checksums at any of those layers (perhaps if you glom on dm-integrity?).

And with ZFS one can set copies=2 on a per-dataset basis (perhaps just for /home?), and get multiple copies strewn across the disk: won't save you from a drive dying, but could save you from corruption.
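
A minimal sketch of the knobs mentioned above, assuming a pool named "tank" and datasets "tank/home" and "tank/scratch" (all placeholder names):

    zfs set checksum=off tank/scratch   # give up checksumming for a bit of throughput
    zfs set copies=2 tank/home          # keep two copies of every block, even on one disk
    zfs get checksum,copies tank/home   # confirm the current settings
    zpool scrub tank                    # walk every block and verify it against its checksum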

replies(2): >>44468039 #>>44469707 #
3. ars ◴[] No.44467818[source]
> The likelihood .. of this occurring

That's one unrecoverable error per 10^14 bits read for a consumer drive, which is only about 12.5 TB. A heavy user (lots of videos or games) would see a bit flip a couple of times a year.
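
Taking the spec at face value and assuming a heavy user reads on the order of 25 TB a year (my number, purely illustrative), the arithmetic is roughly:

    awk 'BEGIN {
        tb_per_error = 1e14 / 8 / 1e12               # ~12.5 TB read per expected error
        yearly_tb    = 25                            # assumed read volume per year
        printf "TB read per expected error: %.1f\n", tb_per_error
        printf "expected errors per year:   %.1f\n", yearly_tb / tb_per_error
    }'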

replies(3): >>44468204 #>>44469358 #>>44469681 #
4. yjftsjthsd-h ◴[] No.44468039[source]
> (perhaps if you glom on dm-integrity?).

I looked at that, in hopes of being able to protect my data. Unfortunately, I considered this something of a fatal flaw:

> It uses journaling for guaranteeing write atomicity by default, which effectively halves the write speed.

- https://wiki.archlinux.org/title/Dm-integrity
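
For reference, the stacking itself is simple; it's the journal's double write that hurts. A rough sketch with a placeholder device (this wipes it):

    integritysetup format /dev/sdX            # reserve space for per-sector checksums
    integritysetup open /dev/sdX protected    # exposes /dev/mapper/protected
    mkfs.ext4 /dev/mapper/protected           # ext4 on top now gets read-time detection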

5. yjftsjthsd-h ◴[] No.44468075[source]
To your first couple of points: I trust hardware less than you do.

> What should it do? Blindly hand you the data without any indication that there's a problem with the underlying block?

Well, that's what it does now, and I think that's a problem.

> Without an fsck what mechanism do you suppose would manage these errors as they're discovered?

Linux can fail a read, and IMHO should do so if it cannot return correct data. (I support the ability to override this and tell it to give you the corrupted data, but certainly not by default.) On ZFS, if a read fails its checksum, the OS will first try to get a valid copy (ex. from a mirror or if you've set copies=2), and then if the error can't be recovered then the file read fails and the system reports/records the failure, at which point the user should probably go do a full scrub (which for our purposes should probably count as fsck) and restore the affected file(s) from backup. (Or possibly go buy a new hard drive, depending on the extent of the problem.) I would consider that ideal.
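
Concretely, the behaviour described above looks something like this on ZFS (pool and file names are hypothetical):

    cat /tank/home/photo.jpg
    # cat: /tank/home/photo.jpg: Input/output error   (unrecoverable checksum failure)
    zpool status -v tank    # shows per-device error counters and lists the affected files
    zpool scrub tank        # full verification pass; then restore the files from backup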

6. magicalhippo ◴[] No.44468204[source]
I do monthly scrubs on my NAS; I have 8 drives of 14-20 TB that are quite full.

According to that 10^14 figure I should see read errors just about every month. Except I see just about zero.

The current disks are ~4 years old, run 24/7, and excluding a bad cable incident I've had a single read error (recoverable, thanks to ZFS).

I suspect those URE numbers come from manufacturers determining that they can be sure the disk will do 10^14; they don't actually try to find the real number because 10^14 is good enough.
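
Back-of-the-envelope, using roughly those numbers (8 drives, call it ~16 TB read from each per monthly scrub, and the rated 10^14 bits per URE):

    awk 'BEGIN {
        bits_per_scrub = 8 * 16e12 * 8               # 8 drives x ~16 TB each, in bits
        printf "expected UREs per scrub: %.0f\n", bits_per_scrub / 1e14
    }'

which predicts on the order of ten unrecoverable errors every month, not near zero.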

replies(2): >>44469199 #>>44474491 #
7. ars ◴[] No.44469199{3}[source]
If you are using enterprise drives, those are rated 10^16, so that might explain it.
replies(1): >>44469334 #
8. magicalhippo ◴[] No.44469334{4}[source]
Fair, the newest ones are, but two of my older current drives are 16TB IronWolfs, which are rated 10^15 in the specs[1], and they've been running for 5.4 years. Again without any read errors, despite monthly scrubs and of course daily use.

And before that I used 8x 3TB WD Reds for 6-7 years, which are rated 10^14 in the specs[2], and had the same experience with those.

Yes, those were smaller, but I scrubbed them biweekly, and over so many years?

[1]: https://www.seagate.com/files/www-content/datasheets/pdfs/ir...

[2]: https://documents.westerndigital.com/content/dam/doc-library...

9. Dylan16807 ◴[] No.44469358[source]
I'm not really sure how you're supposed to interpret those error rates. The average read error probably has a lot more than 1 flipped bit, right? And if the average error affects 50 bits, then you'd expect 50x fewer errors? But I have no idea what the actual histogram looks like.
10. timewizard ◴[] No.44469681[source]
Is that raw error rate or uncorrected error rate?
11. timewizard ◴[] No.44469707[source]
> it can find a correct copy elsewhere.

Which implies you can already correct errors through a simple majority mechanism.

> or it could be rebuilt through ECC.

So just by having the appropriate level of RAID you automatically solve the problem. Why is this in the fs layer then?

replies(3): >>44469858 #>>44476096 #>>44476875 #
12. yjftsjthsd-h ◴[] No.44469858{3}[source]
> Which implies you can already correct errors through a simple majority mechanism.

I don't think so? You set copies=2, and the disk says that your file starts with 01010101, except that the second copy says your file starts with 01010100. How do you tell which one is right? For that matter, even with only one copy, ZFS for example can tell that what it has is wrong even if it can't fix it, and flagging the error is still useful.

> So just by having the appropriate level of RAID you automatically solve the problem. Why is this in the fs layer then?

Similarly, you shouldn't need RAID to catch problems, only (potentially) to correct them. I do agree that it doesn't necessarily have to be in the FS layer, but AFAIK Linux doesn't have any other layers that do a good job of it (as mentioned above, dm-integrity exists but halving the write speed is a pretty big problem).
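
A toy illustration of the two-copies point, with nothing stored besides the data itself:

    printf '01010101' > copy_a    # two on-disk copies of the "same" block...
    printf '01010100' > copy_b    # ...that no longer agree
    sha256sum copy_a copy_b       # the hashes differ, but with no independently stored
                                  # checksum there is no way to say which copy is original

ZFS sidesteps this by storing each block's checksum in its (itself checksummed) parent, so it can pick whichever copy matches and rewrite the other.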

replies(1): >>44470732 #
13. timewizard ◴[] No.44470732{4}[source]
> I don't think so?

The disk is going to report an uncorrected error for one of them.

replies(1): >>44476968 #
14. ryao ◴[] No.44474491{3}[source]
> I suspect those URE numbers come from manufacturers determining that they can be sure the disk will do 10^14; they don't actually try to find the real number because 10^14 is good enough.

I am inclined to agree. However, I have one thought to the contrary. When a mechanical drive is failing, you tend to have debris inside the drive hitting the platters, causing damage that creates more debris, accelerating the drive’s eventual death, with read errors becoming increasingly common while it happens. When those are included in averages, the 10^14 might very well be accurate. I have not done any rigorous analysis to justify this thought and I do not have the data to be able to do that analysis. It is just something that occurs to me that might justify the 10^14 figure.

15. shtripok ◴[] No.44476096{3}[source]
Let's invert your question: why should RAID be a separate layer at all?
16. throw0101d ◴[] No.44476875{3}[source]
> Why is this in the fs layer then?

Define "fs layer". ZFS has multiple layers with-in it:

The "file system" that most people interact with (for things like homedirs) is actually a layer with-in ZFS' architecture, and is called the ZFS POSIX layer (ZPL). It exposes a POSIX file system, and take the 'tradition' Unix calls and creates objects. Those objects are passed to the Data Management Unit (DMU), which then passed them down to Storage Pool Allocator (SPA) layer which actually manages the striping, redundancy, etc.

* https://ibug.io/blog/2023/10/zfs-block-size/

There was a bit of a 'joke' back in the day about ZFS being a "layering violation" because it subsumed RAID, volume management, and the file system into itself, instead of having each in a separate software package:

* https://web.archive.org/web/20070508214221/https://blogs.sun...

* https://lildude.co.uk/zfs-rampant-layering-violation

The ZPL is not used all the time: one can create a block device ("zvol") and put swap or iSCSI on it. The Lustre folks have their own layer that hooks into the DMU and doesn't bother with POSIX semantics:

* https://wiki.lustre.org/ZFS_OSD_Hardware_Considerations

* https://www.eofs.eu/wp-content/uploads/2024/02/21_andreas_di...
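
As an illustration of bypassing the ZPL, a zvol is just a block device carved straight out of the pool (names are placeholders); putting swap on it, for example:

    zfs create -V 16G tank/swapvol    # block device backed by the DMU/SPA, no POSIX layer
    mkswap /dev/zvol/tank/swapvol
    swapon /dev/zvol/tank/swapvol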

17. throw0101d ◴[] No.44476968{5}[source]
> The disk is going to report an uncorrected error for one of them.

Empirical evidence has shown otherwise: I have regularly gotten checksum error reports from ZFS during scrubs.

The ZFS developers have said in interviews that disks, when asked for LBA 123, have returned the contents of LBA 234 (due to disk firmware bugs): the on-disk checksum for 234 is correct, and so the bits were passed up the stack, but that's not the data that the kernel/ZFS asked for. It is only by verifying at the file system layer that the problem was caught (because at the disk layer things were "fine").

A famous paper that used Google's large quantity of drives as a 'sample population' mentions file system-level checks:

* https://www.cs.toronto.edu/~bianca/papers/fast08.pdf

See also the Google File System paper (§5.2 Data Integrity):

* https://research.google/pubs/the-google-file-system/

Trusting drives is not wise.