214 points ksec | 112 comments
1. betaby ◴[] No.45076609[source]
The sad part is that despite years of development, Btrfs never reached parity with ZFS. And yesterday's news: "Josef Bacik who is a long-time Btrfs developer and active co-maintainer alongside David Sterba is leaving Meta. Additionally, he's also stepping back from Linux kernel development as his primary job." See https://www.phoronix.com/news/Josef-Bacik-Leaves-Meta

There is no 'modern' ZFS-like fs in Linux nowadays.

replies(4): >>45076793 #>>45076833 #>>45078150 #>>45080011 #
2. ibgeek ◴[] No.45076793[source]
This isn’t BTRFS
replies(3): >>45076826 #>>45076870 #>>45077235 #
3. NewJazz ◴[] No.45076826[source]
Btrfs is the closest in-tree bcachefs alternative.
4. ofrzeta ◴[] No.45076833[source]
SUSE Linux Enterprise still uses Btrfs as the root FS, so it can't be that bad, right? What is Chris Mason actually doing these days? I did some googling and only found out that he was working on a tool called "rsched".
replies(4): >>45077496 #>>45078503 #>>45082115 #>>45089747 #
5. doubletwoyou ◴[] No.45076870[source]
This might not be directly about btrfs, but bcachefs, zfs, and btrfs are the only filesystems for Linux that provide modern features like transparent compression, snapshots, and CoW.

zfs is out of tree, leaving it as an unviable option for many people. This news means that bcachefs is going to be in a very weird state in-kernel, which leaves btrfs as the only other in-tree ‘modern’ filesystem.

This news about bcachefs has ramifications for the state of ‘modern’ FSes in Linux, and I’d say the news about the btrfs maintainer taking a step back is related to it.

replies(1): >>45076955 #
6. ajross ◴[] No.45076955{3}[source]
Meh. This war was stale like nine years ago. At this point the originally-beaten horse has decomposed into soil. My general reply to this is:

1. The dm layer gives you cow/snapshots for any filesystem you want already and has for more than a decade. Some implementations actually use it for clever trickery like updates, even. Anyone who has software requirements in this space (as distinct from "wants to yell on the internet about it") is very well served.

2. Compression seems silly in the modern world. Virtually everything is already compressed. To a first approximation, every byte in persistent storage anywhere in the world is in a lossy media format. And the ones that aren't are in some other cooked format. The only workloads where you see significant use of losslessly-compressible data are situations (databases) where you have app-managed storage performance (and which see little value from filesystem choice) or ones (software building, data science, ML training) where there are lots of ephemeral intermediate files being produced. And again, those are usages where fancy filesystems are poorly deployed; you're going to throw it all away within hours to days anyway.

Filesystems are a solved problem. If ZFS disappeared from the world today... really who would even care? Only those of us still around trying to shout on the internet.

replies(8): >>45076983 #>>45077056 #>>45077104 #>>45077510 #>>45077740 #>>45077819 #>>45078472 #>>45080577 #
7. anon-3988 ◴[] No.45076983{4}[source]
> Filesystems are a solved problem. If ZFS disappeared from the world today... really who would even care? Only those of us still around trying to shout on the internet.

Yeah nah, have you tried processing terabytes of data every day and storing them? It gets better now with DDR5 but bit flips do actually happen.

replies(3): >>45077066 #>>45077162 #>>45077439 #
8. pdimitar ◴[] No.45077056{4}[source]
> The dm layer gives you cow/snapshots for any filesystem you want already and has for more than a decade. Some implementations actually use it for clever trickery like updates, even.

O_o

Apparently I've been living under a rock, can you please show us a link about this? I was just recently (casually) looking into bolting on ZFS/BTRFS-like partial snapshot features to simulate my own atomic distro, where I am able to freely roll back if an update goes bad. Think Linux's Timeshift with a little something extra.

replies(3): >>45077099 #>>45077128 #>>45077741 #
9. ◴[] No.45077066{5}[source]
10. tux3 ◴[] No.45077099{5}[source]
There are downsides to adding features in layers, as opposed to integrating them with the FS, but dm can do quite a lot:

https://docs.kernel.org/admin-guide/device-mapper/snapshot.h...
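
For the curious, a minimal sketch of what the raw dm snapshot targets from that doc look like when driven by hand (device names are placeholders; in practice LVM2 wires this up for you):

  # assumptions: /dev/vg/origin is the volume to snapshot,
  # /dev/vg/cow is a spare volume that will hold the copied-on-write chunks
  SIZE=$(blockdev --getsz /dev/vg/origin)   # size in 512-byte sectors

  # expose the origin through a snapshot-origin target
  dmsetup create base --table "0 $SIZE snapshot-origin /dev/vg/origin"

  # the snapshot itself (P = persistent, 8 = chunk size in sectors)
  dmsetup create snap --table "0 $SIZE snapshot /dev/vg/origin /dev/vg/cow P 8"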

11. doubletwoyou ◴[] No.45077104{4}[source]
I know my own personal anecdote isn’t much, but I’ve noticed pretty good space savings on the order of like 100 GB from zstd compression and CoW on my personal disks with btrfs

As for the snapshots, things like LVM snapshots are pretty coarse, especially for someone like me who runs dm-crypt on top of LVM.

I’d say zfs would be pretty well missed with its data integrity features. I’ve heard that btrfs is worse in that aspect, so given that btrfs saved my bacon with a dying ssd, I can only imagine what zfs does.

12. ◴[] No.45077128{5}[source]
13. bombcar ◴[] No.45077162{5}[source]
Bit flips can happen, and if it’s a problem you should have additional verification above the filesystem layer, even if using ZFS.

And maybe below it.

And backups.

Backups make a lot of this minor.

replies(1): >>45077286 #
14. zozbot234 ◴[] No.45077235[source]
Does btrfs still eat your data if you try to use its included RAID featureset? Does it still break in a major way if you're close to running out of disk space? What I'm seeing is that most major Linux distributions still default to non-btrfs options for their default install, generally ext4.
replies(1): >>45077351 #
15. toast0 ◴[] No.45077286{6}[source]
Backups are great, but don't help much if you back up corrupted data.

You can certainly add verification above and below your filesystem, but the filesystem seems like a good layer to have verification. Capturing a checksum while writing and verifying it while reading seems appropriate; zfs scrub is a convenient way to check everything on a regular basis. Personally, my data feels important enough to make that level of effort, but not important enough to do anything else.
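
Concretely, that regular check is just the following (pool name is a placeholder):

  # re-read every allocated block and verify its checksum
  zpool scrub tank

  # afterwards, see progress and any files with unrecoverable errors
  zpool status -v tank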

replies(1): >>45077563 #
16. skibbityboop ◴[] No.45077351{3}[source]
Anecdotal but btrfs is the only filesystem I've lost data with (and it wasn't in a RAID configuration). That combined with the btrfs tools being the most aggressively bad management utilities out there* ensure that I'm staying with ext4/xfs/zfs for now.

*Coming from the extremely well thought out and documented zfs utilities to btrfs will have you wondering wtf fairly frequently while you learn your way around.

17. ajross ◴[] No.45077439{5}[source]
And once more, you're positing the lack of a feature that is available and very robust (c.f. "yell on the internet" vs. "discuss solutions to a problem"). You don't need your filesystem to integrate checksumming when dm/lvm already do it for you.
replies(2): >>45078432 #>>45079256 #
18. dmm ◴[] No.45077496[source]
btrfs is fine for single disks or mirrors. In my experience, the main advantages of zfs over btrfs is that ZFS has production ready raid5/6 like parity modes and has much better performance for small sync writes, which are common for databases and hosting VM images.
replies(2): >>45078384 #>>45084324 #
19. fluidcruft ◴[] No.45077510{4}[source]
One feature I like about ZFS and have not seen elsewhere is that you can have each filesystem within the pool use its own encryption keys but more importantly all of the pool's data integrity and maintenance protection (scrubs, migrations, etc) work with filesystems in their encrypted state. So you can boot up the full system and then unlock and access projects only as needed.

The dm stuff is one key for the entire partition and you can't check it for bitrot or repair it without the key.
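
A rough sketch of that workflow, assuming a pool named tank and one dataset per project:

  # each project gets its own dataset and its own key
  zfs create -o encryption=on -o keyformat=passphrase tank/projectA

  # a scrub still verifies the dataset while it stays locked
  zpool scrub tank

  # unlock and mount only when the project is actually needed
  zfs load-key tank/projectA
  zfs mount tank/projectA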

20. ajross ◴[] No.45077563{7}[source]
FWIW, framed the way you do, I'd say the block device layer would be an *even better* place for that validation, no?

> Personally, my data feels important enough to make that level of effort, but not important enough to do anything else.

OMG. Backups! You need backups! Worry about polishing your geek cred once your data is on physically separate storage. Seriously, this is not a technology choice problem. Go to Amazon and buy an exfat stick, whatever. By far the most important thing you're ever going to do for your data is Back. It. Up.

Filesystem choice is, and I repeat, very much a yell-on-the-internet kind of thing. It makes you feel smart on HN. Backups to junky Chinese flash sticks are what are going to save you from losing data.

replies(2): >>45077728 #>>45078612 #
21. tptacek ◴[] No.45077728{8}[source]
Ok I think you're making a well-considered and interesting argument about devicemapper vs. feature-ful filesystems but you're also kind of personalizing this a bit. I want to read more technical stuff on this thread and less about geek cred and yelling. :)

I wouldn't comment but I feel like I'm naturally on your side of the argument and want to see it articulated well.

replies(1): >>45078215 #
22. dilyevsky ◴[] No.45077740{4}[source]
The other thing dm/lvm gives you is dogshit performance
23. tptacek ◴[] No.45077741{5}[source]
DM has targets that facilitate block-level snapshots, lazy cloning of filesystems, compression, &c. Most people interact with those features through LVM2. COW snapshots are basically the marquee feature of LVM2.
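
For example, a classic LVM2 snapshot-before-update with rollback looks roughly like this (VG/LV names are placeholders):

  # take a 5G copy-on-write snapshot of the root LV before an update
  lvcreate --snapshot --size 5G --name root_pre_update vg0/root

  # if the update goes bad, merge the snapshot back into the origin
  # (the merge completes on the next activation/reboot if root is in use)
  lvconvert --merge vg0/root_pre_update
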
24. ThatPlayer ◴[] No.45077819{4}[source]
For me bcachefs provides a feature no other filesystem on Linux has: automated tiered storage. I've wanted this ever since I got an SSD more than 10 years ago, but filesystems move slow.

A block level cache like bcache (not fs) or dm-cache handles it less ideally, and doesn't leave the SSD space as usable space. As a home user, 2TB of SSDs is 2TB of space I'd rather have. ZFS's ZIL is similar, not leaving it as usable space. Btrfs has some recent work on differentiating drives to store metadata on the faster ones (allocator hints), but that only covers metadata, as there is no handling of moving data to HDDs over time. Even Microsoft's ReFS does tiered storage, I believe.

I just want to have 1 or 2 SSDs, with 1 or 2 HDDs in a single filesystem that gets the advantages of SSDs with recently used files and new writes, and moves all the LRU files to the HDDs. And probably keep all the metadata on the SSDs too.
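
A sketch of what that looks like with bcachefs, as I understand its manual (device paths and labels are placeholders):

  # hot data and new writes land on the SSD; cold data migrates to the
  # HDD in the background, and every device still counts as usable space
  bcachefs format \
    --label=ssd.ssd1 /dev/nvme0n1 \
    --label=hdd.hdd1 /dev/sda \
    --foreground_target=ssd \
    --promote_target=ssd \
    --background_target=hdd

  mount -t bcachefs /dev/nvme0n1:/dev/sda /mnt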

replies(1): >>45081671 #
25. jakebasile ◴[] No.45078150[source]
I just use ZFS. Canonical ships it and that's good enough for me on my personal machines.
26. ajross ◴[] No.45078215{9}[source]
I didn't really think it was that bad? But sure, point taken.

My goal was actually the same though: to try to short-circuit the inevitable platform flame by calling it out explicitly and pointing out that the technical details are sort of a solved problem.

ZFS argumentation gets exhausting, and has ever since it was released. It ends up as a proxy for Sun vs. Linux, GNU vs. BSD, Apple vs. Google, hippy free software vs. corporate open source, pick your side. Everyone has an opinion, everyone thinks it's crucially important, and as a result of that hyperbole everyone ends up thinking that ZFS (dtrace gets a lot of the same treatment) is some kind of magically irreplaceable technology.

And... it's really not. Like I said above if it disappeared from the universe and everyone had to use dm/lvm for the actual problems they need to solve with storage management[1], no one would really care.

[1] Itself an increasingly vanishing problem area! I mean, at scale and at the performance limit, virtually everything lives behind a cloud-adjacent API barrier these days, and the backends there worry much more about driver and hardware complexity than they do about mere "filesystems". Dithering about individual files on individual systems in the professional world is mostly limited to optimizing boot and update time on client OSes. And outside the professional world it's a bunch of us nerds trying to optimize our movie collections on local networks; realistically we could be doing that on something as awful as NTFS if we had to.

replies(1): >>45078422 #
27. riku_iki ◴[] No.45078384{3}[source]
> has much better performance for small sync writes

I spent some time researching this topic, and in all benchmarks I've seen and my personal tests btrfs is faster or much faster: https://www.reddit.com/r/zfs/comments/1i3yjpt/very_poor_perf...

replies(1): >>45083828 #
28. nh2 ◴[] No.45078422{10}[source]
How can I, with dm/lvm:

* For some detected corruption, be told directly which files are affected?

* Get filesystem level snapshots that are guaranteed to be consistent in the way ZFS and CephFS snapshots guarantee?

replies(1): >>45078527 #
29. yjftsjthsd-h ◴[] No.45078432{6}[source]
> You don't need your filesystem to integrate checksumming when dm/lvm already do it for you.

https://wiki.archlinux.org/title/Dm-integrity

> It uses journaling for guaranteeing write atomicity by default, which effectively halves the write speed

I'd really rather not do that, thanks.
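
For reference, standalone dm-integrity is driven by integritysetup (from the cryptsetup project); a rough sketch, where the no-journal flag trades that write-speed hit for weaker crash guarantees:

  # lay down integrity metadata (crc32c tags by default)
  integritysetup format /dev/sdb1

  # map it; skip the journal if you accept torn writes after a crash
  integritysetup open --integrity-no-journal /dev/sdb1 sdb1_int
  mkfs.ext4 /dev/mapper/sdb1_int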

30. yjftsjthsd-h ◴[] No.45078472{4}[source]
> Compression seems silly in the modern world. Virtually everything is already compressed.

IIRC my laptop's zpool has a 1.2x compression ratio; it's worth doing. At a previous job, we had over a petabyte of postgres on ZFS and saved real money with compression. Hilariously, on some servers we also improved performance because ZFS could decompress reads faster than the disk could read.
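
For anyone curious about their own numbers, it's roughly (dataset name is a placeholder):

  # enable transparent compression on a dataset
  zfs set compression=zstd tank/pgdata

  # see what it's actually buying you
  zfs get compressratio,compression tank/pgdata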

replies(3): >>45080507 #>>45080922 #>>45081733 #
31. yjftsjthsd-h ◴[] No.45078503[source]
I used btrfs a few years ago, on OpenSUSE, because I also thought that would work, and it was on a single disk. It lost my root filesystem twice.
replies(1): >>45083436 #
32. ajross ◴[] No.45078527{11}[source]
On urging from tptacek I'll take that seriously and not as flame:

1. This is misunderstanding how device corruption works. It's not and can't ever be limited to "files". (Among other things: you can lose whole trees if a directory gets clobbered, you'd never even be able to enumerate the "corrupted files" at all!). All you know (all you can know) is that you got a success and that means the relevant data and metadata matched the checksums computed at write time. And that property is no different with dm. But if you want to know a subset of the damage just read the stderr from tar, or your kernel logs, etc...

2. Metadata robustness in the face of inconsistent updates (e.g. power loss!) is a feature provided by all modern filesystems, and ZFS is no more or less robust than ext4 et al. But all such filesystems (ZFS included) will "lose data" that hadn't been fully flushed. Applications that are sensitive to that sort of thing must (!) handle this by having some level of "transaction" checkpointing (i.e. a fsync call). ZFS does absolutely nothing to fix this for you. What is true is that an unsynchronized snapshot looks like "power loss" at the dm level where it doesn't in ZFS. But... that's not useful for anyone who actually cares about data integrity, because you still have to solve the power loss problem. And solving the power loss problem obviates the need for ZFS.

replies(1): >>45078904 #
33. toast0 ◴[] No.45078612{8}[source]
I appreciate the argument. I do have backups. Zfs makes it easy to send snapshots, and so I do.

But I don't usually verify the backups, so there's that. And everything is in the same zip code for the most part, so one big disaster and I'll lose everything. C'est la vie.

replies(1): >>45082193 #
34. koverstreet ◴[] No.45078904{12}[source]
1 - you absolutely can and should walk reverse mappings in the filesystem so that from a corrupt block you can tell the user which file was corrupted.

In the future bcachefs will be rolling out auxiliary dirent indices for a variety of purposes, and one of those will be to give you a list of files that have had errors detected by e.g. scrub (we already generally tell you the affected filename in error messages)

2 - No, metadata robustness absolutely varies across filesystems.

From what I've seen, ext4 and bcachefs are the gold standard here; both can recover from basically arbitrary corruption and have no single points of failure.

Other filesystems do have single points of failure (notably btree roots), and btrfs and I believe ZFS are painfully vulnerable to devices with broken flush handling. You can (and should) blame the device and the shitty manufacturers, but from the perspective of a filesystem developer, we should be able to cope with that without losing the entire filesystem.

XFS is quite a bit better than btrfs, and I believe ZFS, because they have a ton of ways to reconstruct from redundant metadata if they lose a btree root, but it's still possible to lose the entire filesystem if you're very, very unlucky.

On a modern filesystem that uses b-trees, you really need a way of repairing from lost b-tree roots if you want your filesystem to be bulletproof. btrfs has 'dup' mode, but that doesn't mean much on SSDs given that you have no control over whether your replicas get written to the same erase unit.

Reiserfs actually had the right idea - btree node scan, and reconstruct your interior nodes if necessary. But they gave that approach a bad name; for a long time it was a crutch for a buggy b-tree implementation, and they didn't seed a filesystem specific UUID into the btree node magic number like bcachefs does, so it could famously merge a filesystem from a disk image with the host filesystem.

bcachefs got that part right, and also has per-device bitmaps in the superblock for 'this range of the device has btree nodes' so it's actually practical even if you've got a massive filesystem on spinning rust - and it was introduced long after the b-tree implementation was widely deployed and bulletproof.

replies(2): >>45079266 #>>45079679 #
35. khimaros ◴[] No.45079256{6}[source]
i'm not one for internet arguments and really just want solutions. maybe you could point me at the details for a setup that worked for you?

based on my own testing, dm has a lot of footguns and, with some kernels, as little as 100 bytes of corruption to the underlying disk could render a dm-integrity volume completely unusable (requiring a full rebuild) https://github.com/khimaros/raid-explorations

replies(1): >>45082039 #
36. magicalhippo ◴[] No.45079266{13}[source]
> XFS is quite a bit better than btrfs, and I believe ZFS, because they have a ton of ways to reconstruct from redundant metadata if they lose a btree root

As I understand it ZFS also has a lot of redundant metadata (copies=3 on anything important), and also previous uberblocks[1].

In what way is XFS better? Genuine question, not really familiar with XFS.

[1]: https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSMetadata...

replies(1): >>45079344 #
37. koverstreet ◴[] No.45079344{14}[source]
I can't speak with any authority on ZFS, I know its structure the least out of all the major filesystems.

I do a ton of reading through forums gathering user input, and lots of people chime in with stories of lost filesystems. I've seen reports of lost filesystems with ZFS and I want to say I've seen them at around the same frequency of XFS; both are very rare.

My concern with ZFS is that they seem to have taken the same "no traditional fsck" approach as btrfs, favoring entirely online repair. That's obviously where we all want to be, but that's very hard to get right, and it's been my experience that if you prioritize that too much you miss the "disaster recovery" scenarios, and that seems to be what's happened with ZFS; I've read that if your ZFS filesystem is toast you need to send it to a data recovery service.

That's not something I would consider acceptable, fsck ought to be able to do anything a data recovery service would do, and for bcachefs it does.

I know the XFS folks have put a ton of outright paranoia into repair, including full on disaster recovery scenarios. It can't repair in scenarios where bcachefs can - but on the other hand, XFS has tricks that bcachefs doesn't, so I can't call bcachefs unequivocally better; we'd need to wait for more widespread usage and a lot more data.

replies(1): >>45082904 #
38. ◴[] No.45079679{13}[source]
39. tw04 ◴[] No.45080011[source]
There's literally ZFS-on-Linux and it works great. And yes, I will once again say Linus is completely wrong about ZFS; from the multiple times he's spoken about it, it's abundantly clear he's never used it or bothered to spend any time researching its features and functionality.

https://zfsonlinux.org/

replies(5): >>45080040 #>>45080220 #>>45081040 #>>45082703 #>>45084105 #
40. evanjrowley ◴[] No.45080040[source]
Sometimes I wonder how someone so talented could be so wrong about ZFS, and it makes me wonder if his negative responses to ZFS discussions could be a way of creating plausible deniability in case Oracle's lawyers ever learn how to spell ZFS.
replies(4): >>45080084 #>>45082153 #>>45082326 #>>45083316 #
41. wmf ◴[] No.45080084{3}[source]
If Linus has never touched ZFS that's not plausible deniability. That's actual deniability.
replies(1): >>45080109 #
42. MathMonkeyMan ◴[] No.45080109{4}[source]
The most plausible deniability!
43. koverstreet ◴[] No.45080220[source]
ZFS deserves an absolutely legendary amount of respect for showing us all what a modern filesystem should be - the papers they wrote, alone, did the entire filesystem world such a massive service by demonstrating the possibilities of full data integrity and why we want it, and then they showed it could be done.

But there's a ton of room for improvement beyond what ZFS did. ZFS was a very conservative design in a lot of ways (rightly so! so many ambitious projects die because of second system syndrome); notably, it's block based and doesn't do extents - extents and snapshots are a painfully difficult combination.

Took me years to figure that one out.

My hope for bcachefs has always been to be a real successor to ZFS, with better and more flexible management, better performance, and even better robustness and reliability.

Long road, but the work continues.

replies(2): >>45080462 #>>45082319 #
44. TheAceOfHearts ◴[] No.45080462{3}[source]
> But there's a ton of room for improvement beyond what ZFS did.

Say more? I can't say I've really thought that much about filesystems and I'm curious in what direction you think they could be taken if time and budget weren't an issue.

replies(2): >>45080669 #>>45081484 #
45. adzm ◴[] No.45080507{5}[source]
> we also improved performance because ZFS could decompress reads faster than the disk could read

This is my favorite side effect of compression in the right scenarios. I remember getting a huge speed up in a proprietary in-memory data structure by using LZO (or one of those fast algorithms) which outperformed memcpy, and this was already in memory so no disk io involved! And used less than a third of the memory.

46. trashface ◴[] No.45080577{4}[source]
> And the ones that aren't are in some other cooked format.

Maybe, if you never create anything. I make a lot of game art source and much of that is in uncompressed formats, like blend files and obj files; even DDS can compress, depending on the format and data, due to the mip maps inside them. Without FS compression it would be using GBs more space.

I'm not going to individually go through and micromanage file compression even with a tool. What a waste of time, let the FS do it.

47. koverstreet ◴[] No.45080669{4}[source]
that would be bcachefs :)

It's an entirely clean slate design, and I spent years taking my time on the core planning out the design; it's as close to perfect as I can make it.

The only things I can think of that I would change or add given unlimited time and budget:

- It should be written in Rust, and even better Rust + dependent types (which I suspect could be done with proc macros) for formal verification. And cap'n proto for on disk data structures (which still needs Rust improvements to be as ergonomic as it should be) would also be a really nice improvement.

- More hardening; the only other thing we're lacking is comprehensive fault injection testing of on disk errors. It's sufficiently battle hardened that it's not a major gap, but it really should happen at some point.

- There's more work to be done in bitrot prevention: data checksums really need to be plumbed all the way into the pagecache

I'm sure we'll keep discovering new small ways to harden, but nothing huge at this point.

Some highlights:

- It has more defense in depth than any filesystem I know of. It's as close to impossible to have unrecoverable data loss as I think can really be done in a practical production filesystem - short of going full immutable/append only.

- Closest realization of "filesystem as a database" that I know of

- IO path options (replication level, compression, etc.) can be set on a per file or directory basis: I'm midway through a project extending this to do some really cool stuff, basically data management is purely declarative.

- Erasure coding is much more performant than ZFS's

- Data layout is fully dynamic, meaning you can add/remove devices at will, it just does the right thing - meaning smoother device management than ZFS

- The way the repair code works, and tracking of errors we've seen - fantastic for debugability

- Debugability and introspection are second to none: long bug hunts really aren't a thing in bcachefs development because you can just see anything the system is doing

There's still lots of work to do before we're fully at parity with ZFS. Over the next year or two I should be finishing erasure coding, online fsck, failure domains, lots more management stuff... there will always be more cool projects just over the horizon

replies(5): >>45082583 #>>45083071 #>>45083570 #>>45084005 #>>45084130 #
48. pezezin ◴[] No.45080922{5}[source]
How do you get a PostgreSQL database to grow to one petabyte? The maximum table size is 32 TB o_O
replies(2): >>45081490 #>>45083233 #
49. quotemstr ◴[] No.45081040[source]
> works great.

I will not use or recommend ZFS on _any_ OS until they solve the double page cache problem. A filesystem has no business running its own damned page cache that duplicates the OS one. I don't give a damn if ZFS has a fancy eviction algorithm. ARC's patent is expired. Go port it to mainline Linux if it's really that good. Just don't make an inner platform.

replies(1): >>45082839 #
50. cyphar ◴[] No.45081484{4}[source]
You're replying to the bcachefs author, I expect his response will be fairly obvious. ;)
51. olavgg ◴[] No.45081490{6}[source]
Probably by using partitioning.
52. guenthert ◴[] No.45081671{5}[source]
> automated tiered storage. I've wanted this ever since I got an SSD more than 10 years ago, but filesystems move slow.

You were not alone. However, things have changed, namely SSDs continued to become cheaper and grew in capacity. I'd think most active data these days is on SSDs (certainly in most desktops, most servers which aren't explicit file or DB servers, and all mobile and embedded devices), with the role of spinning rust being more and more archival (if found in a system at all).

replies(2): >>45089604 #>>45090758 #
53. bionsystem ◴[] No.45081733{5}[source]
The performance gain from compression (replacing IO with compute) is not ironic; it was seen as a feature of the various NAS products that Sun (and after them Oracle) developed around ZFS.
54. justincormack ◴[] No.45082039{7}[source]
Well, the intention of the integrity targets is to preserve integrity as an explicit choice, in particular for encrypted data. You definitely need a backup strategy.
55. petre ◴[] No.45082115[source]
We use OpenSUSE and I always switch the installs to ext4. No fancy features, but it always works and doesn't lose my root fs.
56. chao- ◴[] No.45082153{3}[source]
How many years has it been since Ubuntu started shipping ZFS, purportedly in violation of whatever legal fears the kernel team has? Four years? Five years?

I obviously have nothing like inside knowledge, but I assume the reason there have not been lawsuits over this, is that whoever could bring one (would it be only Oracle?) expects there are even-odds that they would lose? Thus the risk of setting an adverse precedent isn't worth the damages they might be awarded from suing Canonical?

replies(2): >>45082236 #>>45082335 #
57. petre ◴[] No.45082193{9}[source]
What good is a backup if you can't restore it?
replies(1): >>45083405 #
58. tuna74 ◴[] No.45082236{4}[source]
It could be a long term strategy by Oracle to be able to sue IBM and other big companies distributing Linux with ZFS built in. If Oracle want people to use ZFS they can just relicense the code they have copyright on.
replies(2): >>45082342 #>>45086629 #
59. p_l ◴[] No.45082319{3}[source]
Can you explain your definition of "extent"? Because under every definition I dealt with in filesystems before, ZFS is extent based at the lower layer, and flat out object storage system (closer to S3) at upper layer.
60. p_l ◴[] No.45082326{3}[source]
Oracle lawyers know how to spell ZFS.

But Sun ensured that they can only gnash their teeth.

The source of the "license incompatibility", btw, is the same as with using GPLv3 code in the kernel - CDDL adds an extra restriction in the form of patent protections (just like Apache 2)

61. p_l ◴[] No.45082335{4}[source]
The legal issue between the Linux kernel and ZFS is that the Linux license does not allow incorporating code under licenses with more restrictions - including anything that adds protections against being sued over patented code contributed by the licensor.
replies(1): >>45083326 #
62. p_l ◴[] No.45082342{5}[source]
Oracle does not have copyright on OpenZFS code - only on the version in Solaris.

The code in OpenZFS and Solaris has diverged after Oracle closed OpenSolaris.

replies(1): >>45083147 #
63. Icathian ◴[] No.45082583{5}[source]
I happen to work at a company that uses a ton of capnp internally and this is the first time I've seen it mentioned much outside of here. Would you mind describing what about it you think would make it a good fit for something like bcachefs?
replies(1): >>45082973 #
64. mort96 ◴[] No.45082703[source]
To me, ZFS on Linux is extremely uninteresting except for the specific use case of a NAS with a bunch of drives. I don't want to deal with out-of-tree filesystems unless I absolutely have to. And even on a NAS, I would want the root partition to be ext4 or btrfs or something else that's in the kernel.
replies(1): >>45083258 #
65. IgorPartola ◴[] No.45082839{3}[source]
That’s such a weird hill to die on. It’s like refusing to drive a car because it uses head bolts instead of head studs in an engine.
66. p_l ◴[] No.45082904{15}[source]
The lack of a traditional 'fsck' is because its operation would be exactly the same as normal driver operation. The most extreme case involves a very obscure option that lets you explicitly rewind transactions to one you specify, which I've seen used to recover from a broken driver upgrade that led to filesystem corruption in ways that most fscks just barf on, including XFS'

For low-level meddling and recovery, there's a filesystem debugger that understands all parts of ZFS and can help for example identifying previous uberblock that is uncorrupted, or recovering specific data, etc.

replies(1): >>45083477 #
67. koverstreet ◴[] No.45082973{6}[source]
Cap'n proto is basically a schema language that gets you a well defined in-memory representation that's just as good as if you were writing C structs by hand (laboriously avoiding silent padding, carefully using types with well defined sizes) - without all the silent pitfalls of doing it manually in C.

It's extremely well thought out, and it's minimalist in all the right ways; I've found the features and optimizations it has to be things borne out of real experience, things you would end up building yourself in any real world system.

E.g. it gives you the ability to add new fields without breaking compatibility. That's the right way to approach forwards/backwards compatibility, and it's what I do in bcachefs and if we'd been able to just use cap'n proto it would've taken out a lot of manual fiddly work.

The only blocker to using it more widely in my own code is that it's not sufficiently ergonomic in Rust - Rust needs lenses, from Swift.

68. nullc ◴[] No.45083071{5}[source]
> - Erasure coding is much more performant than ZFS's

any plans for much lower rates than typical raid?

Increasingly, modern high density devices are having block level failures at non-trivial rates instead of, or in addition to, whole device failures. A file might be 100,000 blocks long; adding 1000 blocks of FEC would expand it 1% but add tremendous protection against block errors. And it can do so even if you have a single piece of media. It doesn't protect against device failures, sure, though without good block level protection, device level protection is dicey: hitting some block level error when down to minimal devices seems inevitable, and having to add more and more redundant devices is quite costly.

replies(1): >>45083444 #
69. messe ◴[] No.45083147{6}[source]
> The code in OpenZFS and Solaris has diverged after Oracle closed OpenSolaris.

Diverged. Not rewritten entirely.

replies(1): >>45084282 #
70. yjftsjthsd-h ◴[] No.45083233{6}[source]
Cumulative; dozens of machines with a combined database size over a PB even though each box only had like 20 TB.
71. rob_c ◴[] No.45083258{3}[source]
> the specific use case of a NAS with a bunch of drives

Aka a way bigger part of the industry than it should probably still be ;)

72. aaronmdjones ◴[] No.45083316{3}[source]
As far as I know, the license incompatibility is on the GPL side of the equation. As in, shipping a kernel with the ZoL functionality is a violation of the GPL, not the CDDL. Thus, Oracle would not be able to sue Canonical (Edit: or, rather, have any reasonable expectation of winning this battle), as they have no standing. A copyright holder of some materially significant portion of the GPL code of the kernel would have to sue Canonical for breaching the GPL by including CDDL code.

I am not a lawyer.

replies(2): >>45084996 #>>45089320 #
73. chao- ◴[] No.45083326{5}[source]
I am aware of that. I did a bad job phrasing my post, and it came off sounding more confident than I actually intended. I have two questions: (1) What are the expected consequences of a violation? (2) Why haven't any consequences occurred yet?

My understanding is that Canonical is shipping ZFS with Ubuntu. Or do I misunderstand? Has Canonical not actually done the big, bad thing of distributing the Linux kernel with ZFS? Did they find some clever just-so workaround so as to technically not be violation of the Linux kernel's license terms?

Otherwise, if Canonical has actually done the big, bad thing, who has standing to bring suit? Would the Linux Foundation sue Canonical, or would Oracle?

I ask this in all humility, and I suspect there is a chance that my questions are nonsense and I don't know enough to know why.

replies(1): >>45085742 #
74. toast0 ◴[] No.45083405{10}[source]
Well, I expect that I can restore it, and that expectation has been good enough thus far. :p
75. ◴[] No.45083436{3}[source]
76. koverstreet ◴[] No.45083444{6}[source]
It's been talked about. I've seen some interesting work to use just a normal checksum to correct single bit errors.

If there's an optimized implementation we can use in the kernel, I'd love to add it. Even on modern hardware, we do see bit corruption in the wild, it would add real value.

replies(1): >>45083475 #
77. nullc ◴[] No.45083475{7}[source]
It's pretty straight forward to use a normal checksum to correct single or even more bit errors (depending on the block size, choice of checksum, etc). Though I expect those bit errors are bus/ram, and hopefully usually transient. If there is corruption on the media, the whole block is usually going to be lost because any corruptions means that its internal block level FEC has more errors than it can fix.

I was more thinking along the lines of adding dozens or hundreds of correction blocks to a whole file, along the lines of par (though there are much faster techniques now).

replies(1): >>45083563 #
78. koverstreet ◴[] No.45083477{16}[source]
Rewinding transactions is cool. Bcachefs has that too :)

What happens on ZFS if you lose all your alloc info? Or are there other single points of failure besides the ublock in the on disk format?

replies(1): >>45084552 #
79. koverstreet ◴[] No.45083563{8}[source]
You'd think that, wouldn't you? But there are enough moving parts in the IO stack below the filesystem that we do see bit errors. I don't have enough data to do correlations and tell you likely causes, but they do happen.

I think SSDs are generally worse than spinning rust (especially enterprise grade SCSI kit); the hard drive vendors have been at this a lot longer and SSDs are massively more complicated. From the conversations I've had with SSD vendors, I don't think they've put the same level of effort into making things as bulletproof as possible yet.

replies(1): >>45083695 #
80. ZenoArrow ◴[] No.45083570{5}[source]
> Closest realization of "filesystem as a database" that I know of

More so than BFS?

https://en.m.wikipedia.org/wiki/Be_File_System

"Like its predecessor, OFS (Old Be File System, written by Benoit Schillings - formerly BFS), it includes support for extended file attributes (metadata), with indexing and querying characteristics to provide functionality similar to that of a relational database."

replies(1): >>45083752 #
81. nullc ◴[] No.45083695{9}[source]
One thing to keep in mind is that correction always comes as some expense of detection.

Generally a code that can always detect N errors can only always correct N/2 errors. So you detect an errored block, you correct up to N/2 errors. The block now passes but if the block actually had N errors, your correction will be incorrect and you now have silent corruption.

The solution to this is just to have an excess of error correction power and then don't use all of it. But that can be hard to do if you're trying to shoehorn it into an existing 32-bit crc.

How big are the blocks that the CRC units cover in bcachefs?

replies(1): >>45084024 #
82. koverstreet ◴[] No.45083752{6}[source]
What BFS did is very cool, and I hope to add that to bcachefs someday.

But I'm talking more about the internals than external database functionality; the inner workings are much more fundamental.

bcachefs internally is structured more like a relational database than a traditional Unix filesystem, where everything hangs off the inode. In bcachefs, there's an extents btree (read: table), an inodes btree, a dirents btree, and a whole bunch of others - we're up to 20 (!).

There's transactions, where you can do arbitrary lookups, updates, and then commit, with all the database locking hidden from you; lookups within a transaction see uncommitted updates from that transaction. There's triggers, which are used heavily.

We don't have the full relational model - no SELECT or JOIN, no indices on arbitrary fields like with SQL (but you can do effectively the same thing with triggers, I do it all the time).

All the database/transactional primitives make the rest of the codebase much smaller and cleaner, and make feature development a lot easier than what you'd expect in other filesystems.

replies(1): >>45090117 #
83. dmm ◴[] No.45083828{4}[source]
Thanks for sharing! I just setup a fs benchmark system and I'll run your fio command so we can compare results. I have a question about your fio args though. I think "--ioengine=sync" and "--iodepth=16" are incompatible, in the sense that iodepth will only be 1.

"Note that increasing iodepth beyond 1 will not affect synchronous ioengines"[1]

Is there a reason you used that ioengine as opposed to, for example, "libaio" with a "--direct=1" flag?

[1] https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-...
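
For what it's worth, a hedged alternative invocation along those lines (directory and sizes are arbitrary) that exercises small sync writes without the sync-engine/iodepth mismatch:

  # O_DIRECT async writes with a real queue depth, fsync after each write
  fio --name=syncwrite --directory=/mnt/test \
      --ioengine=libaio --direct=1 --iodepth=16 \
      --rw=randwrite --bs=4k --size=2G \
      --fsync=1 --runtime=60 --time_based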

replies(1): >>45084006 #
84. lifty ◴[] No.45084005{5}[source]
Thanks for bcachefs and all the hard work you’ve put in it. It’s truly appreciated and hope you can continue to march on and not give up on the in-kernel code, even if it means bowing to Linus.

On a different note, have you heard about prolly trees and structural sharing? It’s a newer data structure that allows for very cheap structural sharing and I was wondering if it would be possible to build an FS on top of it to have a truly distributed fs that can sync across machines.

replies(1): >>45084117 #
85. riku_iki ◴[] No.45084006{5}[source]
Intuition is that majority of software uses standard sync FS api..
86. koverstreet ◴[] No.45084024{10}[source]
bcachefs checksums (and compresses) at extent granularity, not block; encoded extents (checksummed/compressed) are limited to 128k by default.

This is a really good tradeoff in practice; the vast majority of applications are doing buffered IO, not small block O_DIRECT reads - that really only comes up in benchmarks :)

And it gets us better compression ratios and better metadata overhead.

We also have quite a bit of flexibility to add something bigger to the extent for FEC, if we need to - we're not limited to a 32/64 bit checksum.

87. m-p-3 ◴[] No.45084105[source]
I've recently started using OpenZFS after all these years, and after weighing all the pros and cons of BTRFS, mdadm, etc., ZFS is clearly on top for availability and resiliency.

Hopefully we can get to a point where Linux has a native and first-class modern alternative to ZFS with BcacheFS.

88. koverstreet ◴[] No.45084117{6}[source]
I have not seen those...
89. m-p-3 ◴[] No.45084130{5}[source]
I'm saddened by this turn of events, but I hope this won't deter you from working on bcachefs on your own terms and eventually seeing it reconciled into the kernel at some point.

Thank you for your hard work.

90. m-p-3 ◴[] No.45084282{7}[source]
Sure, but Oracle cannot retroactively relicense the code already published before then. The cat's already out of the bag, and as long as the code from before the fork is used according to the original license, it's legal.
replies(1): >>45084758 #
91. m-p-3 ◴[] No.45084324{3}[source]
Context: I mostly dealt with RAID1 in a home NAS setup

A ZFS pool will remain available even in degraded mode, and correct me if I'm wrong, but with BTRFS you mount the array through one of the volumes that is part of the array and not the array itself... so if that specific mounted volume happens to go down, the array becomes unavailable until you remount it through another available volume that is part of the array, which isn't great for availability.

I thought about mitigating that by making an mdadm RAID1 formatted with BTRFS and mounting the virtual volume instead, but then you lose the ability to prevent bit rot, since BTRFS loses that visibility if it doesn't manage the array natively.

replies(1): >>45089592 #
92. magicalhippo ◴[] No.45084552{17}[source]
> What happens on ZFS if you lose all your alloc info?

According to this[1] old issue, it hasn't happened frequently enough to prioritize implementing a rebuild option; however, one should be able to import the pool read-only and zfs send it to a different pool.

As far as I can tell that's status quo. I agree it is something that should be implemented at some point.

That said, certain other spacemap errors might be recoverable[2].

[1]: https://github.com/openzfs/zfs/issues/3210

[2]: https://github.com/openzfs/zfs/issues/13483#issuecomment-120...

replies(1): >>45085641 #
93. messe ◴[] No.45084758{8}[source]
I think you might have missed the point.

Yes. Oracle have that copyright.

That's the whole fucking point.

Anything from before the fork (and pretty much everything after) is still under the CDDL, which is possibly in conflict with the GPL.

replies(1): >>45085709 #
94. ajb ◴[] No.45084996{4}[source]
Oracle does also make GPL'd contributions to the Linux kernel. So by that reasoning, they would have standing.

It would be an interesting lawsuit, as the judge might well ask why, as copyright holder of ZFS, they can't solve the problem they are suing over. But I think you underestimate the deviousness of Oracle's legal dept.

replies(1): >>45085787 #
95. koverstreet ◴[] No.45085641{18}[source]
I take a harder line on repair than the ZFS devs, then :)

If I see an issue that causes a filesystem to become unavailable _once_, I'll write the repair code.

Experience has taught me that there's a good chance I'll be glad I did, and I like the peace of mind that I get from that.

And it hasn't been that bad to keep up on, thanks to lucky design decisions. Since bcachefs started out as bcache, with no persistent alloc info, we've always had the ability to fully rebuild alloc info, and that's probably the biggest and hardest one to get right.

You can metaphorically light your filesystem on fire with bcachefs, and it'll repair. It'll work with whatever is still there and get you a working filesystem again with the minimum possible data loss.

replies(1): >>45086586 #
96. p_l ◴[] No.45085709{9}[source]
Oracle can't do anything. They can't relicense code that was already released as CDDL in any form other than what they did when they closed down Solaris.

The CDDL being unacceptable is the same issue as GPLv3 or Apache being unacceptable - unlike GPLv2, CDDL mandates patent licensing as far as the covered code is concerned.

replies(1): >>45089367 #
97. p_l ◴[] No.45085742{6}[source]
Oracle has no standing.

Additionally, GPLv2 does not prevent shipping ZFS combined with GPL code, because CDDL code is not a derivative work of GPLv2 code. So it's legal to ship.

It could be problematic to upstream, because kernel development would demand streamlining to the point that the code would be derivative.

Additionally, two or three kernel contributors decided that the long-standing consensus on derivative works is not correct and sued Canonical. So far nothing has come of that; Los Alamos National Laboratory also laughed it off.

replies(1): >>45087150 #
98. koverstreet ◴[] No.45085787{5}[source]
If it ended up before Alsup we'd be fine.

Venue shopping being what it is, though...

99. magicalhippo ◴[] No.45086586{19}[source]
As I said I do think ZFS is great, but there are aspects where it's quite noticeable it was born in an enterprise setting. That sending, recreating and restoring the pool is a sufficient disaster recovery plan to not warrant significant development is one of those aspects.

As I mentioned in the other subthread, I do think your commitment to help your users is very commendable.

replies(1): >>45087228 #
100. highpost ◴[] No.45086629{5}[source]
Ubuntu ships OpenZFS as a separate prebuilt kernel module for ZFS (zfs-dkms). Interestingly, they also have ZFS support in GRUB to support booting from ZFS:

  * read-only and minimal
  * fully aware of different Linux boot environments
  * GPLv3 license compatible, clean-room implementation by the OpenSolaris/Illumos team. The implementation predates Ubuntu’s interest.
101. tzs ◴[] No.45087150{7}[source]
> Additionally, GPLv2 does not prevent shipping ZFS combined with GPL code, because CDDL code is not derivative work of GPLv2 code. So it's legal to ship.

The CDDL code is not a derivative work of GPLv2 code, but the combined work as a whole is a derivative work of GPLv2 code (assuming by "combined" we are talking about shipping an executable made by compiling and linking GPLv2 and CDDL code together). Shipping that work does require permission from both the GPLv2 code copyright owners and the CDDL code copyright owners, unless the code from one or the other can be justified under fair use or is a part of the GPLv2 or CDDL code that is not subject to copyright.

What Canonical does is ship ZFS as a kernel module. That contains minimal GPLv2 code from the kernel that should be justifiable as fair use (which seems like a decent bet after the Oracle vs Google case).

replies(1): >>45087732 #
102. koverstreet ◴[] No.45087228{20}[source]
Oh, I'm not trying to diss ZFS at all. You and I are in complete agreement, and ZFS makes complete sense in multi device setups with real redundancy and non garbage hardware - which is what it was designed for, after all.

Just trying to give honest assessments and comparisons.

103. p_l ◴[] No.45087732{8}[source]
The derivative portions of the ZFS driver are dual-licensed GPLv2/CDDL.

The CDDL-only parts of the driver are portable between OSes, removing the "derivative code" argument (similar argumentation goes back to introduction of AFS driver for Linux, IIRC).

Remember, GPLv2 does not talk about linking. Derivativeness is decided by the source code - among other things, by whether or not the non-GPL code can exist/operate without the GPL code.

104. cyphar ◴[] No.45089320{4}[source]
The Software Freedom Conservancy did a legal analysis and concluded that the incompatibility comes from both sides[1]. This also applies to the pre-2.0 MPL that CDDL was based on.

A lot of people focus on the fact that the CDDL allows binaries to be arbitrarily licensed so long as you provide the sources, but the issue is that the GPL requires that the source code of both combined and derived works be under the GPL and the CDDL requires that the source code be under the CDDL (i.e., the source code cannot be sublicensed). This means that (if you concluded that OpenZFS is a derived work of Linux or that it is a combined work when shipped as a kernel module) a combination may be a violation of both licenses.

However, the real question is whether a judge would look at two open source licenses that are incompatible due to a technicality and would conclude that Oracle is suffering actual harm (even though OpenZFS has decades of modifications from the Oracle version). They might also consider that Oracle themselves released DTrace (also under the CDDL) for their Linux distribution in 2012 as proof that Oracle doesn't consider it to be license violation either. If we did see Canonical get sued, maybe we'd finally be able to find out through discovery if the CDDL was intentionally designed to be GPL incompatible or not (a very contentious topic).

[1]: https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/

105. cyphar ◴[] No.45089367{10}[source]
Oracle is the license steward for CDDL, they have the right to release CDDL-2.0 and make it GPL-compatible which users would then be allowed to chose to use. Mozilla did the same thing with MPL-2.0 (CDDL was based on MPL-1.0), though the details are a little more complicated.

Unlike the GPL, the CDDL (and MPL) has an opt-out upgrade clause, and all of OpenSolaris (or more accurately, almost all software under the CDDL) can be upgraded to "CDDL-1.1 OR CDDL-2.0" unilaterally by Oracle even if they do not own the copyrights. See section 4 of the CDDL.

replies(1): >>45090676 #
106. wtallis ◴[] No.45089592{4}[source]
> with BTRFS you mount the array through one of the volume that is part of the array and not the array itself

I don't think btrfs has a concept of having only some subvolumes usable. Either you can mount the filesystem or you can't. What may have confused you is that you can mount a btrfs filesystem by referring to any individual block device that it uses, and the kernel will track down the others. But if the one device you have listed in /etc/fstab goes missing, you won't be able to mount the filesystem without fixing that issue. You can prevent the issue in the first place by identifying the filesystem by UUID instead of by an individual block device.
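
In practice that just means keying the fstab entry on the filesystem UUID rather than a single device node (the UUID below is made up):

  # the UUID is the same no matter which member device you ask
  blkid -s UUID -o value /dev/sdb

  # /etc/fstab
  UUID=0a1b2c3d-1111-2222-3333-444455556666  /data  btrfs  defaults  0  0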

107. wtallis ◴[] No.45089604{6}[source]
Tiering didn't go away with the migration to all-SSD storage. It just got somewhat hidden. All consumer SSDs are doing tiered storage within the drive, using drive-specific heuristics that are completely undocumented, and host software rarely if ever makes use of features that exist to provide hints to the SSD to allow its tiering/caching to be more intelligent. In the server space, most SSDs aren't doing this kind of caching, but it's definitely not unheard-of.
108. xelxebar ◴[] No.45089747[source]
I've used btrfs for 5-ish years in the most mundane, default setup possible. However, in that time, I've had three instances of corruption across three different drives, all resulting in complete loss of the filesystem. Two of these were simply due to hard power failures, and another due to a flaky cpu.

AFAIU, btrfs effectively absolves itself of responsibility in these cases, claiming the issue is buggy drive firmware.

109. ZenoArrow ◴[] No.45090117{7}[source]
Thank you for the details, appreciate it, sounds promising.
110. p_l ◴[] No.45090676{11}[source]
0) Assuming Oracle actually retains stewardship of the license:

1) Making CDDL compatible with GPLv2 puts everyone using CDDL code at mercy of Oracle patents

2) OpenZFS is actually not required to upgrade, and the team has indicated they won't. So you end up with a fork you need to carry yourself. Might even force OpenZFS to ensure that it's specifically 1.0.

Ultimately it means Oracle can't do much with this.

replies(1): >>45091293 #
111. ThatPlayer ◴[] No.45090758{6}[source]
Yeah, for enterprise where you can have dedicated machines for single use (and $) there probably isn't much appeal. That's why I emphasized as a home user, where all my machines are running various applications.

Also for video games, where performance matters, game sizes are huge, and it's nice to have a bunch of games installed.

112. cyphar ◴[] No.45091293{12}[source]
0) They do.

1) They could just adapt MPL-2.0, which provides GPLv2+ compatibility while still providing the same patent grants.

2) The upgrade is chosen by downstream users. The OpenZFS project could ask individual contributors to choose to license their future contributions differently, but that will only affect future versions and isn't a single decision made by the project leads. I don't know what context that discussion was in, but the fact that they have not already opted out of future CDDL versions kind of indicates that they can imagine future CDDL versions they would choose to upgrade to.

Also, OpenZFS is under CDDL-1.1 not 1.0.