There is no 'modern' ZFS-like fs in Linux nowadays.
But there's a ton of room for improvement beyond what ZFS did. ZFS was a very conservative design in a lot of ways (rightly so! so many ambitious projects die because of second system syndrome); notably, it's block based and doesn't do extents - extents and snapshots are a painfully difficult combination.
Took me years to figure that one out.
My hope for bcachefs has always been to be a real successor to ZFS, with better and more flexible management, better performance, and even better robustness and reliability.
Long road, but the work continues.
Say more? I can't say I've really thought that much about filesystems and I'm curious in what direction you think they could be taken if time and budget weren't an issue.
It's an entirely clean slate design, and I spent years taking my time on the core planning out the design; it's as close to perfect as I can make it.
The only things I can think of that I would change or add given unlimited time and budget:
- It should be written in Rust - and even better, Rust plus dependent types (which I suspect could be done with proc macros) for formal verification. And cap'n proto for on disk data structures (which still needs Rust improvements to be as ergonomic as it should be) would also be a really nice improvement.
- More hardening; the only other thing we're lacking is comprehensive fault injection testing of on disk errors. It's sufficiently battle hardened that it's not a major gap, but it really should happen at some point.
- There's more work to be done in bitrot prevention: data checksums really need to be plumbed all the way into the pagecache
I'm sure we'll keep discovering new small ways to harden, but nothing huge at this point.
Some highlights:
- It has more defense in depth than any filesystem I know of. It's as close to impossible to have unrecoverable data loss as I think can really be done in a practical production filesystem - short of going full immutable/append only.
- Closest realization of "filesystem as a database" that I know of
- IO path options (replication level, compression, etc.) can be set on a per-file or per-directory basis (see the sketch after this list): I'm midway through a project extending this to do some really cool stuff; basically, data management becomes purely declarative.
- Erasure coding is much more performant than ZFS's
- Data layout is fully dynamic, meaning you can add/remove devices at will, it just does the right thing - meaning smoother device management than ZFS
- The way the repair code works, and the tracking of errors we've seen - fantastic for debuggability
- Debuggability and introspection are second to none: long bug hunts really aren't a thing in bcachefs development because you can just see anything the system is doing
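To make the per-file options item above concrete: my understanding is that bcachefs exposes these options through extended attributes, so they can be requested with plain setxattr(2). This is only a sketch - the attribute name "bcachefs.compression", the value "zstd", and the path are my assumptions for illustration, not verified against the current docs.

    /*
     * Sketch: requesting a per-file IO option via an extended attribute.
     * Attribute name, value, and path are assumptions for illustration -
     * check the bcachefs documentation for the exact names.
     */
    #include <stdio.h>
    #include <string.h>
    #include <sys/xattr.h>

    int main(void)
    {
        const char *path  = "/mnt/bcachefs/bigfile";   /* hypothetical file */
        const char *value = "zstd";

        if (setxattr(path, "bcachefs.compression", value, strlen(value), 0))
            perror("setxattr");
        else
            printf("requested zstd compression for %s\n", path);
        return 0;
    }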
There's still lots of work to do before we're fully at parity with ZFS. Over the next year or two I should be finishing erasure coding, online fsck, failure domains, lots more management stuff... there will always be more cool projects just over the horizon
I will not use or recommend ZFS on _any_ OS until they solve the double page cache problem. A filesystem has no business running its own damned page cache that duplicates the OS one. I don't give a damn if ZFS has a fancy eviction algorithm. ARC's patent has expired. Go port it to mainline Linux if it's that good. Just don't build an inner platform.
I obviously have nothing like inside knowledge, but I assume the reason there have not been lawsuits over this is that whoever could bring one (would it be only Oracle?) expects even odds that they would lose. Thus the risk of setting an adverse precedent isn't worth the damages they might be awarded from suing Canonical?
But Sun ensured that they can only gnash their teeth.
The source of the "license incompatibility", btw, is the same as from using GPLv3 code in the kernel - the CDDL adds an extra restriction in the form of patent protections (just like Apache 2).
It's extremely well thought out, and it's minimalist in all the right ways; I've found the features and optimizations it has to be things borne out of real experience - things you would end up wanting to build yourself in any real-world system.
E.g. it gives you the ability to add new fields without breaking compatibility. That's the right way to approach forwards/backwards compatibility, and it's what I do in bcachefs; if we'd been able to just use cap'n proto, it would've taken out a lot of manual, fiddly work.
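To make the field-append idea concrete, here's a minimal sketch (my own illustration, not bcachefs or cap'n proto code): new fields only ever go at the end of an on-disk struct, the writer records how many bytes it wrote, and the reader zero-fills whatever the writer didn't know about, so old images keep working with new code.

    /*
     * Minimal sketch of append-only field compatibility (illustration only):
     * the reader zero-fills the struct, then copies however many bytes the
     * writer actually produced, so fields added later default to zero.
     */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    /* Current (v2) layout; 'compression' was appended in v2. */
    struct inode_fields {
        uint64_t size;
        uint64_t mtime;
        uint32_t compression;   /* new in v2; v1 images don't contain it */
    };

    static void read_inode(const void *disk, size_t disk_bytes,
                           struct inode_fields *out)
    {
        memset(out, 0, sizeof(*out));               /* unknown fields = 0 */
        memcpy(out, disk,
               disk_bytes < sizeof(*out) ? disk_bytes : sizeof(*out));
    }

    int main(void)
    {
        /* Pretend this came from a v1 image, written before 'compression'
         * existed: only the first two fields are present on disk. */
        struct { uint64_t size; uint64_t mtime; } v1 = { 4096, 1700000000 };

        struct inode_fields inode;
        read_inode(&v1, sizeof(v1), &inode);

        printf("size=%llu mtime=%llu compression=%u (defaulted)\n",
               (unsigned long long)inode.size,
               (unsigned long long)inode.mtime,
               inode.compression);
        return 0;
    }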
The only blocker to using it more widely in my own code is that it's not sufficiently ergonomic in Rust - Rust needs lenses, from Swift.
Any plans for much lower redundancy rates than typical RAID?
Increasingly, modern high-density devices are having block-level failures at non-trivial rates, instead of or in addition to whole-device failures. A file might be 100,000 blocks long; adding 1,000 blocks of FEC would expand it by 1% but add tremendous protection against block errors. And it can do so even if you only have a single piece of media. It doesn't protect against device failures, sure, but without good block-level protection, device-level protection is dicey: hitting some block-level error when down to the minimal number of devices seems inevitable, and having to add more and more redundant devices is quite costly.
I am not a lawyer.
My understanding is that Canonical is shipping ZFS with Ubuntu. Or do I misunderstand? Has Canonical not actually done the big, bad thing of distributing the Linux kernel with ZFS? Did they find some clever just-so workaround so as to technically not be in violation of the Linux kernel's license terms?
Otherwise, if Canonical has actually done the big, bad thing, who has standing to bring suit? Would the Linux Foundation sue Canonical, or would Oracle?
I ask this in all humility, and I suspect there is a chance that my questions are nonsense and I don't know enough to know why.
If there's an optimized implementation we can use in the kernel, I'd love to add it. Even on modern hardware, we do see bit corruption in the wild, it would add real value.
I was more thinking along the lines of adding dozens or hundreds of correction blocks to a whole file, along the lines of par (though there are much faster techniques now).
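As a toy illustration of that idea (my own sketch, not par or bcachefs code): a single XOR parity block over a group of data blocks can rebuild any one block that the checksums flag as bad. Real par-style tools use Reed-Solomon so that k correction blocks can repair any k bad blocks; this is just the k=1 case, to show the shape of it.

    /*
     * Toy FEC sketch: one XOR parity block over NR_BLOCKS data blocks lets
     * us rebuild any single block we know is bad (e.g. flagged by a
     * checksum).  Real implementations use Reed-Solomon for multiple
     * correction blocks; this is only the simplest possible case.
     */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    #define BLOCK_SIZE 8
    #define NR_BLOCKS  4

    int main(void)
    {
        uint8_t data[NR_BLOCKS][BLOCK_SIZE] = {
            "block-0", "block-1", "block-2", "block-3",
        };
        uint8_t parity[BLOCK_SIZE] = { 0 };

        /* Generate the parity block. */
        for (int b = 0; b < NR_BLOCKS; b++)
            for (int i = 0; i < BLOCK_SIZE; i++)
                parity[i] ^= data[b][i];

        /* Say block 2 went bad: XOR parity with the surviving blocks. */
        uint8_t recovered[BLOCK_SIZE];
        memcpy(recovered, parity, BLOCK_SIZE);
        for (int b = 0; b < NR_BLOCKS; b++)
            if (b != 2)
                for (int i = 0; i < BLOCK_SIZE; i++)
                    recovered[i] ^= data[b][i];

        printf("recovered block 2: %s\n", (char *)recovered);
        return 0;
    }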
I think SSDs are generally worse than spinning rust (especially enterprise-grade SCSI kit); the hard drive vendors have been at this a lot longer, and SSDs are massively more complicated. From the conversations I've had with SSD vendors, I don't think they've put the same level of effort into making things as bulletproof as possible yet.
More so than BFS?
https://en.m.wikipedia.org/wiki/Be_File_System
"Like its predecessor, OFS (Old Be File System, written by Benoit Schillings - formerly BFS), it includes support for extended file attributes (metadata), with indexing and querying characteristics to provide functionality similar to that of a relational database."
Generally a code that can always detect N errors can only always correct N/2 errors. So you detect an errored block, you correct up to N/2 errors. The block now passes but if the block actually had N errors, your correction will be incorrect and you now have silent corruption.
The solution to this is just to have an excess of error correction power and then not use all of it. But that can be hard to do if you're trying to shoehorn it into an existing 32-bit CRC.
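To put rough numbers on that (standard coding-theory bookkeeping, nothing bcachefs-specific): a code with minimum Hamming distance d is guaranteed to detect d-1 errors but only to correct floor((d-1)/2), which is where the detect-N / correct-N/2 rule of thumb comes from.

    /*
     * Detect-vs-correct gap for a code with minimum Hamming distance d:
     * guaranteed detection is d-1 errors, guaranteed correction is (d-1)/2.
     */
    #include <stdio.h>

    int main(void)
    {
        for (int d = 3; d <= 11; d += 2)
            printf("min distance %2d: always detect %2d, always correct %2d\n",
                   d, d - 1, (d - 1) / 2);
        return 0;
    }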
How big are the blocks that the CRC units cover in bcachefs?
But I'm talking more about the internals than external database functionality; the inner workings are much more fundamental.
bcachefs internally is structured more like a relational database than a traditional Unix filesystem, where everything hangs off the inode. In bcachefs, there's an extents btree (read: table), an inodes btree, a dirents btree, and a whole bunch of others - we're up to 20 (!).
There's transactions, where you can do arbitrary lookups, updates, and then commit, with all the database locking hidden from you; lookups within a transaction see uncommitted updates from that transaction. There's triggers, which are used heavily.
We don't have the full relational model - no SELECT or JOIN, no indices on arbitrary fields like with SQL (but you can do effectively the same thing with triggers, I do it all the time).
All the database/transactional primitives make the rest of the codebase much smaller and cleaner, and make feature development a lot easier than what you'd expect in other filesystems.
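A toy sketch of that transaction pattern, with made-up names and a trivial in-memory "table" standing in for a btree (this is not the actual bcachefs API): updates are buffered in the transaction, lookups check the transaction's own buffered updates before the committed state, and commit applies everything at once.

    /*
     * Toy transaction sketch (hypothetical names, not bcachefs code):
     * lookups inside a transaction see that transaction's uncommitted
     * updates; commit applies them to the shared state in one step.
     */
    #include <stdio.h>

    #define MAX_KEYS 16

    struct kv { int key, val, used; };

    static struct kv table[MAX_KEYS];      /* committed state ("the btree") */

    struct txn {
        struct kv updates[MAX_KEYS];       /* buffered, uncommitted updates */
        int nr;
    };

    static int txn_lookup(struct txn *t, int key, int *val)
    {
        /* This transaction's own updates take precedence... */
        for (int i = 0; i < t->nr; i++)
            if (t->updates[i].key == key) { *val = t->updates[i].val; return 0; }
        /* ...then fall back to committed state. */
        for (int i = 0; i < MAX_KEYS; i++)
            if (table[i].used && table[i].key == key) { *val = table[i].val; return 0; }
        return -1;
    }

    static void txn_update(struct txn *t, int key, int val)
    {
        t->updates[t->nr++] = (struct kv){ .key = key, .val = val, .used = 1 };
    }

    static void txn_commit(struct txn *t)
    {
        /* A real implementation takes locks and journals here; the toy just
         * applies each update to an existing or free slot (and assumes a
         * free slot exists). */
        for (int i = 0; i < t->nr; i++) {
            int j;
            for (j = 0; j < MAX_KEYS; j++)
                if (table[j].used && table[j].key == t->updates[i].key)
                    break;
            if (j == MAX_KEYS)
                for (j = 0; j < MAX_KEYS; j++)
                    if (!table[j].used)
                        break;
            table[j] = t->updates[i];
        }
        t->nr = 0;
    }

    int main(void)
    {
        struct txn t = { .nr = 0 }, t2 = { .nr = 0 };
        int v;

        txn_update(&t, 42, 7);
        if (!txn_lookup(&t, 42, &v))
            printf("inside txn:   42 -> %d (uncommitted)\n", v);

        txn_commit(&t);

        if (!txn_lookup(&t2, 42, &v))
            printf("after commit: 42 -> %d\n", v);
        return 0;
    }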
On a different note, have you heard about prolly trees and structural sharing? It’s a newer data structure that allows for very cheap structural sharing and I was wondering if it would be possible to build an FS on top of it to have a truly distributed fs that can sync across machines.
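Not the author, but for context: the core prolly-tree trick is that node boundaries are chosen by hashing the entries themselves, so the tree shape is a pure function of the data, and two replicas holding the same data build identical nodes that they can share or diff cheaply. A toy sketch of that boundary rule, with a made-up hash:

    /*
     * Toy sketch of the prolly-tree boundary rule: every entry is hashed,
     * and a node ends whenever the hash matches a pattern, so node
     * boundaries depend only on the data itself.  The hash is a made-up
     * mixer, purely for illustration.
     */
    #include <stdio.h>
    #include <stdint.h>

    static uint32_t mix(uint32_t x)
    {
        x ^= x >> 16; x *= 0x7feb352dU;
        x ^= x >> 15; x *= 0x846ca68bU;
        x ^= x >> 16;
        return x;
    }

    int main(void)
    {
        const uint32_t mask = 0x7;          /* ~1 in 8 entries ends a node */

        printf("node: ");
        for (uint32_t key = 1; key <= 64; key++) {
            printf("%u ", key);
            if ((mix(key) & mask) == 0)     /* boundary decided by content */
                printf("\nnode: ");
        }
        printf("\n");
        return 0;
    }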
This is a really good tradeoff in practice; the vast majority of applications are doing buffered IO, not small block O_DIRECT reads - that really only comes up in benchmarks :)
And it gets us better compression ratios and lower metadata overhead.
We also have quite a bit of flexibility to add something bigger to the extent for FEC, if we need to - we're not limited to a 32/64 bit checksum.
Hopefully we can get to a point where Linux has a native, first-class, modern alternative to ZFS with bcachefs.
Yes. Oracle have that copyright.
That's the whole fucking point.
Anything from before the fork (and pretty much everything after) is still under the CDDL, which is possibly in conflict with the GPL.
It would be an interesting lawsuit, as the judge might well ask why, as the copyright holder of ZFS, they can't solve the problem they are suing over. But I think you underestimate the deviousness of Oracle's legal department.
The CDDL being unacceptable is the same issue as GPLv3 or Apache being unacceptable - unlike GPLv2, the CDDL mandates patent licensing as far as the code is concerned.
Additionally, GPLv2 does not prevent shipping ZFS combined with GPL code, because CDDL code is not a derivative work of GPLv2 code. So it's legal to ship.
It could be problematic to upstream, because kernel development would demand streamlining to the point that the code would be derivative.
Additionally, two or three kernel contributors decided that the long-standing consensus on derivative works is not correct and sued Canonical. So far nothing has come of that; Los Alamos National Laboratory also laughed it off.
Venue shopping being what it is, though...
* read-only and minimal
* fully aware of different Linux boot environments
* GPLv3 license compatible, clean-room implementation by the OpenSolaris/Illumos team. The implementation predates Ubuntu’s interest.
The CDDL code is not a derivative work of GPLv2 code, but the combined work as a whole is a derivative work of GPLv2 code (assuming by "combined" we are talking about shipping an executable made by compiling and linking GPLv2 and CDDL code together). Shipping that work does require permission from both the GPLv2 code copyright owners and the CDDL code copyright owners, unless the code from one or the other can be justified under fair use or it was a part of the GPLv2 or CDDL code that is not subject to copyright.
What Canonical does is ship ZFS as a kernel module. That contains minimal GPLv2 code from the kernel that should be justifiable as fair use (which seems like a decent bet after the Oracle vs Google case).
The CDDL-only parts of the driver are portable between OSes, removing the "derivative code" argument (similar argumentation goes back to the introduction of the AFS driver for Linux, IIRC).
Remember, GPLv2 does not talk about linking. Derivativeness is decided by source code - among other things, whether or not the non-GPL code can exist/operate without the GPL code.
A lot of people focus on the fact that the CDDL allows binaries to be arbitrarily licensed so long as you provide the sources, but the issue is that the GPL requires that the source code of both combined and derived works be under the GPL and the CDDL requires that the source code be under the CDDL (i.e., the source code cannot be sublicensed). This means that (if you concluded that OpenZFS is a derived work of Linux or that it is a combined work when shipped as a kernel module) a combination may be a violation of both licenses.
However, the real question is whether a judge would look at two open source licenses that are incompatible due to a technicality and would conclude that Oracle is suffering actual harm (even though OpenZFS has decades of modifications from the Oracle version). They might also consider that Oracle themselves released DTrace (also under the CDDL) for their Linux distribution in 2012 as proof that Oracle doesn't consider it to be license violation either. If we did see Canonical get sued, maybe we'd finally be able to find out through discovery if the CDDL was intentionally designed to be GPL incompatible or not (a very contentious topic).
[1]: https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/
Unlike the GPL, the CDDL (and MPL) has an opt-out upgrade clause, and all of OpenSolaris (or more accurately, almost all software under the CDDL) can be upgraded to "CDDL-1.1 OR CDDL-2.0" unilaterally by Oracle even if they do not own the copyrights. See section 4 of the CDDL.
1) Making the CDDL compatible with GPLv2 puts everyone using CDDL code at the mercy of Oracle's patents.
2) OpenZFS is actually not required to upgrade, and the team has indicated they won't. So you end up with a fork you need to carry yourself. It might even push OpenZFS to pin itself specifically to 1.0.
Ultimately it means Oracle can't do much with this.
1) They could just adopt MPL-2.0, which provides GPLv2+ compatibility while still providing the same patent grants.
2) The upgrade is chosen by downstream users. The OpenZFS project could ask individual contributors to choose to license their future contributions differently, but that would only affect future versions and isn't a single decision made by the project leads. I don't know what context that discussion was in, but the fact that they have not already opted out of future CDDL versions kind of indicates that they can imagine future CDDL versions they would choose to upgrade to.
Also, OpenZFS is under CDDL-1.1 not 1.0.