Bcachefs Goes to "Externally Maintained"

(lwn.net)

betaby ◴[30 Aug 25 17:49 UTC] No.45076609[source]▶

>>45074312 (OP) #

The sad part is that, despite years of development, Btrfs never reached parity with ZFS. And yesterday's news: "Josef Bacik who is a long-time Btrfs developer and active co-maintainer alongside David Sterba is leaving Meta. Additionally, he's also stepping back from Linux kernel development as his primary job." See https://www.phoronix.com/news/Josef-Bacik-Leaves-Meta

There is no 'modern' ZFS-like fs in Linux nowadays.

replies(4): >>45076793 #>>45076833 #>>45078150 #>>45080011 #

tw04 ◴[31 Aug 25 03:05 UTC] No.45080011[source]▶

>>45076609 #

There's literally ZFS-on-linux and it works great. And yes, I will once again say Linus is completely wrong about ZFS and the multiple times he's spoken about it, it's abundantly clear he's never used it or bothered to spend any time researching its features and functionality.

https://zfsonlinux.org/

replies(5): >>45080040 #>>45080220 #>>45081040 #>>45082703 #>>45084105 #

koverstreet ◴[31 Aug 25 03:50 UTC] No.45080220[source]▶

>>45080011 #

ZFS deserves an absolutely legendary amount of respect for showing us all what a modern filesystem should be - the papers they wrote, alone, did the entire filesystem world such a massive service by demonstrating the possibilities of full data integrity and why we want it, and then they showed it could be done.

But there's a ton of room for improvement beyond what ZFS did. ZFS was a very conservative design in a lot of ways (rightly so! so many ambitious projects die because of second system syndrome); notably, it's block based and doesn't do extents - extents and snapshots are a painfully difficult combination.

Took me years to figure that one out.

My hope for bcachefs has always been to be a real successor to ZFS, with better and more flexible management, better performance, and even better robustness and reliability.

Long road, but the work continues.

replies(2): >>45080462 #>>45082319 #

1. TheAceOfHearts ◴[31 Aug 25 04:49 UTC] No.45080462[source]▶

>>45080220 #

> But there's a ton of room for improvement beyond what ZFS did.

Say more? I can't say I've really thought that much about filesystems and I'm curious in what direction you think they could be taken if time and budget weren't an issue.

replies(2): >>45080669 #>>45081484 #

2. koverstreet ◴[31 Aug 25 05:50 UTC] No.45080669[source]▶

>>45080462 (TP) #

that would be bcachefs :)

It's an entirely clean slate design, and I spent years taking my time on the core planning out the design; it's as close to perfect as I can make it.

The only things I can think of that I would change or add given unlimited time and budget:

- It should be written in Rust, or even better, Rust + dependent types (which I suspect could be done with proc macros) for formal verification. Cap'n Proto for on-disk data structures (which still needs Rust improvements to be as ergonomic as it should be) would also be a really nice improvement.

- More hardening; the only other thing we're lacking is comprehensive fault injection testing of on disk errors. It's sufficiently battle hardened that it's not a major gap, but it really should happen at some point.

- There's more work to be done in bitrot prevention: data checksums really need to be plumbed all the way into the pagecache

I'm sure we'll keep discovering new small ways to harden, but nothing huge at this point.

Some highlights:

- It has more defense in depth than any filesystem I know of. It's as close to impossible to have unrecoverable data loss as I think can really be done in a practical production filesystem - short of going full immutable/append only.

- Closest realization of "filesystem as a database" that I know of

- IO path options (replication level, compression, etc.) can be set on a per file or directory basis: I'm midway through a project extending this to do some really cool stuff, basically data management is purely declarative.

- Erasure coding is much more performant than ZFS's

- Data layout is fully dynamic, meaning you can add/remove devices at will, it just does the right thing - meaning smoother device management than ZFS

- The way the repair code works, and tracking of errors we've seen - fantastic for debuggability

- Debuggability and introspection are second to none: long bug hunts really aren't a thing in bcachefs development because you can just see anything the system is doing

There's still lots of work to do before we're fully at parity with ZFS. Over the next year or two I should be finishing erasure coding, online fsck, failure domains, lots more management stuff... there will always be more cool projects just over the horizon

replies(5): >>45082583 #>>45083071 #>>45083570 #>>45084005 #>>45084130 #

3. cyphar ◴[31 Aug 25 08:28 UTC] No.45081484[source]▶

>>45080462 (TP) #

You're replying to the bcachefs author, I expect his response will be fairly obvious. ;)

4. Icathian ◴[31 Aug 25 12:13 UTC] No.45082583[source]▶

>>45080669 #

I happen to work at a company that uses a ton of capnp internally and this is the first time I've seen it mentioned much outside of here. Would you mind describing what about it you think would make it a good fit for something like bcachefs?

replies(1): >>45082973 #

5. koverstreet ◴[31 Aug 25 13:21 UTC] No.45082973{3}[source]▶

>>45082583 #

Cap'n proto is basically a schema language that gets you a well defined in-memory representation that's just as good as if you were writing C structs by hand (laboriously avoiding silent padding, carefully using types with well defined sizes) - without all the silent pitfalls of doing it manually in C.
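The "silent padding" pitfall mentioned above can be illustrated with a minimal sketch using Python's standard `struct` module. The field layout (one byte followed by a 32-bit integer) is a made-up example, not a bcachefs or Cap'n Proto structure, and the native size depends on the platform's ABI.

```python
import struct

# Hypothetical record: a 1-byte type tag followed by a 32-bit length.
# "@" asks for native C layout (the compiler's alignment rules apply);
# "<" asks for little-endian, standard sizes, and no implicit padding.
native_size = struct.calcsize("@BI")   # platform dependent, typically larger than 5
packed_size = struct.calcsize("<BI")   # always 1 + 4 bytes

# Writing the padding explicitly ("3x" = three pad bytes) makes the
# on-disk layout identical on every platform, which is what a schema
# language automates for you.
explicit = struct.Struct("<B3xI")
encoded = explicit.pack(1, 2)
tag, length = explicit.unpack(encoded)
```

A schema compiler produces the equivalent of the explicit layout for every structure, so nobody has to count pad bytes by hand.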

It's extremely well thought out, it's minimalist in all the right ways; I've found the features and optimizations it has to be things that are borne out of real experience, things that you would end up wanting to build yourself in any real world system.

E.g. it gives you the ability to add new fields without breaking compatibility. That's the right way to approach forwards/backwards compatibility, and it's what I do in bcachefs and if we'd been able to just use cap'n proto it would've taken out a lot of manual fiddly work.
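One common way to get this kind of compatibility is to only ever append fields and to have readers default anything a record does not contain. The sketch below is a toy illustration of that idea; the record format, the field names (`inum`, `size`, `flags`) and the helper functions are all hypothetical, and this is neither Cap'n Proto's actual encoding nor the bcachefs on-disk format.

```python
import struct

# Schema v1 and v2: v2 only appends a field, it never reorders or removes.
V1_FIELDS = [("inum", "Q"), ("size", "Q")]
V2_FIELDS = V1_FIELDS + [("flags", "I")]

def encode(fields, values):
    """Serialize the fields in order, prefixed by the body length in bytes."""
    body = b"".join(struct.pack("<" + fmt, values[name]) for name, fmt in fields)
    return struct.pack("<H", len(body)) + body

def decode(fields, record):
    """Read the fields this reader knows about; default the ones that are absent."""
    (length,) = struct.unpack_from("<H", record, 0)
    body = record[2:2 + length]
    out, offset = {}, 0
    for name, fmt in fields:
        size = struct.calcsize("<" + fmt)
        if offset + size <= len(body):
            (out[name],) = struct.unpack_from("<" + fmt, body, offset)
        else:
            out[name] = 0  # written by an older version: use the default
        offset += size
    return out

old_record = encode(V1_FIELDS, {"inum": 7, "size": 4096})
new_record = encode(V2_FIELDS, {"inum": 7, "size": 4096, "flags": 3})

new_reads_old = decode(V2_FIELDS, old_record)  # missing "flags" defaults to 0
old_reads_new = decode(V1_FIELDS, new_record)  # unknown trailing bytes are ignored
```

Both directions work without a version switch in the reader, which is the "fiddly manual work" a schema language would generate automatically.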

The only blocker to using it more widely in my own code is that it's not sufficiently ergonomic in Rust - Rust needs lenses, from Swift.

6. nullc ◴[31 Aug 25 13:37 UTC] No.45083071[source]▶

>>45080669 #

> - Erasure coding is much more performant than ZFS's

any plans for much lower rates than typical raid?

Increasingly modern high density devices are having block level failures at non-trivial rates instead of or in addition to whole device failures. A file might be 100,000 blocks long, adding 1000 blocks of FEC would expand it 1% but add tremendous protection against block errors. And can do so even if you have a single piece of media. Doesn't protect against device failures, sure, though without good block level protection device level protection is dicey since hitting some block level error when down to minimal devices seems inevitable and having to add more and more redundant devices is quite costly.
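To make the overhead arithmetic concrete, here is a minimal sketch that adds one XOR parity block per group of 100 data blocks, which is the same 1% expansion. This is a deliberately simpler and weaker scheme than the FEC described above: XOR parity can rebuild only one lost block per group, whereas a proper erasure code could spend the same budget on any combination of lost blocks. The function names are illustrative only.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def add_parity(blocks, group_size):
    """Split blocks into groups and attach one XOR parity block to each group."""
    groups = []
    for start in range(0, len(blocks), group_size):
        group = blocks[start:start + group_size]
        groups.append((group, xor_blocks(group)))
    return groups

def recover(group, parity, lost_index):
    """Rebuild the block at lost_index from the survivors and the parity block."""
    survivors = [b for i, b in enumerate(group) if i != lost_index]
    return xor_blocks(survivors + [parity])

blocks = [bytes([i]) * 8 for i in range(200)]       # 200 tiny stand-in blocks
groups = add_parity(blocks, group_size=100)
overhead = len(groups) / len(blocks)                # parity blocks / data blocks
rebuilt = recover(groups[0][0], groups[0][1], lost_index=37)
```

The filesystem's per-block checksums tell you which block is bad, so the recovery step only has to handle a known-missing block, not locate the error.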

replies(1): >>45083444 #

7. koverstreet ◴[31 Aug 25 14:35 UTC] No.45083444{3}[source]▶

>>45083071 #

It's been talked about. I've seen some interesting work to use just a normal checksum to correct single bit errors.

If there's an optimized implementation we can use in the kernel, I'd love to add it. Even on modern hardware, we do see bit corruption in the wild, it would add real value.

replies(1): >>45083475 #

8. nullc ◴[31 Aug 25 14:39 UTC] No.45083475{4}[source]▶

>>45083444 #

It's pretty straightforward to use a normal checksum to correct single or even more bit errors (depending on the block size, choice of checksum, etc). Though I expect those bit errors are bus/RAM, and hopefully usually transient. If there is corruption on the media, the whole block is usually going to be lost, because any corruption means that its internal block level FEC has more errors than it can fix.
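A brute-force version of that idea fits in a few lines: flip each bit in turn and see whether the stored checksum matches again. The sketch below uses CRC-32 from Python's `zlib` and a 64-byte block as assumptions for illustration; it is not how bcachefs or any particular kernel code does it, and a real implementation would use a syndrome lookup instead of recomputing the CRC for every candidate bit.

```python
import zlib

def correct_single_bit(block: bytes, stored_crc: int):
    """Return the block with at most one bit repaired, or None if no
    single-bit flip makes the CRC-32 match. Assumes stored_crc is intact."""
    if zlib.crc32(block) == stored_crc:
        return block                      # nothing to fix
    buf = bytearray(block)
    for bit in range(len(buf) * 8):
        buf[bit // 8] ^= 1 << (bit % 8)   # try flipping this bit
        if zlib.crc32(buf) == stored_crc:
            return bytes(buf)             # this flip restores the checksum
        buf[bit // 8] ^= 1 << (bit % 8)   # undo and keep searching
    return None

original = bytes(range(64))
crc = zlib.crc32(original)

damaged = bytearray(original)
damaged[10] ^= 0x04                       # inject one flipped bit
repaired = correct_single_bit(bytes(damaged), crc)
```

Note that a block with several flipped bits can, in principle, also be "repaired" into something that matches the checksum but is not the original data.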

I was more thinking along the lines of adding dozens or hundreds of correction blocks to a whole file, along the lines of par (though there are much faster techniques now).

replies(1): >>45083563 #

9. koverstreet ◴[31 Aug 25 14:48 UTC] No.45083563{5}[source]▶

>>45083475 #

You'd think that, wouldn't you? But there are enough moving parts in the IO stack below the filesystem that we do see bit errors. I don't have enough data to do correlations and tell you likely causes, but they do happen.

I think SSDs are generally worse than spinning rust (especially enterprise grade SCSI kit); the hard drive vendors have been at this a lot longer and SSDs are massively more complicated. From the conversations I've had with SSD vendors, I don't think they've put the same level of effort into making things as bulletproof as possible yet.

replies(1): >>45083695 #

10. ZenoArrow ◴[31 Aug 25 14:49 UTC] No.45083570[source]▶

>>45080669 #

> Closest realization of "filesystem as a database" that I know of

More so than BFS?

https://en.m.wikipedia.org/wiki/Be_File_System

"Like its predecessor, OFS (Old Be File System, written by Benoit Schillings - formerly BFS), it includes support for extended file attributes (metadata), with indexing and querying characteristics to provide functionality similar to that of a relational database."

replies(1): >>45083752 #

11. nullc ◴[31 Aug 25 15:02 UTC] No.45083695{6}[source]▶

>>45083563 #

One thing to keep in mind is that correction always comes at some expense of detection.

Generally a code that can always detect N errors can only always correct N/2 errors. So you detect an errored block, you correct up to N/2 errors. The block now passes but if the block actually had N errors, your correction will be incorrect and you now have silent corruption.
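The trade-off is easy to see on a small textbook code. The sketch below uses Hamming(7,4), chosen here purely for illustration (it is not what any filesystem in this thread uses): its minimum distance is 3, so it can detect two bit errors or correct one, but if you use it to correct and two bits were actually flipped, the "fix" lands on a different valid codeword.

```python
def syndrome(word):
    """XOR of the 1-based positions of all set bits; 0 means a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if word[pos - 1]:
            s ^= pos
    return s

def encode(data_bits):
    """Place 4 data bits at positions 3, 5, 6, 7 and fill parity at 1, 2, 4."""
    word = [0] * 7
    word[2], word[4], word[5], word[6] = data_bits
    s = syndrome(word)
    for parity_pos in (1, 2, 4):
        if s & parity_pos:
            word[parity_pos - 1] = 1
    return word

def correct(word):
    """Assume at most one error: flip the bit the syndrome points at."""
    fixed = list(word)
    s = syndrome(fixed)
    if s:
        fixed[s - 1] ^= 1
    return fixed

codeword = encode([1, 0, 1, 1])

one_error = list(codeword)
one_error[4] ^= 1                        # a single flipped bit
fixed_one = correct(one_error)           # recovered correctly

two_errors = list(codeword)
two_errors[0] ^= 1
two_errors[1] ^= 1                       # two flipped bits: detectable (syndrome != 0)
fixed_two = correct(two_errors)          # but "correction" yields a wrong codeword
```

After the second correction the word passes the check, yet it differs from what was written: exactly the silent corruption described above.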

The solution to this is just to have an excess of error correction power and then don't use all of it. But that can be hard to do if you're trying to shoehorn it into an existing 32-bit crc.

How big are the blocks that the CRC units cover in bcachefs?

replies(1): >>45084024 #

12. koverstreet ◴[31 Aug 25 15:08 UTC] No.45083752{3}[source]▶

>>45083570 #

What BFS did is very cool, and I hope to add that to bcachefs someday.

But I'm talking more about the internals than external database functionality; the inner workings are much more fundamental.

bcachefs internally is structured more like a relational database than a traditional Unix filesystem, where everything hangs off the inode. In bcachefs, there's an extents btree (read: table), an inodes btree, a dirents btree, and a whole bunch of others - we're up to 20 (!).

There's transactions, where you can do arbitrary lookups, updates, and then commit, with all the database locking hidden from you; lookups within a transaction see uncommitted updates from that transaction. There's triggers, which are used heavily.

We don't have the full relational model - no SELECT or JOIN, no indices on arbitrary fields like with SQL (but you can do effectively the same thing with triggers, I do it all the time).
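As a rough mental model of those primitives, here is a toy in-memory sketch: named tables standing in for btrees, a transaction that sees its own uncommitted updates, and a trigger that maintains a secondary index. Every name in it (`Store`, `Transaction`, `inodes_by_uid`, and so on) is hypothetical; this is not the bcachefs API and it ignores locking, journalling and everything else that makes the real thing hard.

```python
class Store:
    """A set of named key/value tables, each with optional update triggers."""
    def __init__(self, table_names):
        self.tables = {name: {} for name in table_names}
        self.triggers = {name: [] for name in table_names}

    def begin(self):
        return Transaction(self)

class Transaction:
    def __init__(self, store):
        self.store = store
        self.pending = {}                 # (table, key) -> new value, None = delete

    def get(self, table, key):
        # Lookups see this transaction's own uncommitted updates first.
        if (table, key) in self.pending:
            return self.pending[(table, key)]
        return self.store.tables[table].get(key)

    def put(self, table, key, value):
        old = self.get(table, key)
        self.pending[(table, key)] = value
        for trigger in self.store.triggers[table]:
            trigger(self, key, old, value)  # triggers run inside the transaction

    def commit(self):
        for (table, key), value in self.pending.items():
            if value is None:
                self.store.tables[table].pop(key, None)
            else:
                self.store.tables[table][key] = value
        self.pending.clear()

def index_by_uid(txn, inum, old, new):
    """Trigger: keep a secondary index of inodes keyed by (uid, inode number)."""
    if old is not None:
        txn.put("inodes_by_uid", (old["uid"], inum), None)
    if new is not None:
        txn.put("inodes_by_uid", (new["uid"], inum), True)

store = Store(["inodes", "inodes_by_uid"])
store.triggers["inodes"].append(index_by_uid)

txn = store.begin()
txn.put("inodes", 42, {"uid": 1000})
seen_inside = txn.get("inodes", 42)       # visible before commit
visible_outside = dict(store.tables["inodes"])  # still empty
txn.commit()
```

The index update rides along in the same transaction as the inode update, so the two can never be observed out of sync after a commit.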

All the database/transactional primitives make the rest of the codebase much smaller and cleaner, and make feature development a lot easier than what you'd expect in other filesystems.

replies(1): >>45090117 #

13. lifty ◴[31 Aug 25 15:36 UTC] No.45084005[source]▶

>>45080669 #

Thanks for bcachefs and all the hard work you’ve put in it. It’s truly appreciated and hope you can continue to march on and not give up on the in-kernel code, even if it means bowing to Linus.

On a different note, have you heard about prolly trees and structural sharing? It’s a newer data structure that allows for very cheap structural sharing and I was wondering if it would be possible to build an FS on top of it to have a truly distributed fs that can sync across machines.

replies(1): >>45084117 #

14. koverstreet ◴[31 Aug 25 15:38 UTC] No.45084024{7}[source]▶

>>45083695 #

bcachefs checksums (and compresses) at extent granularity, not block; encoded extents (checksummed/compressed) are limited to 128k by default.
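A small sketch of what extent-granularity checksumming implies: one checksum covers the whole extent, so even a small read has to verify the full extent first. The 128 KiB limit below is my reading of "128k", CRC-32 stands in for whatever checksum is configured, and the dictionary layout is purely illustrative.

```python
import zlib

EXTENT_MAX = 128 * 1024   # assumed meaning of "128k by default"
BLOCK_SIZE = 4 * 1024     # assumed block size, for comparison only

def write_extent(data: bytes):
    """Store the data with a single checksum covering the entire extent."""
    assert len(data) <= EXTENT_MAX
    return {"data": data, "crc": zlib.crc32(data)}

def read_range(extent, offset, length):
    """Verify the whole extent, then return the requested slice."""
    if zlib.crc32(extent["data"]) != extent["crc"]:
        raise IOError("extent checksum mismatch")
    return extent["data"][offset:offset + length]

extent = write_extent(bytes(EXTENT_MAX))
chunk = read_range(extent, offset=8192, length=BLOCK_SIZE)

# Checksums needed per MiB of data at each granularity.
per_mib_extent = (1024 * 1024) // EXTENT_MAX
per_mib_block = (1024 * 1024) // BLOCK_SIZE
```

The cost is that a 4 KiB read touches up to 128 KiB of data; the benefit is far fewer checksums to store and larger units for the compressor to work on.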

This is a really good tradeoff in practice; the vast majority of applications are doing buffered IO, not small block O_DIRECT reads - that really only comes up in benchmarks :)

And it gets us better compression ratios and better metadata overhead.

We also have quite a bit of flexibility to add something bigger to the extent for FEC, if we need to - we're not limited to a 32/64 bit checksum.

15. koverstreet ◴[31 Aug 25 15:50 UTC] No.45084117{3}[source]▶

>>45084005 #

I have not seen those...

16. m-p-3 ◴[31 Aug 25 15:52 UTC] No.45084130[source]▶

>>45080669 #

I'm saddened by this turn of events, but I hope this won't deter you from working on bcachefs on your own terms and eventually seeing a reconciliation with the kernel at some point.

Thank you for your hard work.

17. ZenoArrow ◴[01 Sep 25 06:43 UTC] No.45090117{4}[source]▶

>>45083752 #

Thank you for the details, appreciate it, sounds promising.
