621 points sebg | 14 comments
1. jamesblonde ◴[] No.43716889[source]
Architecturally, it is a scale-out metadata filesystem [ref]. Other related distributed file systems are Colossus (Google), Tectonic (Meta), ADLSv2 (Microsoft), HopsFS (Hopsworks), and I think PolarFS (Alibaba). They all use different distributed row-oriented DBs for storing metadata: 3FS uses FoundationDB, Colossus uses BigTable, Tectonic some KV store, ADLSv2 (not sure), HopsFS uses RonDB.

What's important here with 3FS is that it supports (1) a FUSE client - it just makes life so much easier - and (2) NVMe storage - so that training pipelines aren't disk I/O bound (you can't always split files small enough and do enough parallel reading/writing to an S3 object store).

Disclaimer: I worked on HopsFS. HopsFS adds tiered storage - NVMe for recent data and S3 for archival.

[ref]: https://www.hopsworks.ai/post/scalable-metadata-the-new-bree...
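To make the row-oriented metadata idea concrete, here is a minimal sketch (my illustration, not the actual schema of any of the systems above) of how path resolution turns into point lookups keyed by (parent inode ID, name) - exactly the kind of workload a distributed row store can shard and scale out:

    // Minimal sketch of path resolution against a row-oriented metadata store.
    // The in-memory map stands in for FoundationDB/RonDB/etc.; the key shape
    // (parent inode ID, name) -> inode row is the idea being illustrated.
    package main

    import (
        "fmt"
        "strings"
    )

    type Inode struct {
        ID    uint64
        IsDir bool
        Size  int64
    }

    type rowKey struct {
        ParentID uint64
        Name     string
    }

    // Resolve walks a path one component at a time; each step is a single
    // keyed read, so metadata capacity scales with the row store rather
    // than with one server's RAM.
    func Resolve(rows map[rowKey]Inode, path string) (Inode, error) {
        cur := Inode{ID: 1, IsDir: true} // root inode
        for _, name := range strings.Split(strings.Trim(path, "/"), "/") {
            if name == "" {
                continue
            }
            row, ok := rows[rowKey{ParentID: cur.ID, Name: name}]
            if !ok {
                return Inode{}, fmt.Errorf("%s: not found", name)
            }
            cur = row
        }
        return cur, nil
    }

    func main() {
        rows := map[rowKey]Inode{
            {1, "data"}:          {ID: 2, IsDir: true},
            {2, "train.parquet"}: {ID: 3, Size: 1 << 30},
        }
        fmt.Println(Resolve(rows, "/data/train.parquet"))
    }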

replies(5): >>43716985 #>>43717053 #>>43717220 #>>43719689 #>>43720601 #
2. nickfixit ◴[] No.43716985[source]
I've been using JuiceFS since the start for my AI stacks. Similar and used postgresql for the meta.
replies(1): >>43717041 #
3. jamesblonde ◴[] No.43717041[source]
JuiceFS is very good. I hadn't thought of it as a scale-out metadata FS, though - it supports lots of DBs for metadata (both single-host and distributed).
4. threeseed ◴[] No.43717053[source]
Tiered storage and FUSE have existed with Alluxio for years.

And NVMe optimisations, e.g. NVMe-oF in OpenEBS (Mayastor).

None of it is particularly groundbreaking - just a lot of pieces brought together.

replies(1): >>43717195 #
5. jamesblonde ◴[] No.43717195[source]
The difference is scale-out metadata in the filesystem. Alluxio uses Raft for metadata, I believe - so the metadata has to fit on a single server.
replies(1): >>43717580 #
6. objectivefs ◴[] No.43717220[source]
There is also ObjectiveFS that supports FUSE and uses S3 for both data and metadata storage, so there is no need to run any metadata nodes. Using S3 instead of a separate database also allows scaling both data and metadata with the performance of the S3 object store.
replies(1): >>43726938 #
7. rfoo ◴[] No.43717580{3}[source]
3FS isn't particularly fast in mdbench, though. Maybe our FDB tuning skills are to blame, or FUSE, I don't know - but it doesn't really matter.

The truly amazing part for me is the combination of NVMe SSD + RDMA + support for efficiently reading a huge batch of random offsets from a few already-opened huge files. This is how you get your training boxes consuming 20-30 GiB/s (and roughly 4 million IOPS).
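As a rough illustration of that access pattern only (this is not 3FS's native API), the sketch below keeps one large file open and issues many concurrent positioned reads at random offsets - on NVMe the point is keeping the queue deep, not laying data out sequentially:

    // Rough illustration of the access pattern only (not 3FS's native API):
    // a few large files kept open, many concurrent positioned reads at random
    // offsets. *os.File.ReadAt is safe for concurrent use, so one descriptor
    // can back hundreds of in-flight reads.
    package main

    import (
        "fmt"
        "math/rand"
        "os"
        "sync"
    )

    func main() {
        // Hypothetical large training shard; assumed much bigger than readSize.
        f, err := os.Open("/data/shard-000.bin")
        if err != nil {
            panic(err)
        }
        defer f.Close()
        fi, err := f.Stat()
        if err != nil {
            panic(err)
        }

        const (
            readSize = 512 * 1024 // 512KB per random read
            inFlight = 256        // concurrent reads; tune towards the NVMe queue depth
            total    = 4096
        )

        var wg sync.WaitGroup
        sem := make(chan struct{}, inFlight) // simple concurrency limiter
        for i := 0; i < total; i++ {
            wg.Add(1)
            sem <- struct{}{}
            go func() {
                defer wg.Done()
                defer func() { <-sem }()
                buf := make([]byte, readSize)
                off := rand.Int63n(fi.Size() - readSize)
                if _, err := f.ReadAt(buf, off); err != nil {
                    fmt.Fprintln(os.Stderr, err)
                }
            }()
        }
        wg.Wait()
    }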

replies(1): >>43718921 #
8. rjzzleep ◴[] No.43718921{4}[source]
FUSE has traditionally been famously slow. I remember there were some changes that supposedly made it faster, but maybe that was just a particular FUSE implementation.
replies(1): >>43719340 #
9. jamesblonde ◴[] No.43719340{5}[source]
The FUSE block size is 4KB by default, which is a killer. We set it to 1MB or so by default - it makes a huge difference.

https://github.com/logicalclocks/hopsfs-go-mount
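For illustration, here is a minimal sketch using the hanwen/go-fuse library (an assumption on my part - not necessarily what hopsfs-go-mount uses) of where that knob lives: raising MaxWrite/MaxReadAhead so each FUSE round trip moves about 1MB instead of a few KB:

    // Minimal sketch with the hanwen/go-fuse library (assumed here): serve a
    // local directory over FUSE, but raise the per-request transfer size to
    // ~1MB so each FUSE round trip moves a big block instead of 4KB.
    package main

    import (
        "log"

        "github.com/hanwen/go-fuse/v2/fs"
        "github.com/hanwen/go-fuse/v2/fuse"
    )

    func main() {
        root, err := fs.NewLoopbackRoot("/srv/data") // hypothetical backing directory
        if err != nil {
            log.Fatal(err)
        }
        server, err := fs.Mount("/mnt/bigblocks", root, &fs.Options{
            MountOptions: fuse.MountOptions{
                MaxWrite:     1 << 20, // 1MB writes per FUSE request
                MaxReadAhead: 1 << 20, // let the kernel read ahead in 1MB chunks
            },
        })
        if err != nil {
            log.Fatal(err)
        }
        server.Wait()
    }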

10. joatmon-snoo ◴[] No.43719689[source]
nit: Colossus* for Google.
11. MertsA ◴[] No.43720601[source]
>Tectonic some KV store,

Tectonic is built on ZippyDB which is a distributed DB built on RocksDB.

>What's important here with 3FS is that it supports (1) a FUSE client - it just makes life so much easier

Tectonic also has a FUSE client built for GenAI workloads on clusters backed by 100% NVMe storage.

https://engineering.fb.com/2024/03/12/data-center-engineerin...

Personally, what stands out to me about 3FS isn't just that it has a FUSE client, but that they made it more of a hybrid of a FUSE client and a native IO path. You open the file as normal, but once you have an fd you use their native library to do the actual IO. You still need to adapt whatever AI training code you have to use 3FS natively if you want to avoid the FUSE overhead, but the FUSE client now covers all the metadata operations that the native client would otherwise have had to implement.

https://github.com/deepseek-ai/3FS/blob/ee9a5cee0a85c64f4797...
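The pattern reads roughly like the sketch below - open() and all other metadata operations go through the FUSE mount, and only the hot data path is handed to a native client. nativeReadAt is a made-up placeholder, not the real 3FS native API:

    // Hedged sketch of the hybrid pattern described above: open() and other
    // metadata operations go through the FUSE mount like any normal file,
    // and only the bandwidth-critical data path uses a native client.
    // nativeReadAt is a made-up placeholder, NOT the real 3FS native API.
    package main

    import (
        "fmt"
        "os"
    )

    // nativeReadAt stands in for a user-space IO library that bypasses FUSE for
    // the data path (e.g. resolving the file to backend chunk locations and
    // issuing RDMA/NVMe reads directly). Here it just falls back to pread.
    func nativeReadAt(f *os.File, buf []byte, off int64) (int, error) {
        return f.ReadAt(buf, off)
    }

    func main() {
        // Open through the FUSE mountpoint: lookup and permissions are handled
        // by the filesystem's metadata path, same as for any other process.
        f, err := os.Open("/mnt/3fs/datasets/shard-000.bin") // hypothetical path
        if err != nil {
            panic(err)
        }
        defer f.Close()

        // Hand the open file to the "native" IO path for the heavy reads.
        buf := make([]byte, 1<<20)
        n, err := nativeReadAt(f, buf, 0)
        fmt.Println(n, err)
    }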

replies(1): >>43723212 #
12. Scaevolus ◴[] No.43723212[source]
Being able to opt-in to the more complex and efficient user-mode IO path for critical use cases is a very good idea.
replies(1): >>43724347 #
13. carlhjerpe ◴[] No.43724347{3}[source]
While not the same thing, Ceph storage is accessible as object storage, as a filesystem (both FUSE and kernel clients), and as block storage.
14. halifaxbeard ◴[] No.43726938[source]
OFS was a drop-in replacement for EFS and tbh it's insanely good value for the problem space.