
621 points | sebg
jamesblonde No.43716889
Architecturally, it is a scale-out metadata filesystem [ref]. Other related distributed file systems are Colossus (Google), Tectonic (Meta), ADLSv2 (Microsoft), HopsFS (Hopsworks), and I think PolarFS (Alibaba). They all use different distributed row-oriented DBs for storing metadata: 3FS uses FoundationDB, Colossus uses BigTable, Tectonic uses some KV store, ADLSv2 (not sure), HopsFS uses RonDB.
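
To make the "metadata as rows in a distributed DB" idea concrete, here is a minimal sketch in Go: each directory entry becomes one row keyed by (parent inode, name), so lookups are point reads and directory listings are prefix scans that a scale-out store can partition. The key layout, struct fields, and the map standing in for the store are illustrative assumptions, not any of these systems' actual schemas.

    // Sketch only: rows keyed by (parent inode, name); a map stands in
    // for the distributed row-oriented store.
    package main

    import (
        "encoding/json"
        "fmt"
    )

    type Inode struct {
        ID    uint64 `json:"id"`
        Mode  uint32 `json:"mode"`
        Size  uint64 `json:"size"`
        Mtime int64  `json:"mtime"`
    }

    // dentryKey builds the row key: all children of a directory share the
    // parent-ID prefix, so listing a directory is a prefix/range scan.
    func dentryKey(parentID uint64, name string) string {
        return fmt.Sprintf("dentry/%020d/%s", parentID, name)
    }

    func main() {
        rows := map[string][]byte{} // stand-in for the distributed KV/row store

        // "create /data/train.bin" under directory inode 42
        val, _ := json.Marshal(Inode{ID: 1001, Mode: 0644, Size: 1 << 30, Mtime: 1700000000})
        rows[dentryKey(42, "train.bin")] = val

        // lookup is a single point read on the row key
        var ino Inode
        _ = json.Unmarshal(rows[dentryKey(42, "train.bin")], &ino)
        fmt.Printf("inode %d, size %d bytes\n", ino.ID, ino.Size)
    }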

What's important here with 3FS is that it supports (1) a FUSE client - it just makes life so much easier - and (2) NVMe storage - so that training pipelines aren't disk I/O bound (you can't always split files small enough, or parallelize reads/writes enough, against an S3 object store).
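
For context on the parallelism point, this is a rough sketch of the usual workaround against an object store: issue many ranged GETs concurrently rather than one sequential read. The URL, chunk size, and concurrency level are placeholders, not tied to any real bucket or to 3FS itself.

    // Sketch: 16 concurrent ranged reads against a placeholder object URL.
    package main

    import (
        "fmt"
        "io"
        "net/http"
        "sync"
    )

    func readRange(url string, off, length int64) ([]byte, error) {
        req, err := http.NewRequest("GET", url, nil)
        if err != nil {
            return nil, err
        }
        req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", off, off+length-1))
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        return io.ReadAll(resp.Body)
    }

    func main() {
        const url = "https://example.com/dataset.bin" // placeholder object URL
        const chunk = int64(8 << 20)                  // 8 MiB per ranged read

        var wg sync.WaitGroup
        for i := int64(0); i < 16; i++ { // 16 ranged reads in flight
            wg.Add(1)
            go func(off int64) {
                defer wg.Done()
                if buf, err := readRange(url, off, chunk); err == nil {
                    fmt.Printf("got %d bytes at offset %d\n", len(buf), off)
                }
            }(i * chunk)
        }
        wg.Wait()
    }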

Disclaimer: I worked on HopsFS. HopsFS adds tiered storage - NVMe for recent data and S3 for archival.

[ref]: https://www.hopsworks.ai/post/scalable-metadata-the-new-bree...

replies(5): >>43716985 #>>43717053 #>>43717220 #>>43719689 #>>43720601 #
threeseed No.43717053
Tiered storage and FUSE have existed in Alluxio for years.

And NVMe optimisations, e.g. NVMe-oF in OpenEBS (Mayastor).

None of it is particularly groundbreaking, just a lot of pieces brought together.

replies(1): >>43717195 #
jamesblonde No.43717195
The difference is scale-out metadata in the filesystem. Alluxio uses Raft, I believe, for metadata, so the whole namespace has to fit on a single server.
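
A rough sketch of the distinction being drawn: a single Raft group keeps one full copy of the namespace on every member, whereas scale-out metadata hashes or range-partitions keys across many shards so no one server has to hold it all. The shard count and hashing below are illustrative assumptions only.

    // Sketch: hash-partitioning metadata keys across shards.
    package main

    import (
        "fmt"
        "hash/fnv"
    )

    const numShards = 8 // each shard could itself be a small replicated group

    func shardFor(key string) int {
        h := fnv.New32a()
        h.Write([]byte(key))
        return int(h.Sum32()) % numShards
    }

    func main() {
        for _, k := range []string{"dentry/42/train.bin", "dentry/42/val.bin", "dentry/7/ckpt-0001"} {
            fmt.Printf("%-22s -> shard %d\n", k, shardFor(k))
        }
    }
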
replies(1): >>43717580 #
rfoo No.43717580
3FS isn't particularly fast in mdbench, though. Maybe our FDB tuning skills are to blame, or FUSE, I don't know, but it doesn't really matter.

The truly amazing part for me is the combination of NVMe SSDs + RDMA + support for efficiently reading a huge batch of random offsets from a few already-opened huge files. This is how you get your training boxes consuming 20~30 GiB/s (and roughly 4 million IOPS).
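
To show the shape of that access pattern (not how 3FS implements it), here is a minimal sketch: a batch of random-offset reads against one already-open file, using plain pread-style ReadAt and goroutines instead of RDMA or io_uring-style batching. The path, read size, and batch size are placeholders.

    // Sketch: 64 random-offset reads in flight against one huge file.
    package main

    import (
        "fmt"
        "math/rand"
        "os"
        "sync"
    )

    func main() {
        f, err := os.Open("/data/shard-000.bin") // placeholder: one huge, already-open file
        if err != nil {
            fmt.Println("open:", err)
            return
        }
        defer f.Close()

        const readSize = 512 * 1024 // e.g. 512 KiB per random read
        fi, err := f.Stat()
        if err != nil || fi.Size() <= readSize {
            return
        }

        var wg sync.WaitGroup
        for i := 0; i < 64; i++ { // a batch of 64 random-offset reads
            wg.Add(1)
            go func() {
                defer wg.Done()
                buf := make([]byte, readSize)
                off := rand.Int63n(fi.Size() - readSize)
                if _, err := f.ReadAt(buf, off); err == nil {
                    // hand the buffer to the training pipeline here
                }
            }()
        }
        wg.Wait()
    }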

replies(1): >>43718921 #
rjzzleep No.43718921
FUSE has traditionally been famously slow. I remember there were some changes that supposedly made it faster, but maybe that was just a particular FUSE implementation.
replies(1): >>43719340 #
jamesblonde No.43719340
The block size is 4 KB by default, which is a killer. We set it to 1 MB or so by default - it makes a huge difference.

https://github.com/logicalclocks/hopsfs-go-mount
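
A quick way to see why the transfer size matters, independent of any particular FUSE mount: reading the same file in 4 KiB chunks takes roughly 256x more read calls than in 1 MiB chunks, and with FUSE each call is a kernel/userspace round trip. The file path below is a placeholder; this is a generic illustration, not the hopsfs-go-mount configuration itself.

    // Sketch: count how many reads it takes to drain a file at two block sizes.
    package main

    import (
        "fmt"
        "io"
        "os"
    )

    func countReads(path string, blockSize int) (int, error) {
        f, err := os.Open(path)
        if err != nil {
            return 0, err
        }
        defer f.Close()

        buf := make([]byte, blockSize)
        reads := 0
        for {
            n, err := f.Read(buf)
            if n > 0 {
                reads++
            }
            if err == io.EOF {
                return reads, nil
            }
            if err != nil {
                return reads, err
            }
        }
    }

    func main() {
        for _, bs := range []int{4 << 10, 1 << 20} { // 4 KiB vs 1 MiB
            n, err := countReads("/data/sample.bin", bs)
            fmt.Printf("block=%8d reads=%d err=%v\n", bs, n, err)
        }
    }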