
621 points | sebg
jamesblonde No.43716889
Architecturally, it is a scale-out metadata filesystem [ref]. Other related distributed file systems are Colossus (Google), Tectonic (Meta), ADLSv2 (Microsoft), HopsFS (Hopsworks), and I think PolarFS (Alibaba). They all use different distributed row-oriented DBs for storing metadata: 3FS uses FoundationDB, Colossus uses BigTable, Tectonic uses some KV store, ADLSv2 (not sure), HopsFS uses RonDB.
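
To make the "metadata as rows in a distributed DB" idea concrete, here is a minimal sketch in Go: each directory entry becomes one row keyed by (parent inode, name), so lookups are point reads and directory listings are prefix scans that a scale-out store can partition. The key layout, struct fields, and the map standing in for the store are illustrative assumptions, not any of these systems' actual schemas.

    // Sketch only: rows keyed by (parent inode, name); a map stands in
    // for the distributed row-oriented store.
    package main

    import (
        "encoding/json"
        "fmt"
    )

    type Inode struct {
        ID    uint64 `json:"id"`
        Mode  uint32 `json:"mode"`
        Size  uint64 `json:"size"`
        Mtime int64  `json:"mtime"`
    }

    // dentryKey builds the row key: all children of a directory share the
    // parent-ID prefix, so listing a directory is a prefix/range scan.
    func dentryKey(parentID uint64, name string) string {
        return fmt.Sprintf("dentry/%020d/%s", parentID, name)
    }

    func main() {
        rows := map[string][]byte{} // stand-in for the distributed KV/row store

        // "create /data/train.bin" under directory inode 42
        val, _ := json.Marshal(Inode{ID: 1001, Mode: 0644, Size: 1 << 30, Mtime: 1700000000})
        rows[dentryKey(42, "train.bin")] = val

        // lookup is a single point read on the row key
        var ino Inode
        _ = json.Unmarshal(rows[dentryKey(42, "train.bin")], &ino)
        fmt.Printf("inode %d, size %d bytes\n", ino.ID, ino.Size)
    }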

What's important here with 3FS is that it supports (1) a FUSE client - it just makes life so much easier - and (2) NVMe storage - so that training pipelines aren't disk I/O bound (you can't always split files small enough, or parallelize reads/writes enough, against an S3 object store).
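
For context on the parallelism point, this is a rough sketch of the usual workaround against an object store: issue many ranged GETs concurrently rather than one sequential read. The URL, chunk size, and concurrency level are placeholders, not tied to any real bucket or to 3FS itself.

    // Sketch: 16 concurrent ranged reads against a placeholder object URL.
    package main

    import (
        "fmt"
        "io"
        "net/http"
        "sync"
    )

    func readRange(url string, off, length int64) ([]byte, error) {
        req, err := http.NewRequest("GET", url, nil)
        if err != nil {
            return nil, err
        }
        req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", off, off+length-1))
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        return io.ReadAll(resp.Body)
    }

    func main() {
        const url = "https://example.com/dataset.bin" // placeholder object URL
        const chunk = int64(8 << 20)                  // 8 MiB per ranged read

        var wg sync.WaitGroup
        for i := int64(0); i < 16; i++ { // 16 ranged reads in flight
            wg.Add(1)
            go func(off int64) {
                defer wg.Done()
                if buf, err := readRange(url, off, chunk); err == nil {
                    fmt.Printf("got %d bytes at offset %d\n", len(buf), off)
                }
            }(i * chunk)
        }
        wg.Wait()
    }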

Disclaimer: I worked on HopsFS. HopsFS adds tiered storage - NVMe for recent data and S3 for archival.

[ref]: https://www.hopsworks.ai/post/scalable-metadata-the-new-bree...

replies(5): >>43716985 #>>43717053 #>>43717220 #>>43719689 #>>43720601 #
threeseed No.43717053
Tiered storage and FUSE have existed in Alluxio for years.

And NVMe optimisations, e.g. NVMe-oF in OpenEBS (Mayastor).

None of it is particularly groundbreaking, just a lot of pieces brought together.

replies(1): >>43717195 #
jamesblonde No.43717195
The difference is scale-out metadata in the filesystem. Alluxio uses Raft, I believe, for metadata, so the whole namespace has to fit on a single server.
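
A rough sketch of the distinction being drawn: a single Raft group keeps one full copy of the namespace on every member, whereas scale-out metadata hashes or range-partitions keys across many shards so no one server has to hold it all. The shard count and hashing below are illustrative assumptions only.

    // Sketch: hash-partitioning metadata keys across shards.
    package main

    import (
        "fmt"
        "hash/fnv"
    )

    const numShards = 8 // each shard could itself be a small replicated group

    func shardFor(key string) int {
        h := fnv.New32a()
        h.Write([]byte(key))
        return int(h.Sum32()) % numShards
    }

    func main() {
        for _, k := range []string{"dentry/42/train.bin", "dentry/42/val.bin", "dentry/7/ckpt-0001"} {
            fmt.Printf("%-22s -> shard %d\n", k, shardFor(k))
        }
    }
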
replies(1): >>43717580 #
rfoo No.43717580
3FS isn't particularly fast in mdbench, though. Maybe our FDB tuning skills are to blame, or FUSE, I don't know, but it doesn't really matter.

The truly amazing part for me is the combination of NVMe SSDs + RDMA + support for efficiently reading a huge batch of random offsets from a few already-opened huge files. This is how you get your training boxes consuming 20~30 GiB/s (and roughly 4 million IOPS).
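
To show the shape of that access pattern (not how 3FS implements it), here is a minimal sketch: a batch of random-offset reads against one already-open file, using plain pread-style ReadAt and goroutines instead of RDMA or io_uring-style batching. The path, read size, and batch size are placeholders.

    // Sketch: 64 random-offset reads in flight against one huge file.
    package main

    import (
        "fmt"
        "math/rand"
        "os"
        "sync"
    )

    func main() {
        f, err := os.Open("/data/shard-000.bin") // placeholder: one huge, already-open file
        if err != nil {
            fmt.Println("open:", err)
            return
        }
        defer f.Close()

        const readSize = 512 * 1024 // e.g. 512 KiB per random read
        fi, err := f.Stat()
        if err != nil || fi.Size() <= readSize {
            return
        }

        var wg sync.WaitGroup
        for i := 0; i < 64; i++ { // a batch of 64 random-offset reads
            wg.Add(1)
            go func() {
                defer wg.Done()
                buf := make([]byte, readSize)
                off := rand.Int63n(fi.Size() - readSize)
                if _, err := f.ReadAt(buf, off); err == nil {
                    // hand the buffer to the training pipeline here
                }
            }()
        }
        wg.Wait()
    }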

replies(1): >>43718921 #
rjzzleep No.43718921
FUSE has traditionally been famously slow. I remember there were some changes that supposedly made it faster, but maybe that was just a particular FUSE implementation.
replies(1): >>43719340 #
jamesblonde No.43719340
The block size is 4 KB by default, which is a killer. We set it to 1 MB or so by default - it makes a huge difference.

https://github.com/logicalclocks/hopsfs-go-mount
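
A quick way to see why the transfer size matters, independent of any particular FUSE mount: reading the same file in 4 KiB chunks takes roughly 256x more read calls than in 1 MiB chunks, and with FUSE each call is a kernel/userspace round trip. The file path below is a placeholder; this is a generic illustration, not the hopsfs-go-mount configuration itself.

    // Sketch: count how many reads it takes to drain a file at two block sizes.
    package main

    import (
        "fmt"
        "io"
        "os"
    )

    func countReads(path string, blockSize int) (int, error) {
        f, err := os.Open(path)
        if err != nil {
            return 0, err
        }
        defer f.Close()

        buf := make([]byte, blockSize)
        reads := 0
        for {
            n, err := f.Read(buf)
            if n > 0 {
                reads++
            }
            if err == io.EOF {
                return reads, nil
            }
            if err != nil {
                return reads, err
            }
        }
    }

    func main() {
        for _, bs := range []int{4 << 10, 1 << 20} { // 4 KiB vs 1 MiB
            n, err := countReads("/data/sample.bin", bs)
            fmt.Printf("block=%8d reads=%d err=%v\n", bs, n, err)
        }
    }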