I think their "Other distributed filesystem" section does not answer this question.
Among other things, the Ceph OSD was not designed with NVMe drives in mind - which is fair, given how old it is - so it's nowhere close to being able to handle the throughput and IOPS of modern NVMe drives.
For that you need zero-copy I/O, RDMA, etc.
Note that there is a next-generation OSD project called Crimson [0]; however, it has been in development for a while and I'm not sure how far along it is. It's based on the awesome Seastar framework [1], which backs ScyllaDB.
Achieving such performance would also require many changes to the client (RDMA, etc).
Something like Weka [2] has a much better design for this kind of performance.
I do agree that NVMe-oF is the next hurdle for Ceph performance.
"We were reading data at 635 GiB/s. We broke 15 million 4k random read IOPS."
Source: https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/
I don't know, man, I think 15M random read IOPS is actually quite fast. I've built multi-million-IOPS clusters in enterprise settings, all on NVMe, in the past.
680x NVMe SSDs across 68 storage servers (so 68 CPUs) for just 15M (or 25M, tuned) random read IOPS is pretty underwhelming. The use cases where 3FS (or other custom designs) shine are more like 200M random read IOPS from 64 servers, each with 8 PCIe Gen 4 NVMe SSDs (512 SSDs in total).
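To make that gap concrete, here's the back-of-envelope per-SSD arithmetic implied by those figures (numbers taken straight from above, nothing else assumed):

    # Rough per-SSD efficiency comparison
    ceph_iops, ceph_tuned_iops, ceph_ssds = 15e6, 25e6, 680
    custom_iops, custom_ssds = 200e6, 512

    print(f"Ceph (stock):  {ceph_iops / ceph_ssds:,.0f} IOPS per SSD")        # ~22,000
    print(f"Ceph (tuned):  {ceph_tuned_iops / ceph_ssds:,.0f} IOPS per SSD")  # ~37,000
    print(f"Custom design: {custom_iops / custom_ssds:,.0f} IOPS per SSD")    # ~390,000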
I have 3x Samsung NVMe drives (something enterprise with PLP; I forget the model number) across 3 nodes, linked with an InfiniBand mesh network. IIRC when I benchmarked it, I could get somewhere around 2000 MB/s, bottlenecked by single-core CPU performance. Fast enough for homelab needs.
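If anyone wants to run a similar test, here's a minimal sketch of a sequential-read fio run (not my exact setup: the mount point and file are placeholders, and the flags are just standard fio options):

    # Sequential-read benchmark against a mounted Ceph filesystem (sketch only)
    import subprocess

    subprocess.run([
        "fio",
        "--name=seqread",
        "--filename=/mnt/cephfs/bench.img",  # placeholder test file on the Ceph mount
        "--rw=read", "--bs=4M", "--size=10G",
        "--ioengine=libaio", "--direct=1",
        "--iodepth=32", "--numjobs=1",       # one job makes the single-core ceiling visible
        "--runtime=60", "--time_based",
        "--group_reporting",
    ], check=True)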
I benchmarked 1 vs. 2 OSDs per drive and found 2 OSDs performed better. I don't think it's recommended to run more than 2 OSDs per NVMe.
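For anyone wanting to try this, splitting a drive into multiple OSDs can be done with ceph-volume's batch mode; a sketch, assuming the --osds-per-device option of "ceph-volume lvm batch" (the device path is a placeholder):

    # Provision two OSDs on a single NVMe device (sketch; /dev/nvme0n1 is a placeholder)
    import subprocess

    subprocess.run(
        ["ceph-volume", "lvm", "batch", "--osds-per-device", "2", "/dev/nvme0n1"],
        check=True,
    )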