I think their "Other distributed filesystem" section does not answer this question.
Among other things, the Ceph OSD was not designed with NVMe drives in mind - which is fair, given how old it is - so it's nowhere close to being able to handle the throughput and IOPS of modern NVMe drives.
For that you need zero-copy I/O, RDMA, etc.
Note that there is a next-generation OSD project called Crimson [0]; however, it has been in development for a while and I'm not sure how far along it is. It's based on the awesome Seastar framework [1], which backs ScyllaDB.
Achieving such performance would also require many changes to the client (RDMA, etc).
Something like Weka [2] has a much better design for this kind of performance.
I do agree that NVMe-oF is the next hurdle for Ceph performance.
"We were reading data at 635 GiB/s. We broke 15 million 4k random read IOPS."
Source: https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/
I don't know, man, I think 15M random read IOPS is actually quite fast. I've built multi-million-IOPS clusters in enterprise settings, all on NVMe, in the past.
680x NVMe SSDs across 68 storage servers (so 68 CPUs) for just 15M (or 25M, tuned) random read IOPS is pretty underwhelming. The use cases where 3FS (or other custom designs) shine are more like 200M random read IOPS from 64 servers, each with 8 PCIe Gen 4 NVMe SSDs (512 SSDs in total).
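To make that gap concrete, here's the back-of-envelope per-SSD arithmetic implied by those figures (numbers taken straight from above, nothing else assumed):

    # Rough per-SSD efficiency comparison
    ceph_iops, ceph_tuned_iops, ceph_ssds = 15e6, 25e6, 680
    custom_iops, custom_ssds = 200e6, 512

    print(f"Ceph (stock):  {ceph_iops / ceph_ssds:,.0f} IOPS per SSD")        # ~22,000
    print(f"Ceph (tuned):  {ceph_tuned_iops / ceph_ssds:,.0f} IOPS per SSD")  # ~37,000
    print(f"Custom design: {custom_iops / custom_ssds:,.0f} IOPS per SSD")    # ~390,000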
I have 3x Samsung NVMe drives (something enterprise with PLP; I forget the model number) across 3 nodes, linked with an InfiniBand mesh network. IIRC when I benchmarked it, I could get somewhere around 2000 MB/s, bottlenecked by single-core CPU performance. Fast enough for homelab needs.
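If anyone wants to run a similar test, here's a minimal sketch of a sequential-read fio run (not my exact setup: the mount point and file are placeholders, and the flags are just standard fio options):

    # Sequential-read benchmark against a mounted Ceph filesystem (sketch only)
    import subprocess

    subprocess.run([
        "fio",
        "--name=seqread",
        "--filename=/mnt/cephfs/bench.img",  # placeholder test file on the Ceph mount
        "--rw=read", "--bs=4M", "--size=10G",
        "--ioengine=libaio", "--direct=1",
        "--iodepth=32", "--numjobs=1",       # one job makes the single-core ceiling visible
        "--runtime=60", "--time_based",
        "--group_reporting",
    ], check=True)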
I benchmarked 1 vs. 2 OSDs per drive and found 2 OSDs performed better. I don't think it's recommended to run more than 2 OSDs per NVMe.
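For anyone wanting to try this, splitting a drive into multiple OSDs can be done with ceph-volume's batch mode; a sketch, assuming the --osds-per-device option of "ceph-volume lvm batch" (the device path is a placeholder):

    # Provision two OSDs on a single NVMe device (sketch; /dev/nvme0n1 is a placeholder)
    import subprocess

    subprocess.run(
        ["ceph-volume", "lvm", "batch", "--osds-per-device", "2", "/dev/nvme0n1"],
        check=True,
    )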