I think their "Other distributed filesystem" section does not answer this question.
Among other things, the Ceph OSD was not designed with NVMe drives in mind - which is fair, given how old it is - so it's nowhere close to being able to handle the throughput and IOPS of modern NVMe drives.
For that you need zero-copy IO, RDMA, etc.
Note that there is a next-generation OSD project called Crimson [0]; however, it has been in development for a while now, and I'm not sure how well it's going. It's based on the awesome Seastar framework [1], which also backs ScyllaDB.
Achieving such performance would also require many changes on the client side (RDMA, etc.).
Something like Weka [2] has a much better design for this kind of performance.
I do agree that NVMe-oF is the next hurdle for Ceph performance.
I have 3x Samsung NVMe drives (something enterprise with PLP; I forget the model number) across 3 nodes, linked with an InfiniBand mesh network. IIRC, when I benchmarked it, I could get somewhere around 2000 MB/s, bottlenecked by single-core CPU performance. Fast enough for homelab needs.
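For the curious, here's a minimal sketch of one way to get a comparable number using `rados bench`, driven from Python; the pool name and duration are placeholders, not my exact setup:

    #!/usr/bin/env python3
    """Rough throughput check against a Ceph pool using `rados bench`.

    Assumes ceph-common is installed and a throwaway pool exists; the pool
    name and duration below are placeholders, not the original setup.
    """
    import subprocess

    POOL = "bench"    # hypothetical pool created just for benchmarking
    SECONDS = 60      # benchmark duration per pass

    def rados_bench(mode: str) -> str:
        """Run `rados bench` in the given mode ('write', 'seq' or 'rand')
        and return its raw output; throughput is reported in MB/s."""
        cmd = ["rados", "bench", "-p", POOL, str(SECONDS), mode]
        if mode == "write":
            cmd.append("--no-cleanup")  # keep objects so the read pass has data
        return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

    if __name__ == "__main__":
        print(rados_bench("write"))  # writes 4 MB objects by default
        print(rados_bench("seq"))    # sequential reads of the objects written above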
I benchmarked 1 vs. 2 OSDs per drive and found that 2 OSDs performed better. I don't think running more than 2 OSDs per NVMe is recommended.
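In case it helps anyone, a minimal sketch of how the 2-OSDs-per-drive split can be set up with `ceph-volume lvm batch` (the device path is a placeholder; adjust for your own nodes):

    #!/usr/bin/env python3
    """Carve NVMe drives into multiple OSDs via `ceph-volume lvm batch`.

    The device list is a placeholder; 2 OSDs per drive is just what
    benchmarked best for me, not a hard rule.
    """
    import subprocess

    DEVICES = ["/dev/nvme0n1"]   # hypothetical device path
    OSDS_PER_DEVICE = 2          # going beyond 2 per NVMe rarely seems worth it

    def create_osds(devices, osds_per_device):
        # `ceph-volume lvm batch` splits each device into N logical volumes
        # and creates one OSD on each, all in a single pass.
        cmd = ["ceph-volume", "lvm", "batch",
               "--osds-per-device", str(osds_per_device), *devices]
        subprocess.run(cmd, check=True)

    if __name__ == "__main__":
        create_osds(DEVICES, OSDS_PER_DEVICE)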