
621 points sebg | 2 comments
randomtoast ◴[] No.43717002[source]
Why not use CephFS instead? It has been thoroughly tested in real-world scenarios and has demonstrated reliability even at petabyte scale. As an open-source solution, it can run on the fastest NVMe storage, achieving very high IOPS over a 10 Gigabit or faster interconnect.

I think their "Other distributed filesystem" section does not answer this question.

replies(4): >>43717453 #>>43717925 #>>43719471 #>>43721116 #
charleshn ◴[] No.43719471[source]
Because it's actually fairly slow.

Among other things, the OSD was not designed with NVMe drives in mind - which is fair, given how old it is - so it's nowhere close to being able to handle modern NVMe IO throughput and IOPS.

For that you need zero-copy I/O, RDMA, etc.

Note that there is a next-generation OSD project called Crimson [0]; however, it has been in the works for a while, and I'm not sure how well it's going. It's based on the awesome Seastar framework [1], which also backs ScyllaDB.

Achieving such performance would also require many changes to the client (RDMA support, etc.).

Something like Weka [2] has a much better design for this kind of performance.

[0] https://ceph.io/en/news/crimson/

[1] https://seastar.io/

[2] https://www.weka.io/

replies(2): >>43720342 #>>43725696 #
__turbobrew__ ◴[] No.43720342[source]
With the latest Ceph releases I am able to saturate modern NVMe devices with 2 OSDs per NVMe. It is kind of a hack to have multiple OSDs per NVMe drive, but it works (see the sketch below).

I do agree that NVMe-oF is the next hurdle for Ceph performance.
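
A minimal sketch of that 2-OSDs-per-NVMe setup, assuming ceph-volume's lvm batch mode; the device paths are hypothetical, not my exact commands:

    # Sketch: carve two OSDs out of each NVMe device with ceph-volume's batch mode.
    # Device paths are hypothetical; run on a host that is already part of the cluster.
    import subprocess

    NVME_DEVICES = ["/dev/nvme0n1", "/dev/nvme1n1"]  # assumption: adjust per host

    def create_osds(devices, osds_per_device=2, dry_run=True):
        cmd = ["ceph-volume", "lvm", "batch",
               "--osds-per-device", str(osds_per_device)] + devices
        if dry_run:
            cmd.append("--report")  # only report what ceph-volume would do
        subprocess.run(cmd, check=True)

    if __name__ == "__main__":
        create_osds(NVME_DEVICES)

As far as I understand, batch mode splits each device into equally sized LVs, one per OSD, so you don't have to partition the drives by hand.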

replies(1): >>43728082 #
1. sgarland ◴[] No.43728082[source]
I thought the current recommendation was to not have multiple OSDs per NVMe? Tbf I haven’t looked in a while.

I have 3x Samsung NVMe drives (something enterprise w/ PLP; I forget the model number) across 3 nodes, linked with an InfiniBand mesh network. IIRC when I benchmarked it, I could get somewhere around 2,000 MB/s, bottlenecked by single-core CPU performance. Fast enough for homelab needs.
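
One way to reproduce a number like that (a sketch only: fio against the CephFS mount, with a placeholder mount point and job sizes rather than my actual setup):

    # Sketch: rough sequential-read throughput check against a CephFS mount with fio.
    # The mount point and job parameters are placeholders, not a tuned benchmark.
    import subprocess

    CEPHFS_MOUNT = "/mnt/cephfs"  # assumption: wherever CephFS is mounted

    def seq_read_bench(directory=CEPHFS_MOUNT, runtime_s=60):
        subprocess.run([
            "fio",
            "--name=seqread",
            f"--directory={directory}",
            "--rw=read",          # sequential reads
            "--bs=4M",            # large blocks measure throughput (MB/s), not IOPS
            "--size=8G",
            "--numjobs=4",
            "--iodepth=16",
            "--ioengine=libaio",
            "--direct=1",
            "--time_based",
            f"--runtime={runtime_s}",
            "--group_reporting",
        ], check=True)

    if __name__ == "__main__":
        seq_read_bench()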

replies(1): >>43729277 #
2. __turbobrew__ ◴[] No.43729277[source]
https://ceph.io/en/news/blog/2023/reef-osds-per-nvme/

I benchmarked 1 vs 2 OSDs per drive and found 2 OSDs performed better. I don’t think it is recommended to run more than 2 OSDs per NVMe.
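
If you deploy with cephadm, the same 2-OSDs-per-device setting can be expressed declaratively in an OSD service spec. A minimal sketch (the service_id and placement are made up; verify the field names against your Ceph version):

    # Sketch: emit a cephadm OSD service spec placing two OSDs on each solid-state device.
    # Apply with `ceph orch apply -i osd_spec.yaml`; double-check fields for your release.
    import yaml  # PyYAML

    osd_spec = {
        "service_type": "osd",
        "service_id": "two_osds_per_nvme",   # arbitrary name
        "placement": {"host_pattern": "*"},  # assumption: target all hosts
        "spec": {
            "data_devices": {"rotational": 0},  # match only non-rotational devices
            "osds_per_device": 2,
        },
    }

    with open("osd_spec.yaml", "w") as f:
        yaml.safe_dump(osd_spec, f, sort_keys=False)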