Thank you for sharing such interesting work.
A small suggestion: try adding some more aggressive optimization options when compiling the SIMD C++ code to see the performance difference.
On my side, with an AMD Ryzen 9 7900X3D CPU, I get:
- 0.0592569 ms with `-O3 -march=native`
- 1.7741e-05 ms with `-funsafe-math-optimizations -Ofast -flto=auto -pipe -march=native`
replies(1):