7 GB/s at a 512 KB block size is only ~14,000 IO/s, which leaves a whopping ~70 us per IO. That is a trivial rate even for synchronous IO. You should only need one in-flight operation (prefetch 1), so that the next read overlaps your memory copy instead of being serialized with it, to get the full IO bandwidth.
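To make "prefetch 1" concrete, here is a minimal sketch in Python, not the benchmark's actual code. It assumes plain buffered file reads on a single helper thread rather than O_DIRECT or io_uring, and `read_with_prefetch` and `process` are hypothetical names. The point is only the shape: issue the next read before touching the current block, so the copy and the IO overlap.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 512 * 1024  # 512 KB blocks, as in the comment above


def read_with_prefetch(path, process, block_size=BLOCK_SIZE):
    """Read `path` sequentially with exactly one read in flight.

    `process` is a hypothetical callback standing in for the memory copy.
    Returns the total number of bytes handed to `process`.
    """
    total = 0
    # buffering=0 gives a raw file object; one worker thread means one in-flight read.
    with open(path, "rb", buffering=0) as f, ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(f.read, block_size)
        while True:
            block = pending.result()  # wait for the read issued earlier
            if not block:  # empty bytes means end of file
                break
            pending = pool.submit(f.read, block_size)  # start the next read now
            process(block)  # runs while the next read is in progress
            total += len(block)
    return total
```

Something like `read_with_prefetch("data.bin", lambda b: dst.extend(b))` with `dst = bytearray()` would exercise it. A real benchmark would likely bypass the page cache, which this sketch does not attempt.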
Their referenced previous post [1] demonstrates ~240,000 IO/s with basic settings. Even that seems pretty low, but it is still roughly 17x the rate needed here, which is more than enough to completely trivialize this benchmark and saturate the hardware IO with zero tuning.
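For scale, the arithmetic behind these figures can be checked in a few lines. This is a back-of-the-envelope sketch: it assumes 7 GB/s is decimal and treats 512 KB as 512 KiB (with decimal kilobytes the results shift by a couple of percent), and `required_iops` is just an illustrative helper name.

```python
def required_iops(bandwidth_bytes_per_s, block_bytes):
    """IO operations per second needed to sustain a bandwidth at a given block size."""
    return bandwidth_bytes_per_s / block_bytes


BANDWIDTH = 7e9           # 7 GB/s (decimal, assumption)
BLOCK = 512 * 1024        # 512 KB read as 512 KiB (assumption)
DEMONSTRATED = 240_000    # IO/s figure quoted from the previous post [1]

iops = required_iops(BANDWIDTH, BLOCK)  # a bit under 14,000 IO/s
budget_us = 1e6 / iops                  # time budget per IO, in microseconds
headroom = DEMONSTRATED / iops          # how far the demonstrated rate exceeds the need

print(f"{iops:,.0f} IO/s, {budget_us:.0f} us per IO, {headroom:.0f}x headroom")
```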