They mention it’s 3x faster with collision turned off. I don’t know what the memory footprint of a block is, but I’d speculate that small round particles (center plus radius) are an order of magnitude faster.
Modern GPUs are insanely fast. A higher-end consumer GPU like a 5090 can do over 100 teraflops of fp32 computation if your cache is perfectly utilized and memory access isn’t the bottleneck. Normally, memory is the bottleneck, and at a minimum you need to read and write your particles every frame of a sim, which is why the sibling comments are using memory bandwidth to estimate the number of particles per second. I’d guess that if you were only advecting particles without collision, or colliding against only a small number of big objects (like the particles collide against the planet and not each other), then you could move multiple billions of particles per second, which you would divide by your desired frame rate to see how many particles per frame you can do.
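To make the bandwidth argument concrete, here’s a rough back-of-envelope sketch. The numbers are my assumptions, not measurements: roughly 1.8 TB/s of memory bandwidth for a 5090-class card, and 32 bytes per particle (say fp32 position and velocity plus a radius and padding). It only models one read and one write per particle per step, so it’s an upper bound for pure advection and says nothing about collision cost.

```python
def particles_per_second(bandwidth_bytes_per_s, bytes_per_particle):
    """Bandwidth-bound upper limit: each particle is read once and written once per step."""
    traffic_per_particle = 2 * bytes_per_particle  # one read + one write
    return bandwidth_bytes_per_s / traffic_per_particle


def particles_per_frame(bandwidth_bytes_per_s, bytes_per_particle, fps):
    """Split the per-second budget across the desired frame rate."""
    return particles_per_second(bandwidth_bytes_per_s, bytes_per_particle) / fps


# Assumed figures for illustration only.
ASSUMED_BANDWIDTH = 1.8e12       # bytes/s, roughly a 5090-class GPU (assumption)
ASSUMED_BYTES_PER_PARTICLE = 32  # position + velocity + radius + padding (assumption)

per_second = particles_per_second(ASSUMED_BANDWIDTH, ASSUMED_BYTES_PER_PARTICLE)
per_frame = particles_per_frame(ASSUMED_BANDWIDTH, ASSUMED_BYTES_PER_PARTICLE, fps=60)
print(f"{per_second:.3g} particles/s, {per_frame:.3g} particles/frame at 60 fps")
```

With those assumptions the ceiling is in the tens of billions of particles per second, so “multiple billions” leaves plenty of room for real-world inefficiency like imperfect coalescing or extra passes over the data.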