
138 points by shipp02 | 1 comment
narrowbyte ◴[] No.40648378[source]
Quite an interesting framing. A couple of things have changed since 2011:

- SIMD (at least Intel's AVX-512) does have usable gather/scatter, so "Single instruction, multiple addresses" is no longer a flexibility win for SIMT vs SIMD

- likewise for pervasive masking support and "Single instruction, multiple flow paths" (a rough intrinsics sketch of both is below)
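
For concreteness, here is a rough, untested C++ sketch of both features using the standard AVX-512 intrinsics from <immintrin.h>; the function names and the abs example are mine, purely for illustration:

    #include <immintrin.h>

    // "Single instruction, multiple addresses": one gather loads 16 floats
    // from 16 arbitrary 32-bit indices into base[].
    __m512 gather16(const float *base, __m512i idx) {
        return _mm512_i32gather_ps(idx, base, sizeof(float));
    }

    // "Single instruction, multiple flow paths": a per-lane mask lets each
    // lane take or skip the "then" side of: if (x < 0) x = -x;
    __m512 abs_if_negative(__m512 x) {
        __mmask16 neg = _mm512_cmp_ps_mask(x, _mm512_setzero_ps(), _CMP_LT_OQ);
        // Where the mask is set, compute 0 - x; elsewhere pass x through.
        return _mm512_mask_sub_ps(x, neg, _mm512_setzero_ps(), x);
    }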

In general, I think of SIMD as more flexible than SIMT, not less, in line with this other post https://news.ycombinator.com/item?id=40625579. SIMT requires staying closer to the "embarrassingly" parallel end of the spectrum, while SIMD can be applied in cases where understanding the opportunity for parallelism is very non-trivial.

replies(3): >>40648477 #>>40648581 #>>40656815 #
raphlinus ◴[] No.40648581[source]
One of the other major things that's changed is that Nvidia now has independent thread scheduling (as of Volta, see [1]). That allows things like individual threads to take locks, which is a pretty big leap. Essentially, it allows you to program each individual thread as if it's running a C++ program, but of course you do have to think about the warp and block structure if you want to optimize performance.
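
To make the per-thread locking concrete, here's a minimal, untested CUDA C++ sketch; the mutex/counter names are mine, not from the post. Pre-Volta, this pattern could deadlock when contending threads shared a warp, since the warp had a single program counter; with independent thread scheduling the lock holder keeps making progress:

    __device__ int mutex = 0;   // 0 = free, 1 = held (illustrative global lock)

    __global__ void locked_increment(int *counter) {
        // Each thread acquires the lock individually.
        while (atomicCAS(&mutex, 0, 1) != 0) {
            // spin: other threads in the same warp can keep spinning here
            // while the lock holder below still makes forward progress
        }
        *counter += 1;           // critical section
        __threadfence();         // publish the write before releasing
        atomicExch(&mutex, 0);   // release
    }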

I disagree that SIMT is only for embarrassingly parallel problems. Both CUDA and compute shaders are now used for fairly sophisticated data structures (including trees) and algorithms (including sorting).

[1]: https://developer.nvidia.com/blog/inside-volta/#independent_...

replies(3): >>40649270 #>>40650486 #>>40651532 #
xoranth ◴[] No.40651532{3}[source]
> That allows things like individual threads to take locks, which is a pretty big leap.

Does anyone know how those get translated into SIMD instructions? Like, how do you do a CAS loop for each lane where each lane can individually succeed or fail? What happens if the lanes point to the same location?

replies(1): >>40652174 #
raphlinus ◴[] No.40652174{4}[source]
There's a bit more information at [1], but I think the details are not public. The hardware is tracking a separate program counter (and call stack) for each thread. So in the CAS example, one thread wins and continues making progress, while the other threads loop.
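
For the CAS question specifically, the visible behavior matches the standard CUDA idiom below (a hedged sketch of the programming model, not of the hardware internals; the float-add example is illustrative). Each lane runs its own retry loop; same-address atomics are serialized by the memory system, so one lane wins each contended round and the losers read back the new value and try again:

    // Emulate a float atomic add with a per-lane CAS loop.
    __device__ float cas_add(float *addr, float val) {
        unsigned int *p = reinterpret_cast<unsigned int *>(addr);
        unsigned int old = *p, assumed;
        do {
            assumed = old;
            // One lane's CAS succeeds per contended round; the others get
            // back the updated value and go around the loop again with it.
            old = atomicCAS(p, assumed,
                            __float_as_uint(__uint_as_float(assumed) + val));
        } while (old != assumed);
        return __uint_as_float(old);
    }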

There seems to be some more detail in a Bachelor's thesis by Phillip Grote [2], with lots of measurements of different synchronization primitives, but it doesn't go too deep into the hardware.

[1]: https://arxiv.org/abs/2205.11659

[2]: https://www.clemenslutz.com/pdfs/bsc_thesis_phillip_grote.pd...

replies(1): >>40652400 #
xoranth ◴[] No.40652400{5}[source]
Thanks!