138 points by shipp02 | 13 comments
1. narrowbyte ◴[] No.40648378[source]
Quite interesting framing. A couple of things have changed since 2011:

- SIMD (at least Intel's AVX512) does have usable gather/scatter, so "Single instruction, multiple addresses" is no longer a flexibility win for SIMT vs SIMD (sketched below)

- likewise for pervasive masking support and "Single instruction, multiple flow paths"

In general, I think of SIMD as more flexible than SIMT, not less, in line with this other post https://news.ycombinator.com/item?id=40625579. SIMT requires staying more towards the "embarrassingly" parallel end of the spectrum; SIMD can be applied in cases where recognizing the opportunity for parallelism is very non-trivial.
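
A minimal sketch of both points with AVX-512 intrinsics (illustrative only: the table/idx/out names and the mask-off-negative-indices condition are invented here, and the tail loop for n not divisible by 16 is omitted):

    // Sketch: for active lanes, out[idx[i]] = table[idx[i]] * 2.0f; lanes with a
    // negative index are masked off. Assumes AVX512F (compile with -mavx512f).
    #include <immintrin.h>

    void gather_scale_scatter(const float* table, const int* idx, float* out, int n) {
        for (int i = 0; i + 16 <= n; i += 16) {
            __m512i vidx = _mm512_loadu_si512(idx + i);
            // "Multiple flow paths": lanes with idx < 0 drop out via the mask.
            __mmask16 keep = _mm512_cmpge_epi32_mask(vidx, _mm512_setzero_si512());
            // "Multiple addresses": one instruction gathers 16 arbitrary slots (scale = sizeof(float)).
            __m512 vals = _mm512_mask_i32gather_ps(_mm512_setzero_ps(), keep, vidx, table, 4);
            __m512 res = _mm512_mul_ps(vals, _mm512_set1_ps(2.0f));
            // Masked scatter writes back only the active lanes.
            _mm512_mask_i32scatter_ps(out, keep, vidx, res, 4);
        }
    }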

replies(3): >>40648477 #>>40648581 #>>40656815 #
2. majke ◴[] No.40648477[source]
Last time I looked at Intel scatter/gather, I got the impression it only works for a very narrow use case, and getting it to perform well wasn't easy. Did I miss something?
replies(2): >>40648681 #>>40655793 #
3. raphlinus ◴[] No.40648581[source]
One of the other major things that's changed is that Nvidia now has independent thread scheduling (as of Volta, see [1]). That allows things like individual threads to take locks, which is a pretty big leap. Essentially, it allows you to program each individual thread as if it's running a C++ program, but of course you do have to think about the warp and block structure if you want to optimize performance.

I disagree that SIMT is only for embarrassingly parallel problems. Both CUDA and compute shaders are now used for fairly sophisticated data structures (including trees) and algorithms (including sorting).

[1]: https://developer.nvidia.com/blog/inside-volta/#independent_...
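
As a concrete illustration of the per-thread locking this enables (a hedged sketch: the Node struct and names are invented for the example, and the pattern is only safe with Volta-style independent thread scheduling):

    // Each thread spins on its node's lock. Pre-Volta, a warp's single program
    // counter could starve the lane holding the lock and livelock the warp; with
    // independent thread scheduling, every resident thread keeps making progress.
    struct Node {
        int lock;   // 0 = free, 1 = held
        int value;
    };

    __device__ void locked_add(Node* node, int delta) {
        while (atomicCAS(&node->lock, 0, 1) != 0) { /* spin */ }
        node->value += delta;        // critical section
        __threadfence();             // publish the update before releasing
        atomicExch(&node->lock, 0);  // release
    }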

replies(3): >>40649270 #>>40650486 #>>40651532 #
4. narrowbyte ◴[] No.40648681[source]
The post says, about SIMT / GPU programming, "This loss results from the DRAM architecture quite directly, the GPU being unable to do much about it – similarly to any other processor."

I would say that for SIMD the situation is basically the same. gather/scatter don't magically make the memory hierarchy a non-issue, but they're no longer adding any unnecessary pain on top.

replies(1): >>40649219 #
5. yosefk ◴[] No.40649219{3}[source]
Barrel-threaded machines like GPUs have an easier time hiding the latency of bank conflict resolution when gathering/scattering against local memory/cache than a machine running a single instruction thread. So I'm pretty sure they have a fundamental advantage in the throughput of scatter/gather operations, one that grows with the number of vector lanes.
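
A hedged sketch of the kind of shared-memory gather being described (toy CUDA kernel, names invented): when several lanes of a warp hit the same bank, those accesses serialize, but the SM hides that latency by switching to other resident warps rather than stalling a lone instruction stream.

    __global__ void lut_gather(const float* lut_global, const int* idx, float* out, int n) {
        __shared__ float lut[1024];
        for (int j = threadIdx.x; j < 1024; j += blockDim.x)
            lut[j] = lut_global[j];          // stage the table in shared (banked) memory
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = lut[idx[i] & 1023];     // per-lane gather; bank conflicts possible
    }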
6. yosefk ◴[] No.40649270[source]
It's important that GPU threads support locking and control flow divergence, and I don't want to minimize that, but threads within a warp diverging still badly loses throughput, so I don't think the situation is fundamentally different in terms of what the machine is good/bad at. We're just closer to the base architecture's local maximum of capabilities, as one would expect for a more mature architecture: various things it could be made to support, it now actually supports, because there was time to add that support.
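
For a hedged picture of that divergence cost (toy kernel, invented names): when lanes of one warp take different sides of the branch, the warp steps through both blocks, each with part of its lanes masked off, so the divergent warp pays for both paths.

    __global__ void divergent(const int* in, int* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int x = in[i];
        if (x & 1) {
            for (int k = 0; k < 64; ++k) x = x * 3 + 1;   // lanes with odd x run this...
        } else {
            for (int k = 0; k < 64; ++k) x = x * 5 - 1;   // ...then lanes with even x run this
        }
        out[i] = x;
    }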
7. narrowbyte ◴[] No.40650486[source]
I intentionally said "more towards embarrassingly parallel" rather than "only embarrassingly parallel". I don't think there's a hard cutoff, but there is a qualitative difference. One example that springs to mind is https://github.com/simdjson/simdjson - afaik there's no similarly mature GPU-based JSON parsing.
replies(1): >>40651010 #
8. raphlinus ◴[] No.40651010{3}[source]
I'm not aware of any similarly mature GPU-based JSON parser, but I believe such a thing is possible. My stack monoid work [1] contains a bunch of ideas that may be helpful for building one. I've thought about pursuing that, but have kept focus on 2D graphics as it's clearer how that will actually be useful.

[1]: https://arxiv.org/abs/2205.11659

9. xoranth ◴[] No.40651532[source]
> That allows things like individual threads to take locks, which is a pretty big leap.

Does anyone know how those get translated into SIMD instructions? Like, how do you do a CAS loop for each lane where each lane can individually succeed or fail? What happens if the lanes point to the same location?

replies(1): >>40652174 #
10. raphlinus ◴[] No.40652174{3}[source]
There's a bit more information at [1], but I think the details are not public. The hardware is tracking a separate program counter (and call stack) for each thread. So in the CAS example, one thread wins and continues making progress, while the other threads loop.

There seems to be some more detail in a Bachelor's thesis by Phillip Grote [2], with lots of measurements of different synchronization primitives, but it doesn't go too deep into the hardware.

[1]: https://arxiv.org/abs/2205.11659

[2]: https://www.clemenslutz.com/pdfs/bsc_thesis_phillip_grote.pd...
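
A hedged sketch of the CAS-loop shape in question (a standard CUDA pattern, not NVIDIA's actual instruction-level lowering, which as noted isn't public): each lane retries independently, so lanes that lose a race on the same address simply go around the loop again while the winner moves on.

    // Toy example: atomic max for floats built from a per-lane CAS loop.
    __device__ float atomic_max_float(float* addr, float val) {
        int* iaddr = (int*)addr;
        int  old   = *iaddr;
        while (true) {
            float cur = __int_as_float(old);
            if (cur >= val) return cur;                        // this lane is already done
            int assumed = old;
            old = atomicCAS(iaddr, assumed, __float_as_int(val));
            if (old == assumed) return cur;                    // this lane's CAS won
            // otherwise another lane/thread won; retry against the value it wrote
        }
    }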

replies(1): >>40652400 #
11. xoranth ◴[] No.40652400{4}[source]
Thanks!
12. majke ◴[] No.40655793[source]
vpgatherdd - I think that on newer CPUs it is faster than many scalar loads + inserts, but if you are going to fault a lot, it becomes slow.

> The VGATHER instructions are implemented as micro-coded flow. Latency is ~50 cycles.

https://www.intel.com/content/www/us/en/content-details/8141...

13. ribit ◴[] No.40656815[source]
Modern GPUs are exposing the SIMD behind the SIMT model and investing heavily in SIMD features such as shuffles, votes, and reduces. This leads to an interesting programming model. One interesting challenge is that flow control is handled very differently on different hardware. AMD has a separate scalar instruction pipeline which can set the SIMD mask. Apple uses an interesting per-lane stack counter approach, where a value of zero means the lane is active and a non-zero value indicates how many blocks need to be exited for the thread to become active again. Not really sure how Nvidia does it.
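
A hedged CUDA sketch of the shuffle/vote style this exposes (the warp size of 32 and the kernel shape are illustrative; other vendors spell the same primitives differently, and blockDim.x is assumed to be a multiple of 32):

    // Warp-wide sum via shuffles: after the loop, lane 0 holds the total.
    __device__ float warp_sum(float v) {
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        return v;
    }

    // Vote: one ballot gathers a bit per lane, then a single lane updates the count.
    __global__ void count_positive(const float* in, int* count, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        bool pred = (i < n) && (in[i] > 0.0f);
        unsigned ballot = __ballot_sync(0xffffffffu, pred);
        if ((threadIdx.x & 31) == 0)
            atomicAdd(count, __popc(ballot));
    }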