SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011)

(yosefk.com)

138 points shipp02 | 1 comments | 10 Jun 24 06:05 UTC

narrowbyte ◴[11 Jun 24 16:37 UTC] No.40648378[source]▶

Quite interesting framing. A couple of things have changed since 2011:

- SIMD (at least Intel's AVX-512) does have usable gather/scatter, so "Single instruction, multiple addresses" is no longer a flexibility win for SIMT vs SIMD

- likewise for pervasive masking support and "Single instruction, multiple flow paths"

In general, I think of SIMD as more flexible than SIMT, not less, in line with this other post: https://news.ycombinator.com/item?id=40625579. SIMT requires staying closer to the "embarrassingly parallel" end of the spectrum, whereas SIMD can be applied in cases where understanding the opportunity for parallelism is very non-trivial.
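
To make the gather/scatter and masking points a bit more concrete, here is a toy scalar model in Python, with one list element standing in for one SIMD lane. The function names (`masked_gather`, `masked_scatter`) and the sample data are made up for illustration and are not real intrinsics; actual AVX-512 code would use vector registers and mask registers rather than lists.

```python
# Toy scalar model of masked gather/scatter: one list element per SIMD lane.
# Names and data are hypothetical, chosen only to illustrate the semantics.

def masked_gather(memory, indices, mask, passthrough):
    """Active lanes load memory[indices[lane]]; inactive lanes keep passthrough."""
    return [memory[i] if m else p for i, m, p in zip(indices, mask, passthrough)]

def masked_scatter(memory, indices, mask, values):
    """Active lanes store values[lane] to memory[indices[lane]]; others do nothing."""
    out = list(memory)
    for i, m, v in zip(indices, mask, values):
        if m:
            out[i] = v  # if two active lanes share an index, the later lane wins
    return out

memory = [10, 20, 30, 40, 50, 60, 70, 80]
indices = [7, 0, 3, 3]  # one address per lane: "multiple addresses"
values = masked_gather(memory, indices, [True] * 4, [0] * 4)

# "Multiple flow paths": evaluate both sides of
#     if x > 30: x -= 30
#     else:      x += 1
# for every lane, then let a per-lane mask pick which result survives.
mask = [x > 30 for x in values]
then_side = [x - 30 for x in values]
else_side = [x + 1 for x in values]
blended = [t if m else e for t, e, m in zip(then_side, else_side, mask)]

# Write back only the lanes that took the "then" path.
updated = masked_scatter(memory, indices, mask, blended)
```

Note that the blend computes both sides for all lanes and discards half the results, which is the same kind of wasted work that divergence causes on the SIMT side.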

replies(3): >>40648477 #>>40648581 #>>40656815 #

raphlinus ◴[11 Jun 24 16:51 UTC] No.40648581[source]▶

>>40648378 #

One of the other major things that's changed is that Nvidia now has independent thread scheduling (as of Volta, see [1]). That allows things like individual threads to take locks, which is a pretty big leap. Essentially, it allows you to program each individual thread as if it's running a C++ program, but of course you do have to think about the warp and block structure if you want to optimize performance.

I disagree that SIMT is only for embarrassingly parallel problems. Both CUDA and compute shaders are now used for fairly sophisticated data structures (including trees) and algorithms (including sorting).

[1]: https://developer.nvidia.com/blog/inside-volta/#independent_...

replies(3): >>40649270 #>>40650486 #>>40651532 #

1. yosefk ◴[11 Jun 24 17:43 UTC] No.40649270[source]▶

>>40648581 #

It's important that GPU threads support locking and control flow divergence, and I don't want to minimize that, but divergence among threads within a warp still costs a lot of throughput, so I don't think the situation is fundamentally different in terms of what the machine is good/bad at. We're just closer to the base architecture's local maximum of capabilities, as one would expect for a more mature architecture; various things it could be made to support it now actually supports, because there was time to add this support.
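
A rough way to see the throughput loss is a toy cost model, sketched below in Python. It assumes that when lanes in a warp disagree on a branch, the warp runs each taken path in turn with the other lanes masked off, so the path costs add up. The function names and the cost numbers are hypothetical, and the model ignores reconvergence details, memory latency, and scheduling across warps.

```python
# Toy cost model for branch divergence within one warp (lockstep group of lanes).
# Assumption: each path taken by at least one lane is executed once by the
# whole warp, with non-participating lanes idle.

def warp_cycles(lane_takes_branch, then_cost, else_cost):
    """Issue slots the warp spends on one if/else, given per-lane outcomes."""
    cycles = 0
    if any(lane_takes_branch):
        cycles += then_cost      # at least one lane needs the "then" path
    if not all(lane_takes_branch):
        cycles += else_cost      # at least one lane needs the "else" path
    return cycles

def utilization(lane_takes_branch, then_cost, else_cost):
    """Fraction of lane-slots that do useful work during the if/else."""
    n = len(lane_takes_branch)
    taken = sum(lane_takes_branch)
    useful = taken * then_cost + (n - taken) * else_cost
    return useful / (n * warp_cycles(lane_takes_branch, then_cost, else_cost))

uniform = [True] * 32                 # all 32 lanes agree
diverged = [True] + [False] * 31      # a single lane goes the other way

uniform_cycles = warp_cycles(uniform, 100, 100)
diverged_cycles = warp_cycles(diverged, 100, 100)
uniform_util = utilization(uniform, 100, 100)
diverged_util = utilization(diverged, 100, 100)
```

Under these assumptions a single stray lane doubles the time for the branch and halves utilization, regardless of whether the hardware lets that lane diverge safely.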