SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011)

(yosefk.com)

138 points shipp02 | 1 comments | 10 Jun 24 06:05 UTC


narrowbyte ◴[11 Jun 24 16:37 UTC] No.40648378[source]▶

Quite interesting framing. A couple of things have changed since 2011:

- SIMD (at least Intel's AVX-512) does have usable gather/scatter, so "Single instruction, multiple addresses" is no longer a flexibility win for SIMT vs SIMD

- likewise for pervasive masking support and "Single instruction, multiple flow paths"
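To make the two bullets above concrete, here is a minimal scalar sketch in plain C of what "compare into a mask, then masked gather" means per lane. It is only a model of the semantics, not AVX-512 code: the names `LANES`, `mask_gt_zero` and `masked_gather` are made up for illustration, and the 16-lane width is an assumption matching 32-bit elements in a 512-bit vector. Real code would use the vendor intrinsics instead of these loops.

```c
#include <stdint.h>

#define LANES 16 /* assumed width: 16 x 32-bit lanes in one 512-bit vector */

/* Hypothetical helper: per-lane compare that produces a bitmask.
   Bit i of the result is set when x[i] > 0. */
static uint16_t mask_gt_zero(const int32_t x[LANES])
{
    uint16_t mask = 0;
    for (int lane = 0; lane < LANES; ++lane)
        if (x[lane] > 0)
            mask |= (uint16_t)(1u << lane);
    return mask;
}

/* Hypothetical helper: scalar model of a masked gather.
   Lanes whose mask bit is set load table[idx[lane]];
   lanes whose bit is clear keep their previous value in dst. */
static void masked_gather(int32_t dst[LANES], uint16_t mask,
                          const int32_t idx[LANES], const int32_t *table)
{
    for (int lane = 0; lane < LANES; ++lane)
        if (mask & (1u << lane))
            dst[lane] = table[idx[lane]];
}
```

Calling `masked_gather(y, mask_gt_zero(x), k, table)` then behaves like the per-element code `if (x[i] > 0) y[i] = table[k[i]];` run across all lanes at once: each lane has its own address and its own "flow path", selected by one mask bit.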

In general, I think of SIMD as more flexible than SIMT, not less, in line with this other post: https://news.ycombinator.com/item?id=40625579 . SIMT requires staying more towards the "embarrassingly" parallel end of the spectrum, whereas SIMD can be applied in cases where understanding the opportunity for parallelism is very non-trivial.

replies(3): >>40648477 #>>40648581 #>>40656815 #

majke ◴[11 Jun 24 16:44 UTC] No.40648477[source]▶

>>40648378 #

Last time I looked at Intel's scatter/gather, I got the impression it only works for a very narrow use case, and getting it to perform wasn't easy. Did I miss something?

replies(2): >>40648681 #>>40655793 #

narrowbyte ◴[11 Jun 24 16:57 UTC] No.40648681[source]▶

>>40648477 #

The post says, about SIMT / GPU programming, "This loss results from the DRAM architecture quite directly, the GPU being unable to do much about it – similarly to any other processor."

I would say that for SIMD the situation is basically the same. Gather/scatter don't magically make the memory hierarchy a non-issue, but they're no longer adding any unnecessary pain on top.

replies(1): >>40649219 #

1. yosefk ◴[11 Jun 24 17:39 UTC] No.40649219[source]▶

>>40648681 #

Barrel-threaded machines like GPUs have an easier time hiding the latency of bank conflict resolution when gathering/scattering against local memory/cache than a machine running a single instruction thread. So I'm pretty sure they have a fundamental advantage when it comes to the throughput of scatter/gather operations, and that it gets bigger with a larger number of vector lanes.
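As a rough illustration of the bank conflict cost mentioned above, here is a small C sketch of a deliberately simplified model. It assumes word-interleaved banks that each serve one request per cycle, and it ignores broadcast of identical addresses; real hardware differs in the details. The function name `gather_cycles` is hypothetical.

```c
#include <stddef.h>

/* Hypothetical, simplified model: memory is split into `banks` banks,
   interleaved by word address, and each bank serves one request per cycle.
   A gather of n addresses then needs as many cycles as the busiest bank
   has requests. Identical addresses are counted as separate requests. */
static unsigned gather_cycles(const unsigned *addr, size_t n, unsigned banks)
{
    unsigned worst = 0;
    for (unsigned b = 0; b < banks; ++b) {
        unsigned hits = 0;
        for (size_t i = 0; i < n; ++i)
            if (addr[i] % banks == b)
                ++hits;
        if (hits > worst)
            worst = hits;
    }
    return worst;
}
```

Under this model, 16 lanes reading consecutive words from 16 banks take 1 cycle, a stride of 2 takes 2 cycles, and a stride of 16 serializes to 16 cycles. With more lanes, the worst case grows with the lane count. Those extra cycles are the latency that a machine with many resident threads can overlap with other threads' work, while a single instruction thread has to wait them out.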
