SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011)

(yosefk.com)

138 points shipp02 | 1 comments | 10 Jun 24 06:05 UTC

Remnant44 ◴[11 Jun 24 18:20 UTC] No.40649679[source]▶

I think the principal thing that has changed since this article was written is that each category has taken inspiration from the others.

For example, SIMD instruction sets gained gather/scatter and even per-lane masking of instructions for divergent flow (in AVX-512, which consumers never get to play with). These can really simplify writing explicit SIMD and make it more GPU-like.
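
A toy sketch of what masking plus gather buys you, not taken from the article or from any real intrinsics API. It emulates one vector step in plain Python: a compare produces a lane mask, and a masked gather loads only in the active lanes. The function names (`masked_gather`, `branchy_scalar`, `branchy_simd`) are made up for illustration.

```python
# Toy model of masked SIMD execution; these are NOT real AVX-512 intrinsics.

def masked_gather(table, indices, mask, fallback):
    """Per lane: load table[indices[i]] if mask[i] is set, else keep fallback[i].

    Inactive lanes never touch `table`, so their indices may be invalid.
    """
    return [table[i] if m else f for i, m, f in zip(indices, mask, fallback)]

def branchy_scalar(x, idx, table):
    """Reference scalar loop with a data-dependent branch per element."""
    out = []
    for xi, ii in zip(x, idx):
        out.append(table[ii] if xi > 0 else xi)
    return out

def branchy_simd(x, idx, table):
    """Same result as one vector compare plus one masked gather, no branch."""
    mask = [xi > 0 for xi in x]                # compare yields a lane mask
    return masked_gather(table, idx, mask, x)  # inactive lanes keep x
```

Both versions compute the same values; the second has the GPU-like shape where every lane runs the same instruction and the mask decides which lanes take effect.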

Conversely, GPUs gained a much stronger emphasis on caching, support for sustained divergent flow via independent program counters, and subgroup instructions, which are essentially explicit SIMD in disguise.
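
To make the "explicit SIMD in disguise" point concrete, here is a small Python emulation of a subgroup-style sum reduction. It is only a sketch: `shfl_down` loosely mimics the behaviour of a shuffle-down operation such as CUDA's `__shfl_down_sync` (out-of-range lanes keep their own value), and `warp_reduce_sum` is a hypothetical helper that assumes a power-of-two lane count.

```python
# Emulation of a warp-level tree reduction; lane values live in a Python list.

def shfl_down(values, delta):
    """Each lane i reads lane i + delta; lanes past the end keep their own value."""
    n = len(values)
    return [values[i + delta] if i + delta < n else values[i] for i in range(n)]

def warp_reduce_sum(values):
    """Tree reduction across lanes; the total ends up in lane 0.

    Assumes len(values) is a power of two (e.g. a 32-lane warp).
    """
    vals = list(values)
    delta = len(vals) // 2
    while delta > 0:
        shifted = shfl_down(vals, delta)
        vals = [a + b for a, b in zip(vals, shifted)]  # one "vector add" per step
        delta //= 2
    return vals[0]
```

Each loop iteration is a whole-vector permute followed by a whole-vector add, which is how one would write the same reduction with explicit SIMD.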

SMT on the other hand... seems like it might be on the way out completely. While still quite effective for some workloads, it seems like quite a lot of complexity for only situational improvements in throughput.

replies(2): >>40650596 #>>40652977 #

yosefk ◴[11 Jun 24 19:41 UTC] No.40650596[source]▶

>>40649679 #

The basic architecture still matters. GPUs still lose throughput upon divergence, regardless of their increased ability to run more kinds of divergent flows correctly thanks to separate PCs, and SIMD still has more trouble with instruction latency (including latency from bank conflict resolution in scatter/gather) than barrel-threaded machines, etc. This is not to detract from the importance of the improvements made to the base architectures over time.
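
A back-of-the-envelope model of the divergence cost, as a sketch under simplifying assumptions: one warp, a single if/else, each side issued once if any lane needs it, and fixed per-side costs. The names `warp_cycles` and `utilization` are hypothetical, and real hardware scheduling is more involved.

```python
# Simplified SIMT divergence model: both sides of a branch are serialized
# whenever the lanes of a warp disagree on the branch condition.

def warp_cycles(takes_branch, then_cost, else_cost):
    """Cycles to run one if/else for a warp (list of per-lane booleans)."""
    cycles = 0
    if any(takes_branch):        # at least one lane needs the "then" side
        cycles += then_cost
    if not all(takes_branch):    # at least one lane needs the "else" side
        cycles += else_cost
    return cycles

def utilization(takes_branch, then_cost, else_cost):
    """Fraction of issued lane-cycles doing useful work (non-empty warp assumed)."""
    useful = sum(then_cost if t else else_cost for t in takes_branch)
    issued = warp_cycles(takes_branch, then_cost, else_cost) * len(takes_branch)
    return useful / issued
```

With equal costs on both sides, a uniform warp reaches a utilization of 1.0, while a warp split evenly between the two sides drops to 0.5, whether or not the hardware tracks a separate PC per thread.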

replies(1): >>40650660 #

Remnant44 ◴[11 Jun 24 19:47 UTC] No.40650660[source]▶

>>40650596 #

Agreed! The basic categories remain; they are just blurring a bit at the edges.