SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011)

(yosefk.com)

138 points shipp02 | 2 comments | 10 Jun 24 06:05 UTC


Remnant44 ◴[11 Jun 24 18:20 UTC] No.40649679[source]▶

I think the principal thing that has changed since this article was written is that each category has taken inspiration from the others.

For example, SIMD instruction sets gained gather/scatter and even per-lane masking of instructions for divergent flow (in AVX-512, which consumers never get to play with). These can really simplify writing explicit SIMD and make it more GPU-like.

Conversely, GPUs gained a much higher emphasis on caching, sustained divergent flow via independent program counters, and subgroup instructions which are essentially explicit SIMD in disguise.

SMT, on the other hand, seems like it might be on the way out completely. While still quite effective for some workloads, it adds a lot of complexity for only situational improvements in throughput.

replies(2): >>40650596 #>>40652977 #

1. anonymoushn ◴[11 Jun 24 23:45 UTC] No.40652977[source]▶

>>40649679 #

After primarily using AVX2, I don't think masked instructions and scatter/gather are particularly useful. Emulating masked computations with a blend is cheap. Emulating compress and some missing shuffles is expensive. Masked stores and loads don't really help with anything except for an edge case where they don't cause page faults on the part that was masked out.
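
To make the blend point concrete, here is a minimal sketch in portable C++ that models an 8-lane vector with `std::array` rather than real intrinsics. The names `Vec`, `Mask`, `blend`, `masked_add`, and `masked_load` are hypothetical and chosen for illustration only. `blend` plays the role that `_mm256_blendv_ps` plays in AVX2, and `masked_load` models the zero-fill behaviour of `_mm256_maskload_ps`, where masked-off lanes never touch memory.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Scalar stand-in for an 8-lane float register; not a real SIMD type.
constexpr std::size_t kLanes = 8;
using Vec = std::array<float, kLanes>;
using Mask = std::array<bool, kLanes>;

// Lane-wise select: take b where the mask is set, otherwise a.
Vec blend(const Vec& a, const Vec& b, const Mask& m) {
    Vec out{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        out[i] = m[i] ? b[i] : a[i];
    }
    return out;
}

// Emulated masked add: compute on every lane unconditionally,
// then blend so masked-off lanes keep their old value from dst.
Vec masked_add(const Vec& dst, const Vec& x, const Vec& y, const Mask& m) {
    Vec sum{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        sum[i] = x[i] + y[i];
    }
    return blend(dst, sum, m);
}

// Masked load: only lanes with the mask set read memory; the rest
// are zero-filled. This is the one case a blend cannot emulate,
// because a full-width load would read past the end of the buffer.
Vec masked_load(const float* base, const Mask& m) {
    assert(base != nullptr);
    Vec out{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        if (m[i]) {
            out[i] = base[i];
        }
    }
    return out;
}
```

The extra cost of the emulated `masked_add` is one select per operation, which is why it is cheap. The load is different: with a 3-element tail buffer and a mask covering only the first three lanes, `masked_load` stays in bounds, while a full load followed by a blend would not.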

replies(1): >>40663275 #

2. petermcneeley ◴[12 Jun 24 21:23 UTC] No.40663275[source]▶

>>40652977 (TP) #

On the GPU a masked-out load is a no-op, which is certainly better. And scatter is probably quite painful to emulate without the intrinsics.
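
For a sense of what emulating scatter involves, here is a rough scalar model in C++, assuming an 8-lane vector represented as `std::array`. The names `Lanes`, `LaneIdx`, `LaneMask`, and `masked_scatter` are hypothetical. Without a hardware scatter, each active lane has to be extracted and stored individually, which is essentially this loop.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Scalar stand-in for an 8-lane register; not a real SIMD type.
constexpr std::size_t kWidth = 8;
using Lanes = std::array<float, kWidth>;
using LaneIdx = std::array<std::size_t, kWidth>;
using LaneMask = std::array<bool, kWidth>;

// Masked scatter: for every active lane i, store v[i] at base[idx[i]].
// Masked-off lanes do nothing at all, so their indices are never
// dereferenced or even bounds-checked.
void masked_scatter(float* base, std::size_t len, const LaneIdx& idx,
                    const Lanes& v, const LaneMask& m) {
    for (std::size_t i = 0; i < kWidth; ++i) {
        if (!m[i]) {
            continue;  // inactive lane: no memory access
        }
        assert(idx[i] < len);  // only active lanes must be in bounds
        base[idx[i]] = v[i];
    }
}
```

In this model, if two active lanes share an index, the higher lane wins simply because the loop runs in lane order. An inactive lane may carry a garbage index without any effect, which is the "masked-out access is a no-op" behaviour described above.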
