> My understanding is that a single instruction is executed on a 16-wide SIMD unit, thus processing 16 elements/threads/lanes simultaneously (subject to the execution mask, of course). This is what I mean by "in lockstep".
Ok, I see; that's definitely not what I understood from my study of the Nvidia SIMT uarch.
And yes, I will claim that "the instruction can be executed in multiple passes with different masks depending on which arguments are available" (using your words).
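To make the masking point concrete, here is a minimal CUDA sketch (the kernel name and launch shape are mine) where a branch splits a warp; `__activemask()` reports which lanes are active on each side, and on Nvidia hardware the two sides are typically executed as separate passes under complementary masks:

```cuda
#include <cstdio>

// Illustrative kernel: the branch splits each warp into two groups, and
// the hardware runs each side under an execution mask. __activemask()
// returns the mask of lanes currently executing.
__global__ void show_masks() {
    if (threadIdx.x % 2 == 0) {
        unsigned m = __activemask();   // even lanes active here
        if (threadIdx.x == 0) printf("even-lane mask: 0x%08x\n", m);
    } else {
        unsigned m = __activemask();   // odd lanes active here
        if (threadIdx.x == 1) printf("odd-lane mask:  0x%08x\n", m);
    }
}

int main() {
    show_masks<<<1, 32>>>();           // one warp of 32 threads
    cudaDeviceSynchronize();
    return 0;
}
```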
> So the operand collector provides a limited reordering capability to maximize hardware utilization, right?
Yes, that's my understanding, and that's why I claim it's different from "classical" SIMD.
> What is the benefit as opposed to stalling and executing the instruction only when all arguments are available?
That's a good question. Note that I think the Apple GPU uarch does not work like the Nvidia one; my understanding is that Apple's uarch is much closer to a classical SIMD unit. So it's definitely not a killer to move away from Nvidia's original SIMT uarch.
That said, I think Nvidia's SIMT uarch is far more flexible and better maximizes hardware utilization (executing instructions as soon as possible always helps utilization).
And say you have 2 warps with complementary masking: with Nvidia's SIMT uarch it is natural to issue both warps simultaneously, and they can execute in the same cycle on different ALUs/cores. With a classical SIMD uarch it may be possible, but you need extra hardware to handle overlapping warp execution, and even more hardware to overlap more than 2 warps.
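As an illustration, here is a hypothetical CUDA kernel (the names and the lane/warp split are mine) where two warps end up with complementary active masks; a per-lane SIMT issue scheme could in principle pack such warps onto the same ALUs in the same cycle:

```cuda
#include <cstdio>

// Hypothetical example: warp 0 activates its even lanes, warp 1 its odd
// lanes, so the two warps' execution masks are exact complements
// (0x55555555 and 0xAAAAAAAA).
__global__ void complementary_masks() {
    unsigned warp = threadIdx.x / 32;
    unsigned lane = threadIdx.x % 32;
    if (lane % 2 == warp % 2) {
        unsigned m = __activemask();
        if (lane < 2) printf("warp %u active mask: 0x%08x\n", warp, m);
    }
}

int main() {
    complementary_masks<<<1, 64>>>();  // two warps
    cudaDeviceSynchronize();
    return 0;
}
```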
Also, Nvidia's operand collector allows emulating a multi-ported register file, which probably helps with register sharing.
There are actually multiple patents from Nvidia about non-trivial register allocation within the register-file banks, depending on how the registers will be used, to minimize conflicts.
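To show why that allocation matters, here is a toy host-side model (the bank count, the `r % 4` mapping, and all names are my assumptions, not Nvidia's actual scheme): if an instruction's source registers land in the same bank, the operand collector needs extra cycles to gather them, while a smarter allocation avoids the conflict.

```cuda
#include <cstdio>

// Toy model: a register file split into 4 banks, register r mapped to
// bank r % 4, and at most one read per bank per cycle.
const int NUM_BANKS = 4;

int collect_cycles(const int *src_regs, int n) {
    int reads_per_bank[NUM_BANKS] = {0};
    for (int i = 0; i < n; ++i)
        reads_per_bank[src_regs[i] % NUM_BANKS]++;
    int cycles = 0;                    // bottleneck bank sets the cost
    for (int b = 0; b < NUM_BANKS; ++b)
        if (reads_per_bank[b] > cycles) cycles = reads_per_bank[b];
    return cycles;
}

int main() {
    int conflicting[3] = {0, 4, 8};    // all map to bank 0 -> 3 cycles
    int spread[3]      = {0, 1, 2};    // distinct banks    -> 1 cycle
    printf("same-bank operands: %d cycles\n", collect_cycles(conflicting, 3));
    printf("spread operands:    %d cycles\n", collect_cycles(spread, 3));
    return 0;
}
```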
> Is any existing GPU actually doing superscalar execution from the same software thread (I mean the program thread, i.e., warp, not a SIMT thread)?
It's not obvious what "superscalar" would mean in a SIMT context.
For me, a superscalar core is a core that can extract instruction-level parallelism from sequential code (associated with a single thread) and therefore dispatch/issue/execute more than 1 instruction per cycle per thread.
With SIMT, most of the parallelism is explicit (as thread parallelism), so it's not really "extracted" (and not from the same thread).
But anyway, if your question is whether multiple instructions from a single warp can be executed in parallel (across different threads), then I would say probably yes for Nvidia (not sure, there is very little information available...); at least 2 instructions from the same thread block (from the same program, but different warps) should be able to execute in parallel.
> I think this is essentially what some architectures describe as the "register file cache"
I'm not sure about that; there are actually some published papers (and probably some patents) from Nvidia about a register-file cache for the SIMT uarch, and those came after the operand-collector patent.
But in the end it really depends on what concept you are referring to with "register-file cache".
In the Nvidia case, a "register-file cache" is a cache placed between the register file and the operand collector. It makes sense in their case since the register file has variable latency (depending on bank collisions) and because it saves SRAM read power.
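As a rough illustration of the power argument, here is a toy model (the FIFO policy, entry count, and all names are my assumptions): a tiny cache of recently read registers in front of the register file lets back-to-back instructions that reuse an operand skip the SRAM read.

```cuda
#include <cstdio>

// Toy model: a 4-entry cache of recently read register numbers sitting
// between the operand collector and the register-file SRAM.
const int RFC_ENTRIES = 4;

int rfc[RFC_ENTRIES];        // cached register numbers
int rfc_next = 0;            // simple FIFO replacement
int sram_reads = 0;

void read_operand(int reg) {
    for (int i = 0; i < RFC_ENTRIES; ++i)
        if (rfc[i] == reg) return;     // hit: no SRAM access needed
    sram_reads++;                      // miss: read the register file
    rfc[rfc_next] = reg;
    rfc_next = (rfc_next + 1) % RFC_ENTRIES;
}

int main() {
    for (int i = 0; i < RFC_ENTRIES; ++i) rfc[i] = -1;
    int trace[] = {0, 1, 2, 0, 1, 3, 0};  // r0/r1/r0 reused soon after
    for (int r : trace) read_operand(r);
    printf("SRAM reads: %d of %d operand fetches\n",
           sram_reads, (int)(sizeof(trace) / sizeof(trace[0])));
    return 0;
}
```

With this trace, 3 of the 7 operand fetches hit the cache, which is where the SRAM read-power saving would come from.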