> low latency results each frame
What does that do to your utilization?
I've been out of this space for a while, but in game dev, any backwards information flow (GPU->CPU) completely murdered performance. "Whatever you do, don't stall the pipeline." Instant 50%-90% performance hit. Even if you had to awkwardly duplicate calculations on the CPU, it was almost always worth it, and not by a small margin. The caveat to "almost" was that if you were willing to wait 2-4 frames to get the data back, you could read it without stalling the pipeline.
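The 2-4 frame trick is basically a ring buffer of in-flight readbacks: the CPU only ever touches results the GPU finished several frames ago. A toy model of that (my own sketch, pure Python, no real GPU API — the names and the 3-frame latency are made up for illustration):

```python
# Toy model of latency-tolerant readback: the CPU consumes the result
# issued LATENCY frames ago, so it never waits on work still in flight.
from collections import deque

LATENCY = 3  # frames of delay we tolerate before reading a result back

def simulate(num_frames):
    in_flight = deque()  # results "on the GPU", not yet safe to read
    consumed = []        # what the CPU actually sees each frame
    for frame in range(num_frames):
        in_flight.append(f"result_{frame}")       # kick off this frame's readback
        if len(in_flight) > LATENCY:
            consumed.append(in_flight.popleft())  # read an old, finished result
        else:
            consumed.append(None)                 # nothing ready in the first frames
    return consumed

print(simulate(6))
# [None, None, None, 'result_0', 'result_1', 'result_2']
```

Real APIs spell this the same way in spirit (e.g. fences plus a rotating set of staging buffers); the point is that the readback is never allowed to become a dependency of the current frame.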
I didn't think this was a memory architecture thing, I thought it was a data dependency thing. If you have to finish all calculations before reading back, and you have to read back before starting new calculations, then the time for all cores to empty out and fill back up is guaranteed to be dead time, regardless of whether the job was rendering polygons or finite element calculations or neural nets.
Does shared memory actually change this somehow? Or does it just make it more convenient to shoot yourself in the foot?
EDIT: or is the difference that HPC operates in a regime where long "frame time" dwarfs the pipeline empty/refill "dead time"?
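If that last guess is right, it's just a ratio. Back-of-the-envelope arithmetic with made-up numbers (mine, not from any benchmark):

```python
# Utilization when a synchronous readback forces the pipeline to drain
# and refill once per "frame". All numbers below are illustrative.
def utilization(work_ms, dead_ms):
    return work_ms / (work_ms + dead_ms)

# Game-like regime: ~16 ms frames, say ~8 ms of drain/refill per readback.
print(round(utilization(16, 8), 2))    # 0.67 -> the big hit games see

# HPC-like regime: seconds of compute per step, same fixed dead time.
print(round(utilization(4000, 8), 3))  # 0.998 -> dead time vanishes in the noise
```

Same fixed drain/refill cost either way; it only hurts when the useful work per step is comparable to it.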