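This type of parallelism is sort of like a FLOPS metric: it describes what the hardware can do, not how much of the time it is actually doing it. Optimizing the fraction of wall time the GPU spends doing computation is just as important, if not more so. CUDA and Vulkan both have synchronization and pipelining tools for this (streams and events in CUDA; semaphores, fences, and pipeline barriers in Vulkan), but they are scary at first glance.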
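
To put rough numbers on the "wall time spent computing" idea, here is a minimal back-of-the-envelope sketch in Python. It is a toy model, not a benchmark and not real CUDA or Vulkan code: the function names, the chunk count, and the per-chunk copy and compute times are all made-up assumptions, and it assumes one transfer can overlap with one kernel at a time (roughly what double buffering would give you).

```python
# Toy model of GPU utilization: fraction of wall time spent computing.
# All names and timings here are hypothetical, chosen only for illustration.

def serial_wall_time(n_chunks, copy_ms, compute_ms):
    """Copy a chunk, compute on it, then repeat: nothing overlaps."""
    return n_chunks * (copy_ms + compute_ms)

def pipelined_wall_time(n_chunks, copy_ms, compute_ms):
    """Copy of chunk i+1 overlaps with compute on chunk i.

    The first copy and the last compute cannot be hidden; in between,
    the slower of the two stages sets the pace.
    """
    return copy_ms + (n_chunks - 1) * max(copy_ms, compute_ms) + compute_ms

def utilization(n_chunks, compute_ms, wall_ms):
    """Fraction of wall time the GPU is busy computing."""
    return n_chunks * compute_ms / wall_ms

if __name__ == "__main__":
    n, copy_ms, compute_ms = 8, 2.0, 3.0  # assumed workload
    for label, wall in [
        ("serial", serial_wall_time(n, copy_ms, compute_ms)),
        ("pipelined", pipelined_wall_time(n, copy_ms, compute_ms)),
    ]:
        busy = utilization(n, compute_ms, wall)
        print(f"{label}: wall={wall} ms, utilization={busy:.2f}")
```

With these assumed numbers the GPU is busy 60% of the time in the serial case and about 92% in the pipelined case, even though the kernels themselves are no faster. That gap is what the synchronization and pipelining tools are there to close.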