Pekka Paalanen wrote a nice blogpost about the concept of repaint scheduling with graphs:
https://ppaalanen.blogspot.com/2015/02/weston-repaint-schedu... (note that the Weston example there gives a whopping 7ms for composition).
I'm making some assumptions about your chart, as it is not to scale, but it looks like the usual worst-case strategy. Given a 60Hz refresh rate and a 1ms composition time, an optimized composition strategy would look something like this:
+0ms vblank, frame #-1 starts scanout
+15.4ms read cursor position #0, initiate composite #0
+16.4ms composition buffer #0 ready
+16.5ms update cursor plane position #0 and attach primary plane buffer #0
+16.6ms vblank, frame #0 starts scanout
+32.1ms read cursor position #1, initiate composite #1
+33.1ms composition buffer #1 ready
+33.2ms update cursor plane position #1 and attach primary plane buffer #1
+33.3ms vblank, frame #1 starts scanout
In this case, both the composite and the cursor position are only 1.2ms old at the time the GPU starts scanning them out, and hardware vs. software cursor has no effect on latency. Moving the cursor update closer to vblank would put the cursor out of sync with the displayed content, which is not really worth it.
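To make the arithmetic behind that timeline explicit, here is a rough Python sketch. The function names and the 0.2ms safety margin (time left for the plane update before vblank) are my own assumptions for illustration, not anything taken from Weston. It uses the exact 60Hz period, so the numbers come out a fraction of a millisecond off from the rounded ones above.

```python
def repaint_schedule(refresh_hz=60.0, composite_ms=1.0, margin_ms=0.2, frames=2):
    """Return (time_ms, event) pairs for a 'composite as late as possible' loop.

    margin_ms is an assumed safety margin between the composition buffer
    being ready and the vblank it has to hit (covers the plane update).
    """
    period_ms = 1000.0 / refresh_hz
    events = [(0.0, "vblank, frame #-1 starts scanout")]
    for n in range(frames):
        vblank = (n + 1) * period_ms
        # Start as late as possible while still finishing before the deadline.
        start = vblank - margin_ms - composite_ms
        ready = start + composite_ms
        events.append((start, f"read cursor position #{n}, initiate composite #{n}"))
        events.append((ready, f"composition buffer #{n} ready"))
        events.append((vblank, f"vblank, frame #{n} starts scanout"))
    return events


def input_age_ms(composite_ms=1.0, margin_ms=0.2):
    """Age of the sampled input at the moment scanout begins."""
    return composite_ms + margin_ms


def worst_case_input_age_ms(refresh_hz=60.0):
    """Age of the input if it is sampled right after the previous vblank."""
    return 1000.0 / refresh_hz


if __name__ == "__main__":
    for t, what in repaint_schedule():
        print(f"+{t:.1f}ms {what}")
    print(f"input age at scanout: {input_age_ms():.1f}ms")
    print(f"worst-case strategy:  {worst_case_input_age_ms():.1f}ms")
```

With the defaults, the first cursor read lands roughly 1.2ms before the vblank, compared to a full refresh period (about 16.7ms) if the compositor samples input and paints right after the previous vblank.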
(Games and other fullscreen applications can have their render buffer directly scanned out to remove the composition delay and read input at their own pace for simulation reasons, and those applications tend to be the subject at hand when discussing single or sub-millisecond input latency optimizations.)
> Frames are extremely fast to render, but they arrive the frame after they were originally scheduled, because GPU pipelines are asynchronous.
The display block is synchronous. While render pipelines are asynchronous, that is not a problem - as long as the render task completes before the scanout deadline, the resulting buffer can be included in that immediate scanout. Synchronization primitives are also there when you need them, and high-priority and compute queues can be used if you are concerned that the composition task might end up delayed by other work.
Also note that the scanout deadline is entirely virtual - the display block honors whatever framebuffer you point a plane to at any point in time; we just try to only do that during vblank to avoid tearing.
> If you actually try any of the tests I mentioned in my original comment you'll see this for yourself.
While it might be fun to see if Microsoft screwed up their composition and paint scheduling, that would not change the fact that the issue is unrelated to GPUs or the graphics stack itself. Working in the Linux display server space makes me quite comfortable in my understanding of GPU display controllers.