> The earlier you sample the cursor position and update the cursor plane, the more the position is out of date once the next scanout comes around, increasing the perceived input delay.
No, the cursor position is more up to date than the rest of the screen, because once the cursor moves, its new position doesn't have to wait for a GPU pipeline to finish.
> Unless Microsoft is doing something weird, this should be extremely fast. <1ms fast.
Look, I'm saying this is what's going on (not to scale):
... | vsync ...
... | cursor updated for frame 0 ...
... | frame 0 scanout ...
... | frame 1 ready ...
... | vsync ...
... | cursor updated for frame 1 ...
... | frame 1 scanout ...
... | frame 2 ready ...
Frames are extremely fast to render, but they arrive the frame after they were originally scheduled, because GPU pipelines are asynchronous. The cursor position, however, arrives immediately, because the position of the hardware layer can be updated synchronously right before scanout. The effect is that updates to the cursor position are (essentially) displayed 1 frame sooner than updates to the rest of the screen. If you actually try any of the tests I mentioned in my original comment, you'll see this for yourself.
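
To put numbers on the timeline above, here's a rough toy model. It's a sketch, not a measurement of any real compositor. The 60 Hz refresh rate and the function names are my own assumptions for illustration. It assumes the cursor plane is repositioned right at the next vsync, while rendered content is submitted at that vsync and only gets scanned out at the one after it.

```python
import math

REFRESH_HZ = 60                 # assumed refresh rate, for illustration only
FRAME_MS = 1000 / REFRESH_HZ    # time between two vsyncs


def next_vsync(t_ms):
    """Time of the first vsync strictly after t_ms (vsyncs at 0, FRAME_MS, ...)."""
    return (math.floor(t_ms / FRAME_MS) + 1) * FRAME_MS


def cursor_display_time(input_ms):
    """Hardware cursor: position is written just before the next scanout."""
    return next_vsync(input_ms)


def frame_display_time(input_ms):
    """Rendered content: the frame is built during the next refresh interval
    and becomes visible at the vsync after that (asynchronous GPU pipeline)."""
    return next_vsync(input_ms) + FRAME_MS


if __name__ == "__main__":
    for t in (1.0, 8.0, 15.0, 20.0):
        c = cursor_display_time(t)
        f = frame_display_time(t)
        print(f"input at {t:5.1f} ms: cursor {c:6.2f} ms, frame {f:6.2f} ms, gap {f - c:5.2f} ms")
```

Under these assumptions the gap is always exactly one refresh interval, no matter where in the interval the input lands. That's the "1 frame sooner" I'm describing.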