Another big chunk of this likely happened when they hardened the graphics subsystem for security. Win32 user calls are unbelievably expensive nowadays. SendMessage etc. have a ton of overhead.
Another chunk is likely the sheer number of expensive DLLs that need to be loaded and initialized with most apps. For example, IIRC, the moment you load COM or WinSock DLLs, your app stops loading snappily. Pretty much anything will load COM even without intending to.
Another chunk is IMM - the ctfmon process you love, for multi-language/keyboard support. ImmDisable(0) can make loading a bit snappier, but then good luck with keyboard switching and the like. It uses window hooks, which are slow Win32 calls as mentioned.
People think it's just a matter of writing plain Win32, but that's not the whole story, although it certainly helps compared to more heavyweight frameworks.