CPUs are surprisingly good at dealing with this in their store queues. I see this write-all-and-increment-some technique used a lot in optimized code, like branchless left-pack routines or overcopying in the copy handler of an LZ/Deflate decompressor.