This part:
The one thing I’m left scratching my head over is the length field. If I have 0x20 bytes of image data to send over, I actually need to put 0x10 into that field.
Made me think the protocol simply assumes the payload is always a multiple of 2 bytes, so it expresses the length in 16-bit "words" instead of bytes. That would not be unheard of, and it's even kind of smart: the same field can then describe twice as many bytes.
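If that guess is right, the conversion would look something like the sketch below. This is just my assumption about how the field works, not anything confirmed by the protocol itself, and `length_field_in_words` is a made-up helper name for illustration.

```python
def length_field_in_words(payload: bytes) -> int:
    """Hypothetical helper: value for the length field, ASSUMING it counts
    16-bit words (2 bytes each) rather than bytes."""
    if len(payload) % 2 != 0:
        # Under the word-count assumption an odd byte count can't be represented.
        raise ValueError("payload must be an even number of bytes")
    return len(payload) // 2


payload = bytes(0x20)                       # 0x20 bytes of dummy image data
print(hex(length_field_in_words(payload)))  # 0x10
```

That matches the 0x20 bytes / 0x10 field value from the quote, but one data point fits other explanations too, so it would be worth trying a few other payload sizes.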