This was a pretty heavy lift for us to get out which was why it took a while. In addition to writing new image processing routines, a vision encoder, and doing cross attention, we also ended up re-architecting the way the models get run by the scheduler. We'll have a technical blog post soon about all the stuff that ended up changing.
replies(3):