We talked about this model in some depth on the last Pretrained episode:
https://youtu.be/5weFerGhO84?si=Eh_92_9PPKyiTU_h&t=1743
Some interesting takeaways imo:
- Reuses existing model backbones for text encoding & semantic tokens (why reinvent the wheel if you don't need to?); there's a rough sketch of this after the list
- Trains on a whole lot of synthetic captions of varying lengths, presumably generated with some existing vision LLM; see the data-prep sketch below
- Solid text rendering comes from including all OCR'd text from the ground-truth image in the training captions. That seems to match how Nano Banana Pro got so good at text too; I've seen its thinking tokens sketch out exactly what text to put in the image before it renders.
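
To make the first bullet concrete, here's a minimal sketch of what backbone reuse can look like: a frozen, pretrained text encoder supplies conditioning embeddings, so only the generator itself needs training. The T5 checkpoint is purely my placeholder; the episode doesn't pin down exactly which backbones this model uses.

```python
# Minimal sketch of the backbone-reuse idea: a frozen pretrained text
# encoder produces conditioning embeddings for the image generator.
# The T5 checkpoint is a placeholder, not the model's actual encoder.
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-base")
text_encoder = T5EncoderModel.from_pretrained("google/t5-v1_1-base")
text_encoder.requires_grad_(False)  # frozen backbone, never updated
text_encoder.eval()

@torch.no_grad()
def encode_prompt(prompt: str) -> torch.Tensor:
    """Map a caption to conditioning embeddings for the generator."""
    tokens = tokenizer(prompt, return_tensors="pt", truncation=True)
    return text_encoder(**tokens).last_hidden_state  # (1, seq_len, d_model)

cond = encode_prompt('a hand-painted sign that reads "OPEN"')
print(cond.shape)  # e.g. torch.Size([1, 12, 768])
```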
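
And a hypothetical sketch of the data side from the last two bullets: captions at randomized target lengths, with any OCR'd text spelled out verbatim. `caption_with_vlm`, the length hints, and the caption template are all stand-ins I made up; only the general recipe comes from the episode.

```python
# Hypothetical data-prep sketch: a vision LLM writes captions at
# randomized lengths, and text OCR'd from the ground-truth image is
# appended verbatim so the model learns to tie glyphs to pixels.
import random
import pytesseract
from PIL import Image

LENGTH_HINTS = [
    "in one short sentence",
    "in two or three sentences",
    "in a detailed paragraph",
]

def caption_with_vlm(image: Image.Image, length_hint: str) -> str:
    # Stand-in for whatever vision-LLM captioner was actually used.
    return f"A placeholder description, written {length_hint}."

def build_training_caption(path: str) -> str:
    image = Image.open(path)
    caption = caption_with_vlm(image, random.choice(LENGTH_HINTS))
    ocr_text = pytesseract.image_to_string(image).strip()
    if ocr_text:
        # Spelling out the exact string is what makes text render well.
        caption += f' The image contains the text: "{ocr_text}".'
    return caption
```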