This quote is important, but in isolation it's not clear that they are claiming to have beaten this problem: they are saying the new model, voyage-multimodal-3, instead identifies linked concepts across modalities. That would indeed be pretty cool -- if there is a latent space that could cluster the same idea, whether it is represented visually or in text.
> ... the vectors truly capture the semantic content contained in the screenshots. This robustness is due to the model’s unique approach of processing all input modalities through the same backbone.
With that said, I think this benchmark is a pretty narrow way of thinking about multi-modal embedding. Having text embed close to images of related text is cool and convenient, but it doesn't necessarily extend to other kinds of cross-modal relatedness (e.g. the word "rabbit" vs. a photo of a rabbit). And on the narrow goal of indexing document images, I suspect there are other techniques that could also work quite well.
This seems like a great opportunity for a new benchmark dataset with multi-modal concept representations beyond media-of-text.
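To make that concrete, here is a rough sketch of how such a benchmark might be scored: retrieve the matching item across modalities and report recall@1, once for text vs. screenshots of text and once for text vs. photos of the concept. Everything here is an assumption for illustration: the function names are made up, and the vectors are hand-written toy stand-ins, not output from voyage-multimodal-3 or any real model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def recall_at_1(queries, targets):
    """Fraction of queries whose most similar target shares their index.

    Row i of `queries` and row i of `targets` are assumed to represent
    the same concept in two different modalities.
    """
    hits = 0
    for i, q in enumerate(queries):
        sims = [cosine(q, t) for t in targets]
        if sims.index(max(sims)) == i:
            hits += 1
    return hits / len(queries)

# Toy stand-in embeddings for three concepts (hypothetical values).
text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Screenshots of the same text: assumed to land close to the text vectors.
screenshots = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
# Photos of the concept itself: assumed to be less cleanly aligned.
photos = [[0.2, 0.9, 0.1], [0.1, 0.8, 0.3], [0.1, 0.2, 0.9]]

print("text -> screenshots recall@1:", recall_at_1(text, screenshots))
print("text -> photos recall@1:", recall_at_1(text, photos))
```

The interesting number would be the gap between the two scores: a model could do very well on the media-of-text case and still fall short on the "rabbit" vs. photo-of-a-rabbit case.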