
djoldman:
This is a key observation that is simple and intuitive:

>All CLIP-like models perform poorly on mixed-modality search due to a phenomenon known as the modality gap. As illustrated in the figure below, the closest vector to the snippet “I address you, members of the Seventy-Seventh Congress…” is not its screenshot, but other texts. This leads to search results that are skewed towards items of the same modality; in other words, text vectors will be closer to irrelevant texts than relevant images in the embedding space.
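For anyone who wants to see the gap concretely, here is a rough sketch using an off-the-shelf CLIP checkpoint (assuming the sentence-transformers "clip-ViT-B-32" model; the file name and example texts are placeholders):

    import numpy as np
    from PIL import Image
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("clip-ViT-B-32")

    texts = [
        "I address you, members of the Seventy-Seventh Congress...",
        "An unrelated sentence about the weather.",
    ]
    # Screenshot of the first text; the path is a placeholder.
    images = [Image.open("speech_screenshot.png")]

    text_emb = model.encode(texts, normalize_embeddings=True)
    img_emb = model.encode(images, normalize_embeddings=True)

    # Text-to-text similarity tends to beat text-to-image similarity,
    # even though the image is the "correct" match -- the modality gap.
    print("text[0] vs text[1]: ", float(text_emb[0] @ text_emb[1]))
    print("text[0] vs image[0]:", float(text_emb[0] @ img_emb[0]))

    # One common summary number: distance between the modality centroids.
    print("centroid gap:", np.linalg.norm(text_emb.mean(0) - img_emb.mean(0)))

Even with the matching screenshot in the pool, the unrelated text usually scores higher, which is exactly the skew described above.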

tugdual:
They could solve it with multimodal mixup, a technique that keeps the latent gap between the two modalities from growing large: https://arxiv.org/abs/2203.03897
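For context, a minimal sketch of the idea in PyTorch -- a loose reading of that paper's multimodal mixup (interpolate paired image and text embeddings and use the mixtures as hard negatives), not the authors' reference implementation; temperature and alpha are illustrative defaults:

    import torch
    import torch.nn.functional as F

    def multimodal_mixup_loss(img_emb, txt_emb, temperature=0.07, alpha=1.0):
        """img_emb, txt_emb: (batch, dim) embeddings of paired images and texts."""
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        n = img_emb.size(0)
        labels = torch.arange(n, device=img_emb.device)

        # Standard CLIP-style (InfoNCE) contrastive loss on the real pairs.
        logits = img_emb @ txt_emb.t() / temperature
        clip_loss = (F.cross_entropy(logits, labels)
                     + F.cross_entropy(logits.t(), labels)) / 2

        # Mix each image embedding with its paired text embedding and re-normalize,
        # producing points that lie between the two modality clusters.
        lam = torch.distributions.Beta(alpha, alpha).sample().to(img_emb.device)
        mixed = F.normalize(lam * img_emb + (1 - lam) * txt_emb, dim=-1)

        # Use mixtures from *other* pairs as hard negatives: each image must
        # still prefer its own text over any of those in-between points.
        pos = (img_emb * txt_emb).sum(-1, keepdim=True) / temperature   # (n, 1)
        neg = img_emb @ mixed.t() / temperature                         # (n, n)
        neg = neg.masked_fill(torch.eye(n, dtype=torch.bool,
                                        device=img_emb.device), float("-inf"))
        mix_loss = F.cross_entropy(torch.cat([pos, neg], dim=1),
                                   torch.zeros(n, dtype=torch.long,
                                               device=img_emb.device))
        return clip_loss + mix_loss

    # Example: plug into a fine-tuning step in place of the plain CLIP loss.
    img_emb = torch.randn(8, 512, requires_grad=True)  # from the image encoder
    txt_emb = torch.randn(8, 512, requires_grad=True)  # from the text encoder
    multimodal_mixup_loss(img_emb, txt_emb).backward()

Because the mixed embeddings sit between the image and text clusters, the encoders are pushed to pull matching pairs together across that middle ground, which is what shrinks the gap.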