All-in-one embedding model for interleaved text, images, and screenshots

1. FergusArgyll ◴[17 Nov 24 12:00 UTC] No.42163723[source]▶

I'm missing something. Shouldn't any llm that's 'natively multimodal' somehow include embeddings which are multi-modal? for ex here's googles blogpost on Gemini

  Until now, the standard approach to creating multimodal models involved 
  training separate components for different modalities and then stitching them 
  together to roughly mimic some of this functionality. These models can 
  sometimes be good at performing certain tasks, like describing images, but  
  struggle with more conceptual and complex reasoning.

  We designed Gemini to be natively multimodal, pre-trained from the start on 
  different modalities. Then we fine-tuned it with additional multimodal data to 
  further refine its effectiveness. This helps Gemini seamlessly understand and 
  reason about all kinds of inputs from the ground up, far better than existing 
  multimodal models — and its capabilities are state of the art in nearly every 
  domain.

replies(3): >>42163807 #>>42165329 #>>42167478 #

2. aabhay ◴[17 Nov 24 12:19 UTC] No.42163807[source]▶

>>42163723 (TP) #

LLM embedding contain super positions of many concepts so while they might predict the next token they don’t actually out perform contrastively pretrained embedding models.

3. fzliu ◴[17 Nov 24 17:09 UTC] No.42165329[source]▶

>>42163723 (TP) #

Because LLMs such as Gemini -- and other causal language models more broadly -- are trained on next token prediction, the vectors that you get from pooling the output token embeddings aren't that useful for RAG or semantic search compared to what you get from actual embedding models.

One distinction to make here is that token embeddings and the embeddings/vectors that are output from embedding models are related but separate concepts. There are numerous token embeddings (one per token) which become contextualized as they propagate through the transformer, while there is a single vector/embedding that is output by embedding models (one per input data, such as long text, photo, or document screenshot).

4. refulgentis ◴[17 Nov 24 21:30 UTC] No.42167478[source]▶

>>42163723 (TP) #

Fwiw if the other replies aren't clear: change "embeddings" to "List<double> that some layer of my AI model produces" (that's not exactly correct, it's slightly more specific than that, but in this context it's correct)

LLMs, including multimodal LLMs, do have embeddings, but they're embeddings learned by generating text, instead of finding similar documents