344 points LorenDB | 7 comments
1. oezi ◴[] No.44003925[source]
I wish multimodal implied text, image, and audio (and potentially video). If a model supports only image generation or image analysis, vision model seems the more appropriate term.

We should aim to distinguish multimodal models such as Qwen2.5-Omni from vision models such as Qwen2.5-VL.

In this sense: Ollama's new engine adds vision support.
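
For reference, image input through Ollama's REST API looks roughly like this (a minimal sketch, not an official example; the model tag and file name are placeholders, and it assumes a local Ollama server on the default port with a vision-capable model pulled):

    # Sketch: send one image to a vision-capable model via Ollama's
    # /api/generate endpoint. Assumes a local server at localhost:11434.
    import base64
    import requests

    with open("photo.jpg", "rb") as f:                    # placeholder image
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "qwen2.5vl",          # placeholder model tag
            "prompt": "Describe this image.",
            "images": [image_b64],         # base64-encoded image payload
            "stream": False,
        },
        timeout=120,
    )
    print(resp.json()["response"])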

replies(2): >>44006219 #>>44007313 #
2. ◴[] No.44006219[source]
3. prettyblocks ◴[] No.44007313[source]
I'm very interested in working with video inputs. Is it possible to do that with Qwen2.5-Omni and Ollama?
replies(3): >>44008675 #>>44009579 #>>44011015 #
4. oezi ◴[] No.44008675[source]
I have only tested Qwen2.5-Omni with audio, and it was hit-and-miss for my use case of audio tagging.
5. machinelearning ◴[] No.44009579[source]
What use cases are you interested in re: video?
replies(1): >>44011938 #
6. tough ◴[] No.44011015[source]
https://huggingface.co/blog/smolvlm
7. prettyblocks ◴[] No.44011938{3}[source]
I'm curious how effective these models would be at recognizing whether the input video was AI-generated or heavily manipulated. Also various things around face/object segmentation.
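
One rough way to experiment with that today is to sample frames and query a vision model per frame (sketch only, under the assumption of OpenCV plus a local Ollama server; the model tag and prompt are placeholders):

    # Sketch: sample every Nth frame from a video with OpenCV and ask a
    # local vision model about each frame via Ollama's REST API.
    # Assumes `pip install opencv-python requests` and a running Ollama
    # server with a vision-capable model pulled (model tag is a placeholder).
    import base64
    import cv2
    import requests

    VIDEO = "input.mp4"   # placeholder video file
    EVERY_N = 30          # roughly one frame per second at 30 fps

    cap = cv2.VideoCapture(VIDEO)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % EVERY_N == 0:
            ok_jpg, jpg = cv2.imencode(".jpg", frame)  # encode frame as JPEG
            if ok_jpg:
                b64 = base64.b64encode(jpg.tobytes()).decode("utf-8")
                r = requests.post(
                    "http://localhost:11434/api/generate",
                    json={
                        "model": "qwen2.5vl",  # placeholder vision model
                        "prompt": "Does this frame show signs of AI generation "
                                  "or heavy manipulation? Answer briefly.",
                        "images": [b64],
                        "stream": False,
                    },
                    timeout=120,
                )
                print(idx, r.json().get("response", "").strip())
        idx += 1
    cap.release()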