344 points LorenDB | 7 comments
1. oezi ◴[] No.44003925[source]
I wish multimodal implied text, image, and audio (and potentially video). If a model supports only image generation or image analysis, vision model seems the more appropriate term.

We should aim to distinguish multimodal models such as Qwen2.5-Omni from vision models such as Qwen2.5-VL.

In this sense: Ollama's new engine adds vision support.
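
For reference, image input through Ollama's REST API looks roughly like this (a minimal sketch, not an official example; the model tag and file name are placeholders, and it assumes a local Ollama server on the default port with a vision-capable model pulled):

    # Sketch: send one image to a vision-capable model via Ollama's
    # /api/generate endpoint. Assumes a local server at localhost:11434.
    import base64
    import requests

    with open("photo.jpg", "rb") as f:                    # placeholder image
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "qwen2.5vl",          # placeholder model tag
            "prompt": "Describe this image.",
            "images": [image_b64],         # base64-encoded image payload
            "stream": False,
        },
        timeout=120,
    )
    print(resp.json()["response"])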

replies(2): >>44006219 #>>44007313 #
2. ◴[] No.44006219[source]
3. prettyblocks ◴[] No.44007313[source]
I'm very interested in working with video inputs. Is it possible to do that with Qwen2.5-Omni and Ollama?
replies(3): >>44008675 #>>44009579 #>>44011015 #
4. oezi ◴[] No.44008675[source]
I have only tested Qwen2.5-Omni with audio, and it was hit-and-miss for my use case of audio tagging.
5. machinelearning ◴[] No.44009579[source]
What use cases are you interested in re: video?
replies(1): >>44011938 #
6. tough ◴[] No.44011015[source]
https://huggingface.co/blog/smolvlm
7. prettyblocks ◴[] No.44011938{3}[source]
I'm curious how effective these models would be at recognizing whether the input video was AI-generated or heavily manipulated. Also various things around face/object segmentation.
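
One rough way to experiment with that today is to sample frames and query a vision model per frame (sketch only, under the assumption of OpenCV plus a local Ollama server; the model tag and prompt are placeholders):

    # Sketch: sample every Nth frame from a video with OpenCV and ask a
    # local vision model about each frame via Ollama's REST API.
    # Assumes `pip install opencv-python requests` and a running Ollama
    # server with a vision-capable model pulled (model tag is a placeholder).
    import base64
    import cv2
    import requests

    VIDEO = "input.mp4"   # placeholder video file
    EVERY_N = 30          # roughly one frame per second at 30 fps

    cap = cv2.VideoCapture(VIDEO)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % EVERY_N == 0:
            ok_jpg, jpg = cv2.imencode(".jpg", frame)  # encode frame as JPEG
            if ok_jpg:
                b64 = base64.b64encode(jpg.tobytes()).decode("utf-8")
                r = requests.post(
                    "http://localhost:11434/api/generate",
                    json={
                        "model": "qwen2.5vl",  # placeholder vision model
                        "prompt": "Does this frame show signs of AI generation "
                                  "or heavy manipulation? Answer briefly.",
                        "images": [b64],
                        "stream": False,
                    },
                    timeout=120,
                )
                print(idx, r.json().get("response", "").strip())
        idx += 1
    cap.release()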