I wish "multimodal" implied text, image, and audio (plus potentially video). If a model supports only image generation or image analysis, "vision model" seems the more appropriate term.
We should aim to distinguish multimodal models such as Qwen2.5-Omni from vision models such as Qwen2.5-VL.
In this sense, Ollama's new engine adds vision support.
replies(2):