Ollama 0.4 is released with support for Meta's Llama 3.2 Vision models locally

(ollama.com)

182 points BUFU | 5 comments | 06 Nov 24 21:10 UTC


Patrick_Devine ◴[06 Nov 24 22:36 UTC] No.42070613[source]▶

>>42069453 (OP) #

This was a pretty heavy lift for us to get out, which is why it took a while. In addition to writing new image processing routines, a vision encoder, and cross attention, we also ended up re-architecting the way the models get run by the scheduler. We'll have a technical blog post soon about all the stuff that ended up changing.

replies(4): >>42070644 #>>42071917 #>>42072723 #>>42076774 #

exe34 ◴[06 Nov 24 22:38 UTC] No.42070644[source]▶

>>42070613 #

did you feed back into llama.cpp?

also, can it do grounding like cogvlm?

either way, great job!

replies(1): >>42070949 #

1. Patrick_Devine ◴[06 Nov 24 23:04 UTC] No.42070949[source]▶

>>42070644 #

It's difficult because we actually ditched a lot of the C++ code with this change and rewrote it in Go. Specifically, server.cpp has been excised (which was deprecated by llama.cpp anyway), and the image processing routines are all written in Go as well. We also bypassed clip.cpp and wrote our own routines for the image encoder/cross attention (using GGML).

The hope is to be able to get more multimodal models out soon. I'd like to see if we can get Pixtral and Qwen2.5-vl in relatively soon.

replies(2): >>42072553 #>>42074277 #

2. qrios ◴[07 Nov 24 02:11 UTC] No.42072553[source]▶

>>42070949 (TP) #

> Specifically server.cpp has been excised (which was deprecated by llama.cpp anyway)

Is there any more specific info available about who (llama.cpp or Ollama) removed what, where? As far as I can see, the server is still part of llama.cpp.

And more generally: Is this the moment when Ollama and Llama part ways?

3. exe34 ◴[07 Nov 24 07:10 UTC] No.42074277[source]▶

>>42070949 (TP) #

that's cool, thank you! no grounding then? I don't get the impression it's actually part of llama 3.2v, but I thought it was worth checking with somebody who might have the experience!

replies(1): >>42079858 #

4. Patrick_Devine ◴[07 Nov 24 19:15 UTC] No.42079858[source]▶

>>42074277 #

I haven't looked at cogvlm, but if you mean doing bounding boxes w/ classification, I'd love to support models like that (like detectron2) in the future.

replies(1): >>42080375 #

5. exe34 ◴[07 Nov 24 20:06 UTC] No.42080375{3}[source]▶

>>42079858 #

I'm not sure what you mean by classification, but something like it, yes:

"what are the coordinates of the bounding box for the rubber duck in the image [img]" >>> "[10,50,200,300]"
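For readers wondering what consuming a grounding reply like that might look like, here is a minimal sketch in Python. It assumes the model returns exactly a bracketed list of four integers, and it assumes the order is [x_min, y_min, x_max, y_max]; the thread does not say which convention cogvlm uses. `BoundingBox` and `parse_bounding_box` are hypothetical names made up for this illustration, not part of Ollama or any model's API.

```python
import json
from typing import NamedTuple


class BoundingBox(NamedTuple):
    # Assumed coordinate order; the thread does not specify one.
    x_min: int
    y_min: int
    x_max: int
    y_max: int


def parse_bounding_box(reply: str) -> BoundingBox:
    """Parse a reply such as "[10,50,200,300]" into a BoundingBox.

    Raises ValueError if the reply is not a list of four integers
    or if the corners are inverted.
    """
    # json.loads raises JSONDecodeError (a ValueError subclass) on bad input.
    values = json.loads(reply.strip())
    if not isinstance(values, list) or len(values) != 4:
        raise ValueError(f"expected four coordinates, got: {reply!r}")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        raise ValueError(f"coordinates must be integers, got: {reply!r}")
    box = BoundingBox(*values)
    if box.x_min > box.x_max or box.y_min > box.y_max:
        raise ValueError(f"inverted box corners: {reply!r}")
    return box


if __name__ == "__main__":
    box = parse_bounding_box("[10,50,200,300]")
    print(box.x_max - box.x_min, box.y_max - box.y_min)
```

In practice a model may wrap the list in extra prose, so real code would likely need to extract the bracketed part first; this sketch rejects such replies rather than guessing.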
