Most active commenters

Patrick_Devine(3)
exe34(3)

Popular/hot comments

>>42070613 #

Ollama 0.4 is released with support for Meta's Llama 3.2 Vision models locally

(ollama.com)

1. Patrick_Devine ◴[06 Nov 24 22:36 UTC] No.42070613[source]▶

>>42069453 (OP) #

This was a pretty heavy lift for us to get out which was why it took a while. In addition to writing new image processing routines, a vision encoder, and doing cross attention, we also ended up re-architecting the way the models get run by the scheduler. We'll have a technical blog post soon about all the stuff that ended up changing.

replies(4): >>42070644 #>>42071917 #>>42072723 #>>42076774 #

2. exe34 ◴[06 Nov 24 22:38 UTC] No.42070644[source]▶

>>42070613 #

did you feed back into llama.cpp?

also, can it do grounding like cogvlm?

either way, great job!

replies(1): >>42070949 #

3. inasring ◴[06 Nov 24 22:52 UTC] No.42070824[source]▶

>>42069453 (OP) #

Can it run the quantized models?

replies(1): >>42071506 #

4. Patrick_Devine ◴[06 Nov 24 23:04 UTC] No.42070949{3}[source]▶

>>42070644 #

It's difficult because we actually ditched a lot of the c++ code with this change and rewrote it in golang. Specifically server.cpp has been excised (which was deprecated by llama.cpp anyway), and the image processing routines are all written in go as well. We also bypassed clip.cpp and wrote our own routines for the image encoder/cross attention (using GGML).

The hope is to be able to get more multimodal models out soon. I'd like to see if we can get Pixtral and Qwen2.5-vl in relatively soon.

replies(2): >>42072553 #>>42074277 #

5. vasilipupkin ◴[06 Nov 24 23:06 UTC] No.42070973[source]▶

>>42069453 (OP) #

how likely is it to run on a reasonably new windows laptop?

replies(1): >>42071266 #

6. ac29 ◴[06 Nov 24 23:35 UTC] No.42071266[source]▶

>>42070973 #

With 16GB of RAM these vision models will run. How quickly depends on a lot of factors.

7. o11c ◴[06 Nov 24 23:53 UTC] No.42071467[source]▶

>>42069453 (OP) #

Did they fix multiline editing yet? Any interactive input that wraps across 3+ lines seems to become off-by-one when editing (but fine if you only append?), and this will be only more common with long filenames being added. And triple-quote breaks editing entirely.

How does this address the security concern of filenames being detected and read when not wanted?

8. fallingsquirrel ◴[06 Nov 24 23:58 UTC] No.42071506[source]▶

>>42070824 #

Supported quantizations: https://ollama.com/library/llama3.2-vision/tags

9. ◴[07 Nov 24 00:12 UTC] No.42071635[source]▶

>>42069453 (OP) #

10. zozbot234 ◴[07 Nov 24 00:44 UTC] No.42071917[source]▶

>>42070613 #

How long until Vulkan Compute support is merged into ollama? There is an active pull request at https://github.com/ollama/ollama/pull/5059 but it seems to be stalled with no reviews.

11. qrios ◴[07 Nov 24 02:11 UTC] No.42072553{4}[source]▶

>>42070949 #

> Specifically server.cpp has been excised (which was deprecated by llama.cpp anyway)

Is there any more specific info available about who (llama.cpp or Ollama) removed what, where? As far as I can see, the server is still part of llama.cpp.

And more generally: Is this the moment when Ollama and Llama part ways?

12. csomar ◴[07 Nov 24 02:34 UTC] No.42072723[source]▶

>>42070613 #

Any info of when we will get the 11B and 90B models?

replies(1): >>42076770 #

13. exe34 ◴[07 Nov 24 07:10 UTC] No.42074277{4}[source]▶

>>42070949 #

that's cool thank you! no grounding then? I don't get the impression it's actually part of llama 3.2v but I thought it's worth checking with somebody who might have the experience!

replies(1): >>42079858 #

14. papruapap ◴[07 Nov 24 07:47 UTC] No.42074494[source]▶

>>42069453 (OP) #

I thought llamacpp didn't support images yet, has that changed or ollama is using a different library for this?

replies(1): >>42077276 #

15. zamderax ◴[07 Nov 24 08:33 UTC] No.42074781[source]▶

>>42069453 (OP) #

Does anyone know if this will run on the iPhone 15 (6GB) or iPhone 16 (8GB)

16. ei23 ◴[07 Nov 24 08:39 UTC] No.42074823[source]▶

>>42069453 (OP) #

Is Qwen2VL supported too? Its a great vision model, works in comfyui. Llama3.2s vision seems to be super censored...

17. sgt101 ◴[07 Nov 24 11:22 UTC] No.42075703[source]▶

>>42069453 (OP) #

I tested the small model with a few images from Clevr. On first blush I am afraid it didn't do very well at all, it got object counts totally wrong and struggled to identify shapes and colours.

Still, it seems to understand what's in the images in general (cones and spheres and cubes), and the fact that it runs on my mac book at all is basically amazing.

replies(1): >>42078898 #

18. jjice ◴[07 Nov 24 14:09 UTC] No.42076770{3}[source]▶

>>42072723 #

Not sure if I'm misunderstanding, but they're live: https://ollama.com/library/llama3.2-vision

Ran the 11B yesterday and it worked great.

replies(1): >>42083795 #

19. jjice ◴[07 Nov 24 14:10 UTC] No.42076774[source]▶

>>42070613 #

Y'all did a fantastic job! This works great and to have it all right there inside of Ollama is a huge step for local model execution.

20. SCLeo ◴[07 Nov 24 15:08 UTC] No.42077276[source]▶

>>42074494 #

I believe they wrote their own image handling and did not contribute back to llama.cpp.

replies(1): >>42105882 #

21. EdwardKrayer ◴[07 Nov 24 17:37 UTC] No.42078898[source]▶

>>42075703 #

My initial testing was with charts - I've been waiting on local vision models to be good enough to feed technical documents and my initial testing is looking very good. Example:

https://i.imgur.com/1ETREP9.png

replies(1): >>42086317 #

22. Patrick_Devine ◴[07 Nov 24 19:15 UTC] No.42079858{5}[source]▶

>>42074277 #

I haven't looked at cogvlm, but if you mean doing bounding boxes w/ classification, I'd love to support models like that (like detectron2) in the future.

replies(1): >>42080375 #

23. exe34 ◴[07 Nov 24 20:06 UTC] No.42080375{6}[source]▶

>>42079858 #

I'm not sure what you mean by classification, but something like it, yes:

"what are the coordinates of the bounding box for the rubber duck in the image [img]" >>> "[10,50,200,300]"

24. csomar ◴[08 Nov 24 03:22 UTC] No.42083795{4}[source]▶

>>42076770 #

These are vision optimized, though? Or that doesn't make them perform less for coding tasks?

25. sgt101 ◴[08 Nov 24 12:24 UTC] No.42086317{3}[source]▶

>>42078898 #

I've tried with some ppt images rather than Clevr ones and it does much better. It can count circles and triangles and differentiates between them quite well. It can recognise the colours of the objects as well.

I think that the faux 3d of clevr images is too much for the model, it's interesting because much smaller pre-transformer specialist models were very good at clevr.

26. papruapap ◴[11 Nov 24 10:03 UTC] No.42105882{3}[source]▶

>>42077276 #

oh sad :(, hope they upstream it at some point.

↑