Ollama's new engine for multimodal models

(ollama.com)

353 points LorenDB | 2 comments | 16 May 25 01:43 UTC | HN request time: 0.522s | source

Show context

simonw ◴[16 May 25 04:38 UTC] No.44001886[source]▶

The timing on this is a little surprising given llama.cpp just finally got a (hopefully) stable vision feature merged into main: https://simonwillison.net/2025/May/10/llama-cpp-vision/

Presumably Ollama had been working on this for quite a while already - it sounds like they've broken their initial dependency on llama.cpp. Being in charge of their own destiny makes a lot of sense.

replies(1): >>44001924 #

lolinder ◴[16 May 25 04:48 UTC] No.44001924[source]▶

>>44001886 #

Do you know what exactly the difference is with either of these projects adding multimodal support? Both have supported LLaVA for a long time. Did that require special casing that is no longer required?

I'd hoped to see this mentioned in TFA, but it kind of acts like multimodal is totally new to Ollama, which it isn't.

replies(2): >>44001952 #>>44002109 #

refulgentis ◴[16 May 25 05:31 UTC] No.44002109[source]▶

>>44001924 #

It's a turducken of crap from everyone but ngxson and Hugging Face and llama.cpp in this situation.

llama.cpp did have multimodal, I've been maintaining an integration for many moons now. (Feb 2024? Original LLaVa through Gemma 3)

However, this was not for mere mortals. It was not documented and had gotten unwieldy, to say the least.

ngxson (HF employee) did a ton of work to get gemma3 support in, and had to do it in a separate binary. They dove in and landed a refactored backbone that is presumably more maintainable and on track to be in what I think of as the real Ollama, llama.cpp's server binary.

As you well note, Ollama is Ollamaing - I joked, once, that the median llama.cpp contribution from Ollama is a driveby GitHub comment asking when a feature will land in llama-server, so it can be copy-pasted into Ollama.

It's really sort of depressing to me because I'm just one dude, it really wasn't that hard to support it (it's one of a gajillion things I have to do, I'd estimate 2 SWE-weeks at 10 YOE, 1.5 SWE-days for every model release), and it's hard to get attention for detailed work in this space with how much everyone exaggerates and rushes to PR.

EDIT: Coming back after reading the blog post, and I'm 10x as frustrated. "Support thinking / reasoning; Tool calling with streaming responses" --- this is table stakes stuff that was possible eons ago.

I don't see any sign of them doing anything specific in any of the code they link, the whole thing reads like someone carefully worked with an LLM to present a maximalist technical-sounding version of the llama.cpp stuff and frame it as if they worked with these companies and built their own thing. (note the very careful wording on this, e.g. in the footer the companies are thanked for releasing the models)

I think it's great that they have a nice UX that helps people run llama.cpp locally without compiling, but it's hard for me to think of a project I've been more by turned off by in my 37 years on this rock.

replies(3): >>44002251 #>>44002410 #>>44002628 #

nolist_policy ◴[16 May 25 07:21 UTC] No.44002628[source]▶

>>44002109 #

For one Ollama supports interleaved sliding window attention for Gemma 3 while llama.cpp doesn't.[0] iSWA reduces kv cache size to 1/6.

Ollama is written in golang so of course they can not meaningfully contribute that back to llama.cpp.

[0] https://github.com/ggml-org/llama.cpp/issues/12637

replies(2): >>44002699 #>>44008326 #

refulgentis ◴[16 May 25 18:13 UTC] No.44008326[source]▶

>>44002628 #

It's impossible to meaningfully contribute to the C library you call from Go because you're calling it from Go? :)

We can see the weakness of this argument given it is unlikely any front-end is written in C, and then noting it is unlikely ~0 people contribute to llama.cpp.

replies(1): >>44009809 #

magicalhippo ◴[16 May 25 21:03 UTC] No.44009809[source]▶

>>44008326 #

They can of course meaningfully contribute new C++ code to llama.cpp, which they then could later use downstream in Go.

What they cannot meaningfully do is write Go code that solves their problems and upstream those changes to llama.cpp.

The former requires they are comfortable writing C++, something perhaps not all Go devs are.

replies(1): >>44010943 #

refulgentis ◴[17 May 25 00:12 UTC] No.44010943[source]▶

>>44009809 #

I'd love to be able to take this into account, step back, and say "Ah yes - there is non-zero probability they are materially incapable of contributing back to their dependency" - in practice, if you're comfortable writing SWA in Go, you're going to be comfortable writing it in C++, and they are writing C++ already.

(it's also worth looking at the code linked for the model-specific impls, this isn't exactly 1000s of lines of complicated code. To wit, while they're working with Georgi...why not offer to help land it in llama.cpp?)

replies(1): >>44011647 #

magicalhippo ◴[17 May 25 02:44 UTC] No.44011647[source]▶

>>44010943 #

Perhaps for SWA.

For the multimodal stuff it's a lot clear cut. Ollama used the image processing libraries from Go, while in llama.cpp they ended up rolling their own image processing routines.

replies(1): >>44011701 #

refulgentis ◴[17 May 25 02:57 UTC] No.44011701[source]▶

>>44011647 #

Citation?

My groundbreaking implementation passes it RGB bytes, passes em through the image projector, and put the tokens in the prompt.

And I cannot imagine sure why the inference engine would be more concerned with it than that.

Is my implementation a groundbreaking achievement worth rendering llama.cpp a footnote, because I use Dart image-processing libraries?

replies(1): >>44011801 #

1. magicalhippo ◴[17 May 25 03:24 UTC] No.44011801[source]▶

>>44011701 #

> Citation?

https://github.com/ollama/ollama/issues/7300#issuecomment-24...

https://github.com/ggml-org/llama.cpp/blob/3e0be1cacef290c99...

Anyway my point was just that it's not as easy as just pushing a patch upstream, like it is in many other projects. It would require a new or different implementation.

replies(1): >>44011995 #

2. refulgentis ◴[17 May 25 04:13 UTC] No.44011995[source]▶

>>44011801 (TP) #

I see, they can't figure out how to contribute a few lines of C++ because we have a link where someone says they can't figure out how to contribute C++ code only Go. :)

There's a couple things I want to impart: #1) empathy is important. One comment about one feature from maybe an ollama core team member doesn't mean people are rushing to waste their time and look mean calling them out for poor behavior. #2) half formed thought: something of what we might call the devil lives in a common behavior pattern that I have to resist myself: rushing in, with weak arguments, to excuse poor behavior. Sometimes I act as if litigating one instance of it, and finding a rationale for it in that instance, makes their behavior pattern reasonable.

Riffing, an analogy someone else made is particularly adept: ollama is to llama.cpp as handbrake is to ffmpeg. I cut my teeth on C++ via handbrake almost 2 decades ago, and we wouldn't be caught dead acting this way. At the very least for fear of embarrassment. What I didnt anticipate is that people will make contrarian arguments on your behalf no matter what you do.

↑