Where "supporting" a model doesn't mean what you think it means for cpp
Between that and the long saga with vision models having only partial support, with a CLI tool, and no llama-server support (they only fixed all that very recently) the fact of the matter is that ollama is moving faster and implementing what people want before lama.cpp now
And it will finally shut down all the people who kept copy pasting the same criticism of ollama "it's just a llama.cpp wrapper why are you not using cpp instead"
Presumably Ollama had been working on this for quite a while already - it sounds like they've broken their initial dependency on llama.cpp. Being in charge of their own destiny makes a lot of sense.
I'd hoped to see this mentioned in TFA, but it kind of acts like multimodal is totally new to Ollama, which it isn't.
I don't fully understand Ollama's timeline and strategy yet.
Went with my own wrapper around llama.cpp and stable-diffusion.cpp with optional prompting hosted if I don’t like the result so much, but it makes a good start for hosted to improve on.
Also obfuscates any requests sent to hosted, cause why feed them insight to my use case when I just want to double check algorithmic choices of local AI? The ground truth relationship func names and variable names imply is my little secret
llama.cpp did have multimodal, I've been maintaining an integration for many moons now. (Feb 2024? Original LLaVa through Gemma 3)
However, this was not for mere mortals. It was not documented and had gotten unwieldy, to say the least.
ngxson (HF employee) did a ton of work to get gemma3 support in, and had to do it in a separate binary. They dove in and landed a refactored backbone that is presumably more maintainable and on track to be in what I think of as the real Ollama, llama.cpp's server binary.
As you well note, Ollama is Ollamaing - I joked, once, that the median llama.cpp contribution from Ollama is a driveby GitHub comment asking when a feature will land in llama-server, so it can be copy-pasted into Ollama.
It's really sort of depressing to me because I'm just one dude, it really wasn't that hard to support it (it's one of a gajillion things I have to do, I'd estimate 2 SWE-weeks at 10 YOE, 1.5 SWE-days for every model release), and it's hard to get attention for detailed work in this space with how much everyone exaggerates and rushes to PR.
EDIT: Coming back after reading the blog post, and I'm 10x as frustrated. "Support thinking / reasoning; Tool calling with streaming responses" --- this is table stakes stuff that was possible eons ago.
I don't see any sign of them doing anything specific in any of the code they link, the whole thing reads like someone carefully worked with an LLM to present a maximalist technical-sounding version of the llama.cpp stuff and frame it as if they worked with these companies and built their own thing. (note the very careful wording on this, e.g. in the footer the companies are thanked for releasing the models)
I think it's great that they have a nice UX that helps people run llama.cpp locally without compiling, but it's hard for me to think of a project I've been more by turned off by in my 37 years on this rock.
Ollama appears to not properly credit llama.cpp: https://github.com/ollama/ollama/issues/3185 - this is a long-standing issue that hasn't been addressed.
This seems to have leaked into other projects where even when llama.cpp is being used directly, it's being credited to Ollama: https://github.com/ggml-org/llama.cpp/pull/12896
Ollama doesn't contributed to upstream (that's fine, they're not obligated to), but it's a bit weird that one of the devs claimed to have and uh, not really: https://www.reddit.com/r/LocalLLaMA/comments/1k4m3az/here_is... - that being said they seem to maintain their own fork so anyone could cherry pick stuff it they wanted to: https://github.com/ollama/ollama/commits/main/llama/llama.cp...
It'd be like if handbrake tried to pretend that they implemented all the video processing work, when it's dependent on libffmpeg for all of that.
Other than being a nice wrapper around llama.cpp, are there any meaningful improvements that they came up with that landed in llama.cpp?
I guess in this case with the introduction of libmtmd (for multi-modal support in llama.cpp) Ollama waited and did a git pull and now multi-modal + better vision support was here and no proper credit was given.
Yes, they had vision support via LLaVa models but it wasn't that great.
Well it's even sillier than that: I didn't realize that the timeline in the llama.cpp link was humble and matched my memory: it was the test binaries that changed. i.e. the API was refactored a bit and such but its not anything new under the sun. Also the llama.cpp they have has tool and thinking support. shrugs
The tooling was called llava but that's just because it was the first model -- multimodal models are/were consistently supported ~instantly, it was just your calls into llama.cpp needed to manage that,a nd they still do! - its just there's been some cleanup so there isn't one test binary for every model.
It's sillier than that in it wasn't even "multi-modal + better vision support was here" it was "oh we should do that fr if llama.cpp is"
On a more positive note, the big contributor I appreciate in that vein is Kobold contributed a ton of Vulkan work IIUC.
And another round of applause for ochafik: idk if this gentleman from Google is doing this in his spare time or fulltime for Google, but they have done an absolutely stunning amount of work to make tool calls and thinking systematically approachable, even building a header-only Jinja parser implementation and designing a way to systematize "blessed" overrides of the rushed silly templates that are inserted into models. Really important work IMHO, tool calls are what make AI automated and having open source being able to step up here significantly means you can have legit Sonnet-like agency in Gemma 3 12B, even Phi 4 3.8B to an extent.
In the early days of Docker, we had the debate of Docker vs LXC. At the time, Docker was mostly a wrapper over LXC and people were dismissing the great user experience improvements of Docker.
I agree however that the lack of acknowledgement to llama.cpp for a long time has been problematic. They acknowledge the project now.
[0]: https://github.com/ollama/ollama/blob/main/docs/modelfile.md
You have to support Vulkan if you care about consumer hardware. Ollama devs clearly don't.
Ollama is written in golang so of course they can not meaningfully contribute that back to llama.cpp.
Where do you imagine ggml is from?
> The llama.cpp project is the main playground for developing new features for the ggml library
-> https://github.com/ollama/ollama/tree/27da2cddc514208f4e2353...
(Hint: If you think they only write go in ollama, look at the commit history of that folder)
Ollama does, please try it.
See SimonW post here:
https://simonwillison.net/2025/May/10/llama-cpp-vision/
>If I understood it correctly
You understood it exactly like they wanted you to...
Ollama makes this trivial compared to llama.cpp, and so for me adds a lot of value due to this.
We should aim to distinguish multimodal modals such as Qwen2.5-Omni from Qwen2.5-VL.
In this sense: Ollama's new engine adds vision support.
End result for users like me though, is to have to duplicate +30GB large files just because I wanted to use the weights in Ollama and the rest of the ecosystem. So instead I use everything else that largely just works the same way, and not Ollama.
They are plainly going to capture the market, and switch to some "enterprise license" that lets them charge $, on the backs of other peoples work.
Question: what are cool and useful multi modal projects have people here built using local models?
I am looking for personal project ideas.
I'm not clear what they are called (or how implemented) — but perhaps 1) the initial prompt/context (that, for example, Grok has got in trouble with recently) and 2) the kind of saved context that allows ChatGPT to know things about your prompt-history so it can better answer future queries.
(My use of ollama has been pretty bare-bones and I have not seen anything covering these higher level features in -help.)
> Ollama is written in golang so of course they can not meaningfully contribute that back to llama.cpp.
llama.cpp consumes GGML.
ollama consumes GGML.
If they contribute upstream changes, they are contributing to llama.cpp.
The assertions that they:
a) only write golang
b) cannot upstream changes
Are both, categorically, false.
You can argue what 'meaningfully' means if you like. You can also believe whatever you like.
However, both (a) and (b), are false. It is not a matter of dispute.
> Whatever gotcha you think there is, exists only in your head.
There is no 'gotcha'. You're projecting. My only point is that any claim that they are somehow not able to contribute upstream changes only indicates a lack of desire or competence, not a lack of the technical capacity to do so.
I believe it keeps the model loaded across sessions, and possibly keeps the KV cache warm for ongoing sessions (but I doubt it, based on the API shape; I don't see a "session" parameter), but that's about it. Nothing seems to be written to disk.
Features like ChatGPT's "memories" or cross-chat context require a persistence layer that's probably best suited for a "frontend". Ollama's API does support passing in requests with history, for example: https://github.com/ollama/ollama/blob/main/docs/api.md#chat-...
Why shouldn't I go with llama.cpp, lmstudio, or ramalama (containers/RH); I will at least know what I am getting with each one.
Ramalama actually contributes quite a bit back to llama.cpp/whipser.cpp (more projects probably), while delivering a solution that works better for me.
https://github.com/ollama/ollama/pull/9650 https://github.com/ollama/ollama/pull/5059
ggml != llama.cpp, but llama.cpp and Ollama are both using ggml as a library.
After my attempt, I think chat is performant enough on my M1. Code gen was too slow for me. Image generation was 1-2 minutes for small pixel art sprites, which for my use case is fine to let churn for a while, but the image generation results were much worse than ChatGPT browser gives me out of the box. I do not know if poor image quality is due to machine constraints or me not understanding how to configure the checkpoint and models.
I would be interested to hear how an M3 or M4 Mini handles these things as those are fair affordable to pick up used.
But practically, I believe that Ollama just doesn't have a concept of server-side persistent state at the moment to even do such a thing.
I’m sure there’s more to the prompt and what to do with this newly generated messages array, but the gist is there.
If this is the case, an Ollama implementation shouldn’t be too difficult.
What is actually written: Top: 家和国盛 Left: 和谐生活人人舒畅迎新春 Right: 平安社会家家欢乐辞旧岁
What Ollama saw: Top: 盛和家国 (correct characters but wrong order) Left: It reads "新春" (new spring) as 舒畅 (comfortable) Right: 家家欢欢乐乐辞旧岁 (duplicates characters and omits the first four)
“Some of the development is currently happening in the llama.cpp and whisper.cpp repos” --https://github.com/ggml-org/ggml
The English translation, I thought was pretty spot on. We don't hide the mistakes of the models or fake the demos.
Overtime, of course I hope the models to improve much more
Based on this part:
> We set out to support a new engine that makes multimodal models first-class citizens, and getting Ollama’s partners to contribute more directly the community - the GGML tensor library.
And from clicking through a github link they had:
https://github.com/ollama/ollama/blob/main/model/models/gemm...
My takeaway is, the GGML library (the thing that is the backbone for llama.cpp) must expose some FFI (foreign function interface) that can be invoked from Go, so in the ollama Go code, they can write their own implementations of model behavior (like Gemma 3) that just calls into the GGML magic. I think I have that right? I would have expected a detail like that to be front and center in the blog post.
We can see the weakness of this argument given it is unlikely any front-end is written in C, and then noting it is unlikely ~0 people contribute to llama.cpp.
"The best way to get to Stanford University from the Ferry Building in San Francisco depends on your preferences and budget. Here are a few options:
1. *By Car*: Take US-101 South to CA-85 South, then continue on CA-101 South."
CA 85 is significantly farther down 101 than Palo Alto.
What they cannot meaningfully do is write Go code that solves their problems and upstream those changes to llama.cpp.
The former requires they are comfortable writing C++, something perhaps not all Go devs are.
(it's also worth looking at the code linked for the model-specific impls, this isn't exactly 1000s of lines of complicated code. To wit, while they're working with Georgi...why not offer to help land it in llama.cpp?)