I'm not able to get my agentic system to use this model, though; it just says "I don't have the tools to do this". I tried modifying various agent prompts to explicitly say "use foo tool to do bar", without any luck yet. All of the ToolSpecs I use are properly annotated Pydantic objects, and every other model has figured out how to use these tools.
I find that on my M2 Mac that number is a rough approximation to how much memory the model needs (usually plus about 10%) - which matters because I want to know how much RAM I will have left for running other applications.
Anything below 20GB tends not to interfere with the other stuff I'm running too much. This model looks promising!
"Apple Intelligence" isn't it but it would be nice to know without churning through tests whether I should bother keeping around 2-3 models for specific tasks in ollama or if their performance is marginal there's a more stable all-rounder model.
This is still too much; a single 4090 costs $3k.
What a ripoff, considering that a 5090 with 32GB of VRAM also currently costs $3k ;)
(Source: I just received the one I ordered from Newegg a week ago for $2919. I used hotstocks.io to alert me that it was available, but I wasn’t super fast at clicking and still managed to get it. Things have cooled down a lot from the craziness of early February.)
I hope not. Mine was $1700 almost 2 years ago, and the 5090 is out now...
P.S. I am not a lawyer.
[0] - https://github.com/ggml-org/llama.cpp
[1] - https://lmstudio.ai/
I am hopeful that the prices will drop a bit more with Intel's recently announced Arc Pro B60 with 24GB VRAM, which unfortunately has only half the memory bandwidth of the RTX 3090.
Not sure why other hardware makers are so slow to catch up. Apple really was years ahead of the competition with the M1 Ultra with 800 GB/s memory bandwidth.
Also, Mistral has been killing it with their most recent models. I pay for Le Chat Pro; it's really good. Mistral Small is really good too. I'm also building a startup with Mistral integration.
I haven't tried it out yet but every model I've tested from Mistral has been towards the bottom of my benchmarks in a similar place to Llama.
Would be very surprised if the real life performance is anything like they're claiming.
Wouldn't mind some of my taxpayer money flowing towards Apache/MIT-licensed models.
Even if just to maintain a baseline alternative & keep everyone honest. Seems important that we don't have some large megacorps run away with this.
There's context length, but then, how does that relate to input length and output length? Should I just make the numbers match? 32k is 32k? Any pointers?
Just for ollama, see: https://github.com/ollama/ollama/blob/main/docs/faq.md#how-c...
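If I'm reading that FAQ right, the usual route is a Modelfile that overrides the context window (the tag name here is just an example, and I haven't tested this myself):

```
# Modelfile: raise the context window for a custom tag
FROM devstral
PARAMETER num_ctx 32768
```

then `ollama create devstral-32k -f Modelfile`. There's also a per-request `num_ctx` option in the API's `options` object.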
I’m using llama.cpp though, so I can’t confirm these methods.
Interesting. I've never heard this.
To determine how much space a model needs, look at the size of the quantized (lower-precision) model on HuggingFace or wherever it's hosted. Q4_K_M is a good default. As a rough rule of thumb, the file will be a little over half the parameter count, read as gigabytes. For Devstral, that's 14.3GB. You will also need 1-8GB on top of that to store the context.
For example: A 32GB Macbook Air could use Devstral at 14.3+4GB, leaving ~14GB for the system and applications. A 16GB Macbook Air could use Gemma 3 12B at 7.3+2GB, leaving ~7GB for everything else. An 8GB Macbook could use Gemma 3 4B at 2.5GB+1GB, but this is probably not worth doing.
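That rule of thumb is easy to script. A tiny sketch, where the 0.6 GB-per-billion-parameters factor is my own reading of "a little over half" (not an official number), and the context budget is the 1-8GB range above:

```python
def estimate_ram_gb(params_billions: float, context_gb: float = 2.0) -> float:
    """Rough RAM estimate for a Q4_K_M quant: ~0.6 GB per billion
    parameters for the weights, plus headroom for the context."""
    return params_billions * 0.6 + context_gb

# Devstral (~24B params) with a generous 4 GB context budget:
print(round(estimate_ram_gb(24, 4), 1))  # → 18.4
```

On a 32GB machine that leaves roughly 14GB for everything else, matching the numbers above.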
My general impression so far is that they aren't quite up to Claude 3.7 Sonnet, but they're quite good. More than adequate for an "AI pair coding assistant", and suitable for larger architectural work as long as you break things into steps for it.
I’ve been using Cursor and I’m kind of disappointed. I get better results just going back and forth between the editor and ChatGPT
I tried localforge and aider, but they are kinda slow with local models
Try hooking aider up to gemini and see how the speed is. I have noticed that people in the localllama scene do not like to talk about their TPS.
It's kind-of like asking, for which kind of road-trip would you use a Corolla hatchback instead of a Jeep Grand Wagoneer? For me the answer would be "almost all of them", but for others that might not be the case.
However, I've also run into two things: 1) most models don't support tools, and it's sometimes hard to find a version of a model that correctly uses them; 2) even with good TPS, since the agents are usually doing chain-of-thought and running multiple chained prompts, the experience feels slow. This is true even with Cursor using their models/APIs.
But do we need 20 companies copying each other and doing the same thing?
Like, is that really competition? I'd say competition is when you do something slightly different, but I guess it's subjective based on your interpretation of what is a commodity and what is proprietary.
To my view, everyone is outright copying and creating commodity markets:
OpenAI: The OG, the Coke of Modern AI
Claude: The first copycat, The Pepsi of Modern AI
Mistral: Euro OpenAI
DeepSeek: Chinese OpenAI
Grok/xAI: Republican OpenAI
Google/MSFT: OpenAI clone as a SaaS or Office package.
Meta's Llama: Open Source OpenAI
etc...
total duration:       35.016288581s
load duration:        21.790458ms
prompt eval count:    1244 token(s)
prompt eval duration: 1.042544115s
prompt eval rate:     1193.23 tokens/s
eval count:           213 token(s)
eval duration:        33.94778571s
eval rate:            6.27 tokens/s

total duration:       4m44.951335984s
load duration:        20.528603ms
prompt eval count:    1502 token(s)
prompt eval duration: 773.712908ms
prompt eval rate:     1941.29 tokens/s
eval count:           1644 token(s)
eval duration:        4m44.137923862s
eval rate:            5.79 tokens/s
All I'm saying is that, compared to an API call that finishes in about 20% of the time, it feels a bit slow without the recommended graphics card and whatnot.
In terms of benchmarks, it seems unusually well tuned for its model size, but I suspect that's just a case of gaming the measurement by testing against the benchmarks during development. That isn't bad in and of itself; I suspect every LLM vendor in this space marketing to IT folks does the same thing. So it's still objective enough as a rough gauge of "is this usable?" without a heavy time investment in testing.
Qwen3 is a step backwards for me, for example. And GLM4 is my current go-to, despite everyone saying it's "only good at HTML".
The 70b cogito model is also really good for me but doesn't get any attention.
I think it depends on the projects and languages we're using.
Still looking forward to trying this one though :)
Some AIs will be good at coding (perhaps in a particular language or ecosystem), some at analyzing information and churning out a report for you, and some will be better at operating in physical spaces.
There is no single "best" model yet, it seems.
That's on an M4 Max with 64GB of RAM. I wish I had gotten the 128GB model, though — given that I run large docker containers that consume ~24GB of my RAM, things can get tight.
The same page also gives instructions for running the model through vLLM on a GPU, but that path doesn't seem to support quantization, so it may require multiple GPUs (the instructions say "with at least 2 GPUs").
For local LLMs Apple Silicon has really shown the value of shared memory, even if that comes at the cost of raw GPU power. Even if it's half the speed of an array of GPUs, being able to load the mid-sized models at all is a huge plus.
And ollama keeps taking it out of memory every 4 minutes.
LM studio with MLX on Mac is performing perfectly and I can keep it in my ram indefinitely.
Ollama's keep-alive is broken: a new REST API call resets it. I'm surprised it's this glitchy with longer-running calls and a custom context length.
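For reference, these are the knobs that are supposed to control it, per the ollama FAQ (model name is just an example); in my experience the per-request setting still gets clobbered by subsequent calls:

```
# Server-wide default: keep models loaded indefinitely
OLLAMA_KEEP_ALIVE=-1 ollama serve

# Per-request: pin this model in memory
curl http://localhost:11434/api/generate \
  -d '{"model": "devstral", "prompt": "hi", "keep_alive": -1}'
```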
LM studio MLX with full 128k context.
It works well but has a long 1 minute initial prompt processing time.
I wouldn’t buy a laptop for this, I would wait for the new AMD 32gb gpu coming out.
If you want a laptop I even consider my m4 max too slow to use more than just here or there.
It melts if you run this, and the battery drains ASAP. You have to use it docked for full speed, really.
What're you using for this? llama.cpp? Have a 12GB card (rtx 4070) i'd like to try it on.
I believe it's just an HTTP wrapper and terminal wrapper around llama.cpp, with some modifications (or a fork).
That's obviously not true. Ethics often have some nuance and some subjectiveness, but it's not something entirely subjective up to "politics".
Saying this makes it sound like you work at a startup for an AI-powered armed drone, and your view of it is 'eh, ethics is subjective, this is fine' when asked how you feel about responsibility and AI killing people.
Ethics are entirely subjective, as is inherently true of anything that supports "should" statements: to justify any "should" statement, you need another "should" statement; you can never rest a "should" entirely on an "is". (You can, potentially, rest an entire system of "should" on one root "should" axiom, though in practice most systems have more than one root axiom.)
And the process of coming to social consensus on a system of ethics is precisely politics.
You can dislike that this is true, but it is true.
> Saying this makes it sound like you work at a startup for an AI-powered armed drone, and your view of it is 'eh, ethics is subjective, this is fine' when asked how you feel about responsibility and AI killing people.
Understanding that ethics is subjective does not mean that one does not have a strong ethical framework that they adhere to. It just means that one understands the fundamental nature of ethics and the kind of propositions that ethical propositions inherently are.
Understanding that ethics are subjective does not, in other words, imply the belief that all beliefs about ethics (or, a fortiori, matters that are inherently subjective more generally) are of equal moral/ethical merit.
I tested this model with several of my Clojure problems and it is significantly worse than qwen3:30b-a3b-q4_K_M.
I don't know what to make of this. I don't trust benchmarks much anymore.
https://www.reddit.com/r/ollama/comments/1df757o/high_cost_o...
https://github.com/ollama/ollama/issues/8291
Yes.
AFAICT they usually set the default tag to a version around 15GB.
Early reports from reddit say that it also works in cline, while other stronger coding models had issues (they were fine-tuned more towards a step-by-step chat with a user). I think this distinction is important to consider when testing.
I am currently using this model on a Macbook with 16GB of RAM. It is hooked up to a Chrome extension that extracts text from webpages and logs it to a file, then summarizes each page. I want to develop an episodic memory system, like MS Recall, but local, so it does not leak my data to anyone else and costs me nothing.
Gemma 3 4B runs under ollama and is light enough that I don't feel it while browsing. Summarization happens in the background. This page I am on is already logged and summarized.
Good luck to you mate with your life :)
It works, but the tokens-per-second rate is very low. It did complete a TypeScript task example succinctly.
As an AI and vibe coding newbie, how does that work? E.g. how would I use devstral and ollama and instruct it to use tools? Or would I need some other program as well?
Maybe it's specialized to use just a few very specific tools? Is there some documentation on how to actually set it up without requiring some weird external platform?
Most of this is handled very easily by the ollama-python library, so you can integrate tool calling very simply in any script.
That said, this specific model was unable to call the functions and use the results in my "hello world" tests, so it seems it expects a few very specialized tools to be provided, which are defined by that platform they're advertising.
Right now the best tool calling model I've used is still qwen3, it works very reliably, and I can give it any ability I want and it'll use it when expected, even in /no_think mode.
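For the newbie question above: here's a minimal sketch of what that looks like with the ollama Python library. The `count_words` tool is a toy example of mine, and this assumes a local ollama server with a tool-capable model (like qwen3) pulled:

```python
def count_words(text: str) -> int:
    """Toy tool the model can choose to call."""
    return len(text.split())

# JSON-schema style tool spec, in the shape the Ollama chat API expects.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "count_words",
        "description": "Count the words in a piece of text.",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    },
}]

def ask(prompt: str) -> None:
    import ollama  # pip install ollama; needs a running ollama server
    resp = ollama.chat(
        model="qwen3",
        messages=[{"role": "user", "content": prompt}],
        tools=TOOLS,
    )
    # Execute whatever tool calls the model asked for.
    for call in resp.message.tool_calls or []:
        if call.function.name == "count_words":
            print(count_words(**call.function.arguments))

# ask("How many words are in 'so long and thanks for all the fish'?")
```

The agent frameworks mostly do this same loop for you: send tool schemas, read back `tool_calls`, run them, and feed the results into a follow-up message.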
People also use 3.5.1 to refer to 3.5(new)/3.6.
The remaining difficulty now is when people refer to 3.5, without specifying (new) or (old). I find most unspecified references to 3.5 these days are actually to 3.6 / 3.5.1 / 3.5(new), which is confusing.
Mistral's positioning as the European alternative doesn't seem to be sticking. Acquisition seems tricky, given how Inflection, Character.ai, and Stability got carved out. The big acquisition bucks are going to product companies (Windsurf).
They could pivot up the stack, but then they'd be starting from scratch with a team that's ill-suited for product development.
The base model offerings from pretraining companies have been surprisingly myopic. Deepmind seems to be the only one going past the obvious "content gen/coding automation" verticals. There's a whole world out there. LLM product companies are fast acquiring pieces of the real money pie and smaller pretraining companies are getting left out.
______
edit: my comment rose to the top. It's early in the morning. Injecting a splash of optimism.
LLMs are hard, and even giants like Meta are struggling to make steady progress. Mistral's models are cheap, competent, open-source-ish, and don't come with AGI-is-imminent baggage. Good enough for me.
To my own question: They have a list of target industries at the top. https://mistral.ai/solutions#industry
Good luck to them.
Is it always wrong to kill people? If you say yes, then you are also saying it's wrong to defend yourself from people who are trying to kill you.
This is what I mean by subjective.
And then since Google is beholden to US laws, if the US government suddenly decides that helping Ukraine to defend itself is wrong, but you personally believe defending Ukraine is right, suddenly you have a problem...
Separately, deploying more-autonomous agents that just look at an issue and run with it seems premature right now. We've only just gotten assisted flows kind-of working, and they still get lost (stuck on not-hard-for-a-human debugging tasks, implementing Potemkin 'fixes', forgetting their tools, making unrelated changes that sometimes break stuff, etc.) in ways that imply that flow isn't fully baked yet.
Maybe the main appeal is asynchrony/potential parallelism? You could tackle that different ways, though. And SWEBench might be a good benchmark still (focus on where you want to be, even if you aren't there yet), but that doesn't mean it represents the most practical way to use these tools day-to-day currently.
Edit: I should point out that I had many other things open at the time. Mail, Safari, Messages, and more. I imagine startup would be quicker otherwise but it does mean you can run with less than 32GB.
Which model is optimized to do that? This is what I want out of LLMs! And also talking high level architecture (without any code) and library discovery, but I guess the general talking models are good for that...
"Take items from `input-ch` and group them into `batch-size` vectors. Put these onto `output-ch`. Once items
start arriving, if `batch-size` items do not arrive within `inactivity-timeout`, put the current incomplete
batch onto `output-ch`. If an anomaly is received, passes it on to `output-ch` and closes all channels. If
`input-ch` is closed, closes `output-ch`.
If `flush-predicate-fn` is provided, it will get called with two parameters: the currently accumulated
batch (guaranteed to have at least one item) and the next item. If the function returns a truthy value, the
batch will get flushed immediately.
If `convert-batch-fn` is provided, it will get called with the currently accumulated batch (guaranteed to
have at least one item) and its return value will be put onto `output-ch`. Anomalies bypass
`convert-batch-fn` and get put directly onto `output-ch` (which gets closed immediately afterwards)."
In other words, not obvious. I ask the model to review the code and tell me if there are improvements that can be made. Big (online) models can do a pretty good job with the floating-point equality function, and suggest something at least in the ballpark for the async code. Small models rarely get everything right, but some of their observations are good.
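For anyone curious what that docstring is describing: the core behaviour (minus anomalies and the optional predicate/convert fns) fits in a few lines of plain Python with queues. `None` stands in for channel close, and all names here are mine, not from the original code:

```python
import queue

def batcher(input_q, output_q, batch_size, inactivity_timeout):
    """Group items from input_q into lists of batch_size and put them on
    output_q. Flush a partial batch after inactivity_timeout seconds of
    silence; a None item means input closed, so flush and close output."""
    batch = []
    while True:
        try:
            # Only start the inactivity clock once a batch has begun.
            item = input_q.get(timeout=inactivity_timeout if batch else None)
        except queue.Empty:
            output_q.put(batch)   # inactivity: flush the partial batch
            batch = []
            continue
        if item is None:          # input channel closed
            if batch:
                output_q.put(batch)
            output_q.put(None)    # close the output channel
            return
        batch.append(item)
        if len(batch) == batch_size:
            output_q.put(batch)
            batch = []
```

The core.async version has to juggle the same three wake-up sources (new item, timeout, closed channel), which is exactly the part the small models tend to fumble.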
Don't they supposedly have to have the item in Amazon's warehouse to sell it?