
347 points | kashifr | 2 comments
gardnr No.44501814
It's small (3B) and does great on benchmarks. This is a model for edge / mobile deployments, so the gains over gemma3-4b are meaningful. It has dual-mode reasoning / non-reasoning AND they released the full training method:

> We're releasing SmolLM3 with our engineering blueprint. It includes architecture details, exact data mixtures showing how we progressively boost performance across domains in a three-stage pretraining approach, and the methodology for building a hybrid reasoning model. Usually, achieving these results would require months of reverse engineering. Instead, we're providing the full methodology.

replies(1): >>44509990 #
1. sigmoid10 No.44509990
I hate to say it, but reasoning models simply aren't suited for edge computing. I just ran some tests on this model, and even at 4-bit weight quantisation it blows past 10GB of VRAM with just ~1000 tokens while it is still reasoning. So even if you're running on a dedicated ML edge device like a $250 Jetson, you will run out of memory before the model even formulates a real answer. You'll need a high-end GPU to make full use of it for limited answers, and an enterprise-grade system to support longer contexts. And with reasoning turned off, I don't see any meaningful improvement over older models.

So this is primarily great for enterprises who want to do on-prem with limited budgets and maybe high-end enthusiasts.
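A back-of-envelope way to see how memory grows as a model "thinks" is to estimate the KV cache, which scales linearly with generated tokens. The shape numbers below (36 layers, 4 KV heads, head_dim 128) are illustrative assumptions for a 3B-class GQA model, not SmolLM3's published config; note this counts only the cache, not weights or activations:

```python
def kv_cache_mib(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # One K and one V vector per layer per token, hence the factor of 2.
    # With GQA, n_kv_heads (not the full attention head count) sets cache size.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens / 2**20

# Assumed 3B-class shape: 36 layers, 4 KV heads, head_dim 128, fp16 cache
for n in (1_000, 10_000, 64_000):
    print(f"{n:>6} tokens -> {kv_cache_mib(n, 36, 4, 128, 2):7.1f} MiB KV cache")
```

Under these assumptions, the cache itself stays in the tens-to-hundreds-of-MiB range for a few thousand reasoning tokens; runaway VRAM use usually comes from the cache plus weights, activations, and allocator overhead together.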

replies(1): >>44510494 #
2. wizee No.44510494
You should use flash attention with KV cache quantization. I routinely use Qwen 3 14B with the full 128k context and it fits in under 24 GB VRAM. On my Pixel 8, I've successfully used Qwen 3 4B with 8K context (again with flash attention and KV cache quantization).
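As a sketch of why cache quantization matters at long context: halving bytes-per-element halves the KV cache. The model shape here (40 layers, 8 KV heads, head_dim 128) is an assumption roughly in the ballpark of a 14B GQA model, not taken from Qwen 3's actual config:

```python
def kv_cache_gib(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # Factor of 2: one K and one V vector per layer per token (GQA head count).
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens / 2**30

CTX = 131_072  # 128k context
# Assumed 14B-class shape: 40 layers, 8 KV heads (GQA), head_dim 128
fp16 = kv_cache_gib(CTX, 40, 8, 128, 2)  # 16-bit cache
q8 = kv_cache_gib(CTX, 40, 8, 128, 1)    # 8-bit quantized cache
print(f"fp16 KV: {fp16:.1f} GiB, q8 KV: {q8:.1f} GiB")
```

Under these assumptions the quantized cache saves about 10 GiB at full context, which is roughly the difference between fitting and not fitting alongside ~8 GiB of 4-bit weights on a 24 GB card.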