
343 points kashifr | 2 comments
simonw No.44505302
I'm having trouble running this on my Mac - I've tried Ollama and llama.cpp's llama-server so far, both using GGUFs from Hugging Face, but neither worked.

(llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'smollm3')

I've managed to run it using Python and transformers with PyTorch in device="cpu" mode, but unsurprisingly that's really slow: it took 35s to respond to "say hi"!
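
Roughly, that transformers/CPU route looks like this (assuming the HuggingFaceTB/SmolLM3-3B checkpoint; adjust the repo id to whichever one you pulled):

```python
# Minimal sketch of the transformers + PyTorch CPU route, assuming the
# HuggingFaceTB/SmolLM3-3B repo id and a transformers release new enough
# to know the smollm3 architecture.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to("cpu")

messages = [{"role": "user", "content": "say hi"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```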

Anyone had success with this on a Mac yet? I really want to get this running with tool calling, ideally via an OpenAI-compatible serving layer like llama-server.
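
For the tool-calling part, the idea would be to point the standard OpenAI client at a local llama-server once a working GGUF exists; a rough sketch (the port, model name, and tool schema below are placeholders):

```python
# Rough sketch of tool calling against an OpenAI-compatible local server
# such as llama-server (assumed to be running on http://localhost:8080/v1).
# The tool definition and model name are placeholders for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="smollm3",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```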

replies(2): >>44505665 #>>44507822 #
reach-vb No.44507822
Hey Simon, VB from Hugging Face here; I'm the person who added the model to MLX and llama.cpp (with Son). The PR hasn't landed in llama.cpp yet, hence it doesn't work out of the box with llama.cpp installed via brew (and similarly it doesn't work with Ollama, since they need to bump their llama.cpp runtime).

The easiest fix would be to build llama.cpp from source: https://github.com/ggml-org/llama.cpp

If you want to avoid that, I added SmolLM3 to MLX-LM as well:

You can run it via `mlx_lm.chat --model "mlx-community/SmolLM3-3B-bf16"`

(requires the latest mlx-lm to be installed)
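
If you'd rather script it than use the chat CLI, the mlx-lm Python API works too; something like this should do it with a recent mlx-lm (exact generate() kwargs can vary between versions):

```python
# Sketch of the same model via the mlx-lm Python API instead of the mlx_lm.chat CLI.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/SmolLM3-3B-bf16")
print(generate(model, tokenizer, prompt="Say hi in one sentence.", max_tokens=64))
```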

Here's the MLX-LM PR if you're interested: https://github.com/ml-explore/mlx-lm/pull/272

Similarly, the llama.cpp PR is here: https://github.com/ggml-org/llama.cpp/pull/14581

Let me know if you face any issues!

replies(1): >>44508731 #
1. kosolam No.44508731
Could you please enlighten me regarding all these engines? I'm using llama.cpp and Ollama. Should I also try MLX, ONNX, vLLM, etc.? I'm not quite sure what the difference is between all of these. I'm running on CPU and sometimes GPU.
replies(1): >>44510843 #
2. pzo No.44510843
Ollama is a wrapper around llama.cpp; both use the GGML/GGUF format. ONNX is a different ML model format, with ONNX Runtime developed by Microsoft. MLX is an ML framework from Apple. If you want the fastest speed on macOS, most likely stick with MLX.