I've tried gemma3:4b, which scores better in that benchmark, and found it quite disappointing. It breaks a lot, is even worse at code than qwen2.5-coder:7b and incept5/llama3.1-claude:7b, and needs to be tricked or threatened into saying anything about many everyday topics. It also commonly chugs away for minutes, exercising the GPU fans, before responding, by which point I'm already ahead because I've figured out another way to solve my problem or find the information.
My experience with phi4-mini and granite3.3 was about the same, and they annoy me even more when I hook them into code editors and try to get them to contribute to my work. Partly because they're slow, but mostly because at best they suggest adding unnecessary error handling in the style of null checks everywhere, and at worst they start mixing up or outright hallucinating programming languages. Right where they would be useful as leverage if they worked, i.e. close to the edge of where I can debug and refactor without getting stuck, they go into straight nonsense mode, especially on terse first-pass code.
Sometimes I've queried these things for descriptions of recent history in foreign countries, Wikipedia trivia basically, and they're very often wrong in subtle ways. For example, a politician might have been active for half a century in a troubled country, and because they were ousted in a coup once in the eighties, the model is absolutely sure they can't have held office since.
If a person acted the way these things do, I'd wish them immediate institutional care. Maybe the problem is somehow with me, but I have a deep suspicion it's not.