SmolLM3: Smol, multilingual, long-context reasoner LLM

1. _1 ◴[08 Jul 25 17:09 UTC] No.44501951[source]▶

Which small model is good for fine tuning to various enterprise data sets? Our business units are wanting to run small models in browser and on mobile devices, without dealing with RAG and cloud resources.

replies(5): >>44502175 #>>44502283 #>>44502496 #>>44502868 #>>44508851 #

2. mhitza ◴[08 Jul 25 17:31 UTC] No.44502175[source]▶

>>44501951 (TP) #

You really need to try them all out yourself and make sure you have proper benchmarks.

While machine learning is not my field, I tried to fine-tune Mistral 7B (following their official guide and toolset) and the results were not satisfactory. There were a few very specific questions from the dataset that it could not answer correctly, no matter how much I fine-tuned and tweaked the process.

A mix of vector search + keyword search is still better at building the right question context than expecting it to learn all the information.
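One common way to combine a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below is an illustration rather than the setup described above: the document IDs and the constant `k=60` are assumptions, and the two input rankings are presumed to come from whatever vector and keyword search you already run.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list.

    Each doc gets 1 / (k + rank) from every ranking it appears in;
    k=60 is a conventional default, assumed here rather than tuned.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings from a vector index and a keyword index.
vector_ranking = ["a", "b", "c"]
keyword_ranking = ["b", "c", "a"]
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
```

RRF only needs rank positions, so the two retrievers do not have to produce comparable scores.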

I've used the pretrained dataset approach. Maybe building synthetic questions and answers around the dataset yields better results, but I didn't have time to experiment with that approach.

replies(2): >>44503664 #>>44505274 #

3. gardnr ◴[08 Jul 25 17:43 UTC] No.44502283[source]▶

>>44501951 (TP) #

Small models are bad at knowing things. Trying to train knowledge into small models is probably not the way you want to go. You could try building an offline embedded RAG system that is deployable as wasm. Some folks have had success with this.

replies(1): >>44502398 #

4. _1 ◴[08 Jul 25 17:56 UTC] No.44502398[source]▶

>>44502283 #

We do use WebLLM and a hosted Weaviate database, but there are complaints about speed (both retrieval and time to first token, as the context gets big). The Gemma 3n "nesting doll" approach sounds like it could be useful, but I haven't found anyone specifically using it to add domain-specific knowledge.

replies(1): >>44502867 #

5. simonw ◴[08 Jul 25 18:07 UTC] No.44502496[source]▶

>>44501951 (TP) #

What are you hoping to achieve by fine-tuning a model in this way?

6. janalsncm ◴[08 Jul 25 18:50 UTC] No.44502867{3}[source]▶

>>44502398 #

Typically retrieval is the fast part in my experience. Have you considered cheaper retrieval methods? BM25 does pretty well on its own. And you can augment your dataset by precomputing relevant queries for each doc.
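For a sense of how cheap this is, here is a minimal BM25 sketch in plain Python. It makes several assumptions: a naive whitespace tokenizer, the common defaults `k1=1.5` and `b=0.75`, a Lucene-style smoothed IDF, and a non-empty corpus. `expand_docs` is a hypothetical helper showing one reading of "precomputing relevant queries": append queries written ahead of time to each doc before indexing.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace split; real systems also strip punctuation.
    return text.lower().split()


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        tokens = [tokenize(d) for d in docs]
        self.tf = [Counter(t) for t in tokens]      # term counts per doc
        self.doc_len = [len(t) for t in tokens]
        self.avgdl = sum(self.doc_len) / len(docs)  # assumes docs is non-empty
        df = Counter()                              # document frequency per term
        for t in tokens:
            df.update(set(t))
        n = len(docs)
        self.idf = {term: math.log(1 + (n - f + 0.5) / (f + 0.5))
                    for term, f in df.items()}

    def score(self, query, i):
        total = 0.0
        for term in tokenize(query):
            f = self.tf[i].get(term, 0)
            if f == 0:
                continue
            length_norm = 1 - self.b + self.b * self.doc_len[i] / self.avgdl
            total += self.idf[term] * f * (self.k1 + 1) / (f + self.k1 * length_norm)
        return total

    def rank(self, query):
        # Doc indices, best match first; ties keep the original order.
        scored = [(self.score(query, i), i) for i in range(len(self.tf))]
        return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]


def expand_docs(docs, queries_per_doc):
    # Hypothetical helper: append precomputed queries to each doc's text.
    return [d + " " + " ".join(qs) for d, qs in zip(docs, queries_per_doc)]
```

With expansion, a query such as "where is my bill" can match a doc that only mentions invoices, provided a similar query was precomputed for that doc.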

7. netdur ◴[08 Jul 25 18:50 UTC] No.44502868[source]▶

>>44501951 (TP) #

I have fine-tuned Gemma 3N 2B and it's pretty good, but it loads slowly on my S23U. Once it's loaded, though, it works fine.

I also tried SmolVLM 256M and 500M. They load faster and you can embed them in assets; they work if you know what you're doing.

Just keep in mind that smaller models don't perform as well due to their limited parameters

Also, on Android you can't ship files larger than 2 GB due to Java compression issues, so you need to download models separately. You then can't load the model from the download folder; you have to copy it into the app's own folder. This means a Gemma 3N 2B model that's 3.14 GB would need at least 7 GB of free space on the user's phone.
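The 7 GB figure follows from the downloaded file and the app-private copy existing at the same time. A quick check of the arithmetic, assuming exactly two copies and no other overhead (the function name is made up for illustration):

```python
import math


def peak_storage_gb(model_gb, copies=2):
    # Download-folder copy plus app-folder copy coexist during the copy step.
    return model_gb * copies


peak = peak_storage_gb(3.14)   # about 6.28 GB
rounded = math.ceil(peak)      # rounded up to whole gigabytes
```

If the app deletes the downloaded file after copying, the steady-state footprint falls back to a single copy.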

8. ivape ◴[08 Jul 25 20:16 UTC] No.44503664[source]▶

>>44502175 #

How much data did you use to fine tune?

replies(1): >>44503815 #

9. mhitza ◴[08 Jul 25 20:34 UTC] No.44503815{3}[source]▶

>>44503664 #

Kilobytes to megabytes of data. I was trying to fine-tune it on some specific legislation that I expected to be able to ask about afterwards.

10. magicalhippo ◴[09 Jul 25 00:35 UTC] No.44505274[source]▶

>>44502175 #

> Maybe building synthetic questions and answers around the dataset yields better results but I didn't have time to experiment with that approach.

While they answer a slightly different question in the Physics of Language Models[1], based on their results it seems to me it is likely that one needs to do such augmentation of the dataset to get good results.

However, they also show that the dataset the base model is trained on can drastically affect finetuning performance. So if the base model is trained on a poor dataset for your specific task, perhaps you'll never get good performance.

[1]: https://physics.allen-zhu.com/part-3-knowledge/part-3-1
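As a rough illustration of what such augmentation could look like, the sketch below expands each fact into several question/answer phrasings using templates. This is an assumption-laden toy, not the method from the linked work: the templates, the `(subject, attribute, value)` fact format, and the example fact are all made up, and in practice people often generate paraphrases with a larger model instead.

```python
import itertools

# Hypothetical templates; real augmentation would use far more variety.
QUESTION_TEMPLATES = [
    "What is the {attribute} of {subject}?",
    "Tell me the {attribute} of {subject}.",
    "{subject}: what is its {attribute}?",
]
ANSWER_TEMPLATES = [
    "The {attribute} of {subject} is {value}.",
    "{value}.",
]


def augment(facts):
    """Turn (subject, attribute, value) facts into many Q&A training pairs."""
    pairs = []
    for subject, attribute, value in facts:
        for q, a in itertools.product(QUESTION_TEMPLATES, ANSWER_TEMPLATES):
            pairs.append({
                "question": q.format(subject=subject, attribute=attribute),
                "answer": a.format(subject=subject, attribute=attribute, value=value),
            })
    return pairs


# Made-up fact for illustration only.
pairs = augment([("Example Act 12", "effective date", "1 January 2020")])
```

Each fact yields one pair per question and answer template combination, so the same piece of knowledge appears in several surface forms.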

11. thatjoeoverthr ◴[09 Jul 25 11:50 UTC] No.44508851[source]▶

>>44501951 (TP) #

Tuning is really not the way to add information.

Bite the bullet and do some kind of RAG; you need to provide clear, authoritative information to a model that is skilled enough to remix it for the user.

Tuning the model to imitate the dataset will damage the model's skills and "common sense" but won't train it to reliably recall information.