ETH Zurich and EPFL to release an LLM developed on public infrastructure

1. isusmelj ◴[11 Jul 25 20:30 UTC] No.44536509[source]▶

I hope they do well. AFAIK they’re training or finetuning an older LLaMA model, so performance might lag behind SOTA. But what really matters is that ETH and EPFL get hands-on experience training at scale. From what I’ve heard, the new AI cluster still has teething problems. A lot of people underestimate how tough it is to train models at this scale, especially on your own infra.

Disclaimer: I’m Swiss and studied at ETH. We’ve got the brainpower, but not much large-scale training experience yet. And IMHO, a lot of the “magic” in LLMs is infrastructure-driven.

replies(5): >>44536696 #>>44536809 #>>44537201 #>>44539869 #>>44541746 #

2. luke-stanley ◴[11 Jul 25 20:53 UTC] No.44536696[source]▶

>>44536509 (TP) #

When I read "from scratch", I assume they are doing pre-training, not just finetuning, do you have a different take? Do you mean it's normal Llama architecture they're using? I'm curious about the benchmarks!

3. andy99 ◴[11 Jul 25 21:09 UTC] No.44536809[source]▶

>>44536509 (TP) #

Imo, a lot of the magic is also dataset driven, specifically the SFT and other fine tuning / RLHF data they have. That's what has separated the models people actually use from the also-rans.

I agree with everything you say about getting the experience, the infrastructure is very important and is probably the most critical part of a sovereign LLM supply chain. I would hope there will also be enough focus on the data, early on, that the model will be useful.

4. alfalfasprout ◴[11 Jul 25 22:01 UTC] No.44537201[source]▶

>>44536509 (TP) #

The infra does become pretty complex to get a SOTA LLM trained. People assume it's as simple as loading up the architecture and a dataset + using something like Ray. There's a lot that goes into designing the dataset, the eval pipelines, the training approach, maximizing the use of your hardware, dealing with cross-node latency, recovering from errors, etc.

But it's good to have more and more players in this space.
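To make the "recovering from errors" point concrete, here is a minimal sketch of checkpoint-and-resume, assuming a toy single-process loop. The names `save_checkpoint`, `load_checkpoint`, `train`, and the `fail_at` parameter are hypothetical, and the "training step" is a stand-in sum, not a real optimizer update. Real multi-node runs also have to handle sharded optimizer state, data-loader position, and RNG state.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically rename it over `path`.

    A crash mid-write then leaves the previous checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "loss_sum": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every, fail_at=None):
    """Toy loop that resumes from the last checkpoint.

    `fail_at` (hypothetical) simulates a node failure at a given step.
    """
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        state["loss_sum"] += 1.0 / (step + 1)  # stand-in for a training step
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Work done after the last checkpoint is lost and redone on resume, so the checkpoint interval trades I/O cost against wasted compute.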

5. lllllm ◴[12 Jul 25 06:39 UTC] No.44539869[source]▶

>>44536509 (TP) #

No, the model has nothing to do with Llama. We are using our own architecture, and training from scratch. Llama also does not have open training data, and is non-compliant, in contrast to this model.

Source: I'm part of the training team

replies(6): >>44539877 #>>44540067 #>>44540272 #>>44540736 #>>44540850 #>>44540873 #

6. macawfish ◴[12 Jul 25 06:41 UTC] No.44539877[source]▶

>>44539869 #

Are you using dbpedia?

replies(1): >>44539987 #

7. lllllm ◴[12 Jul 25 07:13 UTC] No.44539987{3}[source]▶

>>44539877 #

no. the main source is fineweb2, but with additional filtering for compliance, toxicity removal, and quality filters such as fineweb2-hq
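A rough sketch of what chaining such filters could look like, purely for illustration. The `Doc` fields, the score thresholds, and the filter names are all hypothetical and are not taken from the fineweb2 or fineweb2-hq pipelines; in practice the scores would come from trained classifiers and the opt-out flag from crawl metadata.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass
class Doc:
    url: str
    text: str
    opted_out: bool   # hypothetical: site asked not to be used for training
    toxicity: float   # hypothetical classifier score in [0, 1]
    quality: float    # hypothetical classifier score in [0, 1]


Filter = Callable[[Doc], bool]


def compliant(doc: Doc) -> bool:
    """Keep only documents whose source did not opt out."""
    return not doc.opted_out


def non_toxic(doc: Doc, threshold: float = 0.5) -> bool:
    """Keep documents scoring below an assumed toxicity threshold."""
    return doc.toxicity < threshold


def high_quality(doc: Doc, threshold: float = 0.7) -> bool:
    """Keep documents scoring at or above an assumed quality threshold."""
    return doc.quality >= threshold


def apply_filters(docs: Iterable[Doc], filters: List[Filter]) -> Iterator[Doc]:
    """Yield only the documents that pass every filter, in input order."""
    for doc in docs:
        if all(f(doc) for f in filters):
            yield doc
```

Usage would be along the lines of `list(apply_filters(docs, [compliant, non_toxic, high_quality]))`. Keeping each filter a separate predicate makes it easy to apply the quality filter to only some languages.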

replies(1): >>44540171 #

8. danielhanchen ◴[12 Jul 25 07:31 UTC] No.44540067[source]▶

>>44539869 #

If you guys need help on GGUFs + Unsloth dynamic quants + finetuning support via Unsloth https://github.com/unslothai/unsloth on day 0 / 1, more than happy to help :)

replies(1): >>44540233 #

9. PeterStuer ◴[12 Jul 25 07:54 UTC] No.44540171{4}[source]▶

>>44539987 #

Thx for engaging here.

Can you comment on how the filtering impacted language coverage? E.g. fineweb2 has 1800+ languages, but some with very little actual representation, while fineweb2-hq has just 20 but each with a substantial data set.

(I'm personally most interested in covering the 24 official EU languages)

replies(1): >>44540219 #

10. lllllm ◴[12 Jul 25 08:06 UTC] No.44540219{5}[source]▶

>>44540171 #

we kept all 1800+ (script/language) pairs, not only the quality filtered ones. whether a mix of quality-filtered and unfiltered languages impacts the mixing is still an open question. preliminary research (Section 4.2.7 of https://arxiv.org/abs/2502.10361 ) indicates that quality filtering can mitigate the curse of multilinguality to some degree, and so facilitate cross-lingual generalization, but it remains to be seen how strong this effect is at larger scale

11. lllllm ◴[12 Jul 25 08:11 UTC] No.44540233{3}[source]▶

>>44540067 #

absolutely! i sent you a linkedin message last week. but here seems to work much better, thanks a lot!

12. isusmelj ◴[12 Jul 25 08:21 UTC] No.44540272[source]▶

>>44539869 #

Thanks for clarifying! I wish you all the best luck!

13. Al-Khwarizmi ◴[12 Jul 25 09:50 UTC] No.44540736[source]▶

>>44539869 #

So you're not going to use copyrighted data for training? That's going to be a disadvantage with respect to LLaMa and other well-known models; it's an open secret that everyone is using everything they can get their hands on.

Good luck though, very needed project!

replies(1): >>44540875 #

14. moffkalast ◴[12 Jul 25 10:09 UTC] No.44540850[source]▶

>>44539869 #

L3 has open pretraining data, it's just not official for obvious legal reasons: https://huggingface.co/datasets/HuggingFaceFW/fineweb

15. d3m0t3p ◴[12 Jul 25 10:13 UTC] No.44540873[source]▶

>>44539869 #

Hey, really cool project, I’m excited to see the outcome. Is there a blog / paper summarizing how you are doing it? Also, which research group is currently working on it at ETH?

16. badsectoracula ◴[12 Jul 25 10:13 UTC] No.44540875{3}[source]▶

>>44540736 #

Not sure about the Swiss laws, but the EU AI Act, and the 2019/790 Digital Single Market directive it piggybacks on for this topic, do allow for training on copyrighted data as long as any opt-out mechanisms (e.g. robots.txt) are respected. AFAICT this LLM was trained by respecting those mechanisms (and as linked elsewhere they didn't find any practical difference in performance - note that there is an exception to allow ignoring the opt-out mechanisms for research purposes, so they could make that comparison).
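For anyone wondering what respecting a robots.txt opt-out looks like in code, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleAIBot` user agent, and the `may_crawl` helper are made up for illustration; this is not the crawler or the check used for this model.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one AI crawler is opted out of the whole site,
# everyone else is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` is allowed to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "OtherBot"):
        for url in ("https://example.org/article",
                    "https://example.org/private/notes"):
            print(agent, url, may_crawl(ROBOTS_TXT, agent, url))
```

A compliance filter along these lines would drop any page whose site disallows the relevant crawler, which is the opt-out case described above.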

17. asjir ◴[12 Jul 25 12:58 UTC] No.44541746[source]▶

>>44536509 (TP) #

I'd be more concerned about the size used being 70b (deepseek r1 has 671b) which makes catching up with SOTA kinda more difficult to begin with.

replies(1): >>44541843 #

18. zettabomb ◴[12 Jul 25 13:12 UTC] No.44541843[source]▶

>>44541746 #

SOTA performance is relative to model size. If it performs better than other models in the 70B range (e.g. Llama 3.3) then it could be quite useful. Not everyone has the VRAM to run the full fat Deepseek R1.
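Some back-of-the-envelope arithmetic on the VRAM point, as a sketch. It counts only the memory needed to hold the weights and ignores the KV cache, activations, and runtime overhead, so real requirements are higher. The helper name `weight_memory_gb` and the chosen bit widths are just illustrative.

```python
def weight_memory_gb(n_params_billion: float, bits_per_param: float) -> float:
    """Memory in GB (10^9 bytes) needed just to store the weights."""
    total_bytes = n_params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for name, size_b in (("70B model", 70), ("671B model", 671)):
        for bits in (16, 8, 4):
            gb = weight_memory_gb(size_b, bits)
            print(f"{name} at {bits}-bit: {gb:.1f} GB")
```

At 16 bits per parameter that is 140 GB for 70B versus 1342 GB for 671B; at 4 bits it is 35 GB versus 335.5 GB. So a quantized 70B model is within reach of a small multi-GPU box, while the 671B model is not.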