
524 points andy99 | 12 comments
isusmelj ◴[] No.44536509[source]
I hope they do well. AFAIK they’re training or finetuning an older LLaMA model, so performance might lag behind SOTA. But what really matters is that ETH and EPFL get hands-on experience training at scale. From what I’ve heard, the new AI cluster still has teething problems. A lot of people underestimate how tough it is to train models at this scale, especially on your own infra.

Disclaimer: I’m Swiss and studied at ETH. We’ve got the brainpower, but not much large-scale training experience yet. And IMHO, a lot of the “magic” in LLMs is infrastructure-driven.

replies(5): >>44536696 #>>44536809 #>>44537201 #>>44539869 #>>44541746 #
1. lllllm ◴[] No.44539869[source]
No, the model has nothing to do with Llama. We are using our own architecture and training from scratch. Llama also does not have open training data and is non-compliant, in contrast to this model.

Source: I'm part of the training team

replies(6): >>44539877 #>>44540067 #>>44540272 #>>44540736 #>>44540850 #>>44540873 #
2. macawfish ◴[] No.44539877[source]
Are you using dbpedia?
replies(1): >>44539987 #
3. lllllm ◴[] No.44539987[source]
No, the main source is fineweb2, but with additional filtering for compliance, toxicity removal, and quality filters such as fineweb2-hq.
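For illustration only, here is a minimal sketch of what streaming fineweb2 from the Hugging Face Hub and applying simple quality/toxicity filters could look like. It is not the team's actual pipeline; the config name, blocklist, and length threshold are made-up assumptions.

```python
# Minimal sketch, NOT the actual training pipeline: stream one FineWeb-2
# (language, script) config and keep documents that pass simple filters.
# The config name, blocklist, and length threshold are illustrative assumptions.
from datasets import load_dataset

ds = load_dataset(
    "HuggingFaceFW/fineweb-2",  # FineWeb-2 on the Hugging Face Hub
    name="deu_Latn",            # one (language, script) config, e.g. German
    split="train",
    streaming=True,             # avoid downloading the whole corpus
)

BLOCKLIST = {"badword1", "badword2"}  # placeholder toxicity terms

def keep(example):
    text = example["text"].lower()
    long_enough = len(text.split()) >= 50              # crude quality heuristic
    non_toxic = not any(w in text for w in BLOCKLIST)  # crude toxicity filter
    return long_enough and non_toxic

filtered = ds.filter(keep)

for doc in filtered.take(3):
    print(doc["text"][:200])
```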
replies(1): >>44540171 #
4. danielhanchen ◴[] No.44540067[source]
If you guys need help on GGUFs + Unsloth dynamic quants + finetuning support via Unsloth https://github.com/unslothai/unsloth on day 0 / 1, more than happy to help :)
replies(1): >>44540233 #
5. PeterStuer ◴[] No.44540171{3}[source]
Thx for engaging here.

Can you comment on how the filtering impacted language coverage? E.g. fineweb2 has 1800+ languages, but some with very little actual representation, while fineweb2-hq has just 20, but each with a substantial data set.

(I'm personally most interested in covering the 24 official EU languages)

replies(1): >>44540219 #
6. lllllm ◴[] No.44540219{4}[source]
We kept all 1800+ (script, language) pairs, not only the quality-filtered ones. Whether a mix of quality-filtered and unfiltered languages affects the overall mixture is still an open question. Preliminary research (Section 4.2.7 of https://arxiv.org/abs/2502.10361) indicates that quality filtering can mitigate the curse of multilinguality to some degree and thus facilitate cross-lingual generalization, but it remains to be seen how strong this effect is at larger scale.
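To make the idea of mixing quality-filtered and unfiltered languages concrete, here is a hedged sketch using `datasets.interleave_datasets`. The Hub IDs, config names, and sampling probabilities are assumptions for illustration, not the project's actual data recipe.

```python
# Illustrative sketch only: interleave a quality-filtered subset for a
# high-resource language with raw FineWeb-2 for a lower-resource one.
# Hub IDs, config names, and probabilities are assumptions, not the real mixture.
from datasets import load_dataset, interleave_datasets

hq_english = load_dataset(
    "epfml/FineWeb2-HQ",        # assumed Hub ID of a quality-filtered subset
    name="eng_Latn",
    split="train",
    streaming=True,
)
raw_swahili = load_dataset(
    "HuggingFaceFW/fineweb-2",  # unfiltered FineWeb-2 config
    name="swh_Latn",
    split="train",
    streaming=True,
)

# Weighted sampling: most documents from the filtered high-resource language,
# the rest from the unfiltered low-resource one.
mixture = interleave_datasets(
    [hq_english, raw_swahili],
    probabilities=[0.8, 0.2],
    seed=42,
    stopping_strategy="all_exhausted",
)

for doc in mixture.take(3):
    print(doc["text"][:120])
```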
7. lllllm ◴[] No.44540233[source]
Absolutely! I sent you a LinkedIn message last week, but here seems to work much better. Thanks a lot!
8. isusmelj ◴[] No.44540272[source]
Thanks for clarifying! I wish you the best of luck!
9. Al-Khwarizmi ◴[] No.44540736[source]
So you're not going to use copyrighted data for training? That's going to be a disadvantage with respect to Llama and other well-known models; it's an open secret that everyone is using everything they can get their hands on.

Good luck though, very needed project!

replies(1): >>44540875 #
10. moffkalast ◴[] No.44540850[source]
L3 (Llama 3) has open pretraining data; it's just not official, for obvious legal reasons: https://huggingface.co/datasets/HuggingFaceFW/fineweb
11. d3m0t3p ◴[] No.44540873[source]
Hey, really cool project, I'm excited to see the outcome. Is there a blog/paper summarizing how you are doing it? Also, which research group at ETH is currently working on it?
12. badsectoracula ◴[] No.44540875[source]
Not sure about Swiss law, but the EU AI Act and the copyright directive it piggybacks on for this topic (Directive 2019/790 on copyright in the Digital Single Market) do allow training on copyrighted data as long as any opt-out mechanisms (e.g. robots.txt) are respected. AFAICT this LLM was trained by respecting those mechanisms (and, as linked elsewhere, they didn't find any practical difference in performance; note that there is an exception allowing the opt-out mechanisms to be ignored for research purposes, so they could make that comparison).
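As a concrete illustration of respecting such an opt-out, a crawler can check a site's robots.txt before fetching a page. Below is a minimal sketch using Python's standard library; the user-agent string is an illustrative assumption, not any project's actual crawler identity.

```python
# Minimal sketch of honouring a robots.txt opt-out before fetching a page.
# The user-agent name is an illustrative assumption.
from urllib import robotparser
from urllib.parse import urlparse

def allowed_to_fetch(url: str, user_agent: str = "ExampleTrainingCrawler") -> bool:
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # download and parse the site's robots.txt
    return rp.can_fetch(user_agent, url)

if allowed_to_fetch("https://example.com/some/article"):
    print("OK to include in the crawl")
else:
    print("Opted out via robots.txt; skip this page")
```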