
505 points andy99 | 8 comments
1. WeirderScience No.44536327
The open training data is a huge differentiator. Is this the first truly open dataset at this scale? Prior efforts like The Pile were valuable, but far smaller and partly encumbered by copyright (Books3, for instance). Curious to see how reproducible the training actually is.
replies(2): >>44536400 >>44537249
2. layer8 No.44536400
> The model will be fully open: source code and weights will be publicly available, and the training data will be transparent and reproducible

This leads me to believe that the training data won't be made publicly available in full, merely "reproducible": they might provide references, such as a list of URLs of the pages they trained on, but not the pages' contents.
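
For what it's worth, a URL list could still be made verifiable if it shipped with content hashes. A minimal sketch, assuming a hypothetical tab-separated manifest of sha256 hashes and URLs (not a format they've announced):

    import hashlib
    import urllib.request

    def verify_manifest(manifest_path):
        # Each manifest line: "<sha256-hex>\t<url>" (assumed format)
        with open(manifest_path) as f:
            for line in f:
                expected, url = line.rstrip("\n").split("\t", 1)
                with urllib.request.urlopen(url) as resp:
                    data = resp.read()
                actual = hashlib.sha256(data).hexdigest()
                print("OK" if actual == expected else "CHANGED", url)

    verify_manifest("training_data_manifest.tsv")  # hypothetical filename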

replies(3): >>44536448 >>44536623 >>44536818
3. WeirderScience No.44536448
Yeah, I suspect you're right. Still, even a bare list of URLs for a frontier model (assuming it does turn out to be at that level) would be an improvement over the current situation.
4. glhaynes No.44536623
That wouldn't be reproducible if the content at those URLs changes. (Er, unless they were all web.archive.org URLs or something.)
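
The Internet Archive does expose an availability API that resolves a live URL to its nearest snapshot, so pinning a URL list to archived copies is at least mechanically straightforward. A rough sketch against that real endpoint (the example URL and timestamp are placeholders):

    import json
    import urllib.parse
    import urllib.request

    def closest_snapshot(url, timestamp="20240101"):
        # Ask the Wayback Machine for the snapshot closest to `timestamp`.
        query = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
        with urllib.request.urlopen(
                "https://archive.org/wayback/available?" + query) as resp:
            payload = json.load(resp)
        closest = payload.get("archived_snapshots", {}).get("closest")
        return closest["url"] if closest and closest.get("available") else None

    print(closest_snapshot("https://example.com/"))
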
replies(1): >>44536997
5. TobTobXX No.44536818
Well, when the actual content runs to hundreds of terabytes, providing URLs may be more practical, both for them and for anyone trying to reproduce the work.
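
Rough numbers to make the point, assuming a sustained 1 Gbit/s link (generous for most mirrors):

    # Back-of-envelope: one full copy of 100 TB over a 1 Gbit/s link.
    size_bits = 100e12 * 8            # 100 TB in bits
    rate_bits_per_s = 1e9             # 1 Gbit/s
    days = size_bits / rate_bits_per_s / 86400
    print(f"{days:.1f} days")         # ~9.3 days per full download
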
replies(1): >>44537342
6. dietr1ch No.44536997{3}
This is a problem with the Web. It should be easier to download content incrementally, the way you'd update a Git repo.
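
HTTP's closest analogue is the conditional request: remember the ETag from the last fetch, and the server can answer 304 Not Modified instead of resending the body. A minimal sketch (it only helps where servers actually emit ETags):

    import urllib.error
    import urllib.request

    def fetch_if_changed(url, cached_etag=None):
        req = urllib.request.Request(url)
        if cached_etag:
            # Ask the server to skip the body if nothing changed.
            req.add_header("If-None-Match", cached_etag)
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.read(), resp.headers.get("ETag")
        except urllib.error.HTTPError as e:
            if e.code == 304:         # unchanged since the cached copy
                return None, cached_etag
            raise

    body, etag = fetch_if_changed("https://example.com/")
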
7. evolvedlight No.44537249
Yup, it's not a dataset packaged the way you'd hope for here, since it still contains traditionally copyrighted material.
8. layer8 No.44537342{3}
The difference between content they're allowed to train on and content they're allowed to distribute copies of is likely at least as relevant.