
511 points andy99 | 19 comments
1. k__ No.44536047
"respecting web crawling opt-outs during data acquisition produces virtually no performance degradation"

Great to read that!
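For anyone wondering what "respecting opt-outs" looks like mechanically: sites declare them in robots.txt, and a compliant crawler checks that file before fetching anything. A minimal sketch using Python's standard urllib.robotparser (the user-agent string is just a common example, and this is not the paper's actual pipeline):

    from urllib import robotparser

    # Fetch and parse the site's robots.txt once per host.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # Skip any URL the site has opted out of for this user agent.
    # "GPTBot" is one well-known AI crawler UA, used here as an example.
    url = "https://example.com/docs/api.html"
    if rp.can_fetch("GPTBot", url):
        print("allowed to crawl:", url)
    else:
        print("opted out, skipping:", url)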

replies(3): >>44536377 #>>44538760 #>>44539036 #
2. Onavo No.44536377
No performance degradation on training metrics, but there is degradation for the end user. At the end of the day, users and website owners have completely orthogonal interests: users want answers and content, website owners want attention so they can upsell and push ads. You can only serve one master.
replies(1): >>44536456 #
3. esafak No.44536456
> Users want answers and content, website owners want attention so they can upsell/push ads. You can only serve one master

How are you going to serve users if website owners decide to wall off their content? You can't ignore one side of the market.

replies(1): >>44536786 #
4. Onavo No.44536786{3}
You don't. You bypass them with crawlers and don't reveal your training data. And this is exactly why open source models can't surpass open weight models.
replies(1): >>44537425 #
5. diggan No.44537425{4}
> And this is exactly why open source models can't surpass open weight models.

It is a fair point, but how strong a point remains to be seen. Some architectures are better than others even with the same training data, so it's not impossible that we could at some point see innovative architectures beating the current proprietary ones. It would probably be short-lived, though, as the proprietary ones would obviously improve in their next release after that.

replies(1): >>44537860 #
6. jowea No.44537860{5}
How can open source models that respect robots.txt possibly perform equally well if they are missing information that the other models have access to?
replies(3): >>44538826 #>>44539844 #>>44540399 #
7. JKCalhoun No.44538760
Is there not yet a source where the web has already been scraped and boiled down to just the text? It would seem someone would have created such a thing in order to save LLM training from having to reinvent the wheel.

I understand the web is a dynamic thing, but it would still seem useful on some level.

replies(1): >>44540972 #
8. datameta No.44538826{6}
How can we possibly find out without trying?
replies(1): >>44538893 #
9. jowea No.44538893{7}
It is logically impossible for an LLM to know, for example, that fooExecute() takes two int arguments if the documentation is blocked by robots.txt and there are no examples of fooExecute() usage in the wild, don't you agree?
replies(2): >>44539576 #>>44540377 #
10. stephen_cagle No.44539036
I wonder if the reason for these results is that any data on the internet is already copied to other locations by actors who ignore crawling opt-outs. So even a crawler that respects all web crawling opt-outs still effectively gets the data, because someone else who did not respect the opt-out has republished it somewhere without one.
replies(2): >>44539370 #>>44539981 #
11. conradkay No.44539370
My guess is that respecting opt-outs doesn't remove that much of the data, and that the post-training data (not just randomly scraped from the web) probably matters more.
12. tharant No.44539576{8}
Sure, the model would not “know” about your example, but that’s not the point; the penultimate[0] goal is for the model to figure out the method signature on its own just like a human dev might leverage her own knowledge and experience to infer that method signature. Intelligence isn’t just rote memorization.

[0] the ultimate, of course, being profit.

replies(1): >>44539831 #
13. jowea No.44539831{9}
I don't think a human dev can divine a method signature and effects in the general case either. Sure the add() function probably takes 2 numbers, but maybe it takes a list? Or a two-tuple? How would we or the LLM know without having the documentation? And yeah sure the LLM can look at the documentation while being used instead of it being part of the training dataset, but that's strictly inferior for practical uses, no?
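To make the ambiguity concrete, here are three incompatible but equally plausible APIs for an undocumented add() (hypothetical Python, purely to illustrate the point):

    # Hypothetical: without docs or usage examples in the training data,
    # nothing disambiguates these candidate signatures. (Shown side by
    # side; a real module would define only one.)
    def add_two_numbers(a: int, b: int) -> int:
        return a + b

    def add_list(values: list[int]) -> int:
        return sum(values)

    def add_pair(pair: tuple[int, int]) -> int:
        return pair[0] + pair[1]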

I'm not sure if we're thinking of the same field of AI development. I think I'm talking about the super-autocomplete with integrated copy of all of digitalized human knowledge, while you're talking about trying to do (proto-)AGI. Is that it?

replies(1): >>44540274 #
14. Dylan16807 No.44539844{6}
Maybe the missing data makes it 3% worse but the architecture is 5% better. Or your respect for robots.txt gets you more funding and you gain a 4% advantage by training longer.

Don't focus too much on a single variable, especially when all the variables have diminishing returns.

15. lllllm No.44539981
Yes, this is an interesting question. In our arxiv paper [1] we studied this for news articles, and also removed duplicate articles (decontamination). We did not observe an impact on the downstream accuracy of the LLM in the case of news data.

[1] https://arxiv.org/abs/2504.06219
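(Not necessarily the paper's exact procedure, but for intuition, here is a minimal sketch of exact-match deduplication by normalized text hash; real decontamination pipelines typically also catch near-duplicates:)

    import hashlib

    def normalize(text: str) -> str:
        # Collapse case and whitespace so trivial variants hash identically.
        return " ".join(text.lower().split())

    def dedup(articles: list[str]) -> list[str]:
        seen: set[str] = set()
        unique: list[str] = []
        for text in articles:
            digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique.append(text)
        return unique

    # Two of these normalize to the same string, so only two survive.
    print(len(dedup(["Same story.", "same  STORY.", "Different story."])))  # -> 2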

16. heavenlyblue No.44540274{10}
> Sure the add() function probably takes 2 numbers, but maybe it takes a list? Or a two-tuple? How would we or the LLM know without having the documentation?

You just listed the possible options in the order of their relative probability. A human would attempt to use them in exactly that order.

17. diggan No.44540377{8}
I agree, but I also think it's less important. I don't want a big fat LLM that memorized every API out there, where the weights have to be updated as soon as an API changes. I like the current approach of Codex (and similar tools), where they can look up the APIs they need as they're doing the work, so the same weights will continue to work no matter how much the APIs change.
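That look-up-at-use-time pattern is roughly this (a toy sketch; the tool name and wiring are invented for illustration, not Codex's actual interface):

    import urllib.request

    def fetch_docs(url: str) -> str:
        # Hypothetical tool: pull the *current* API docs at inference time.
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode("utf-8", errors="replace")

    def build_prompt(question: str, docs_url: str) -> str:
        # Fresh documentation is injected into the prompt on demand,
        # so the model's weights never need to memorize the API.
        docs = fetch_docs(docs_url)
        return f"Documentation:\n{docs}\n\nQuestion: {question}"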
18. lllllm No.44540399{6}
This is what this paper tries to answer: https://arxiv.org/abs/2504.06219. The quality gap between compliant and non-compliant is surprisingly small.
19. CaptainFever No.44540972
Common Crawl, maybe?
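For reference, Common Crawl publishes text-only "WET" extracts alongside the raw crawls; file paths are listed in each crawl's wet.paths.gz index at https://data.commoncrawl.org/. A sketch of reading one with the warcio library (the local filename below is a placeholder):

    # pip install warcio
    from warcio.archiveiterator import ArchiveIterator

    path = "CC-MAIN-example.warc.wet.gz"  # placeholder; see wet.paths.gz

    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            # WET files store the extracted plain text as 'conversion' records.
            if record.rec_type == "conversion":
                url = record.rec_headers.get_header("WARC-Target-URI")
                text = record.content_stream().read().decode("utf-8", errors="replace")
                print(url, len(text))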