
524 points andy99 | 3 comments
k__ No.44536047
"respecting web crawling opt-outs during data acquisition produces virtually no performance degradation"

Great to read that!

1. stephen_cagle No.44539036
I wonder if the reason for these results is that data on the internet is already copied to other locations by actors who ignore crawling opt-outs. So even if a lab respects every opt-out, it still effectively acquires the data, because someone who ignored the opt-out has re-hosted it on a site that carries no opt-out.
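For context, the opt-outs being discussed are typically expressed in robots.txt. A minimal sketch of honoring one with Python's standard library (the bot name and URL here are purely illustrative):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that opts out of one AI crawler
robots_txt = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A crawler respecting the opt-out would skip this site entirely
print(rp.can_fetch("ExampleAIBot", "https://example.com/article"))  # False
print(rp.can_fetch("OtherBot", "https://example.com/article"))      # True
```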
2. conradkay No.44539370
My guess is that respecting opt-outs doesn't remove that much of the data, and that the post-training data (which isn't just randomly scraped from the web) probably matters more.
3. lllllm No.44539981
Yes, this is an interesting question. In our arXiv paper [1] we studied this for news articles and also removed duplicates of articles (decontamination). We did not observe an impact on the LLM's downstream accuracy in the case of news data.

[1] https://arxiv.org/abs/2504.06219
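The paper's decontamination pipeline is more involved, but the core idea of removing duplicated articles can be sketched as exact-duplicate removal over normalized text (function names and the toy documents below are illustrative):

```python
import hashlib

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivial reflows hash identically
    return " ".join(text.lower().split())

def dedupe(articles: list[str]) -> list[str]:
    """Keep only the first occurrence of each normalized article."""
    seen, unique = set(), []
    for article in articles:
        digest = hashlib.sha256(normalize(article).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(article)
    return unique

docs = [
    "Breaking news:  markets rally.",
    "breaking news: markets rally.",  # near-verbatim copy on another site
    "Weather update.",
]
print(len(dedupe(docs)))  # 2
```

Real pipelines usually go further, using near-duplicate detection (e.g. MinHash over shingles) rather than exact hashes, since mirrored copies often differ in boilerplate.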