
524 points andy99 | 3 comments
k__ No.44536047
"respecting web crawling opt-outs during data acquisition produces virtually no performance degradation"

Great to read that!

1. stephen_cagle No.44539036
I wonder if the reason for these results is that data on the internet is already copied to other locations by actors who ignore crawling opt-outs. So even if a lab respects every opt-out, it still effectively acquires the data, because someone who ignored the opt-out has re-hosted it on a site that carries no opt-out.
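For context, the opt-outs being discussed are typically expressed in robots.txt. A minimal sketch of honoring one with Python's standard library (the bot name and URL here are purely illustrative):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that opts out of one AI crawler
robots_txt = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A crawler respecting the opt-out would skip this site entirely
print(rp.can_fetch("ExampleAIBot", "https://example.com/article"))  # False
print(rp.can_fetch("OtherBot", "https://example.com/article"))      # True
```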
2. conradkay No.44539370
My guess is that respecting opt-outs doesn't remove that much of the data, and that the post-training data (which isn't just randomly scraped from the web) probably matters more.
3. lllllm No.44539981
Yes, this is an interesting question. In our arXiv paper [1] we studied this for news articles and also removed duplicates of articles (decontamination). We did not observe an impact on the LLM's downstream accuracy in the case of news data.

[1] https://arxiv.org/abs/2504.06219
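The paper's decontamination pipeline is more involved, but the core idea of removing duplicated articles can be sketched as exact-duplicate removal over normalized text (function names and the toy documents below are illustrative):

```python
import hashlib

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivial reflows hash identically
    return " ".join(text.lower().split())

def dedupe(articles: list[str]) -> list[str]:
    """Keep only the first occurrence of each normalized article."""
    seen, unique = set(), []
    for article in articles:
        digest = hashlib.sha256(normalize(article).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(article)
    return unique

docs = [
    "Breaking news:  markets rally.",
    "breaking news: markets rally.",  # near-verbatim copy on another site
    "Weather update.",
]
print(len(dedupe(docs)))  # 2
```

Real pipelines usually go further, using near-duplicate detection (e.g. MinHash over shingles) rather than exact hashes, since mirrored copies often differ in boilerplate.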