
533 points andy99 | 1 comments
k__ ◴[] No.44536047[source]
"respecting web crawling opt-outs during data acquisition produces virtually no performance degradation"

Great to read that!

replies(3): >>44536377 #>>44538760 #>>44539036 #
stephen_cagle ◴[] No.44539036[source]
I wonder if the reason for these results is that any data on the internet has already been copied to other locations by actors who ignore crawling opt-outs. So even if a crawler respects every opt-out, it is still effectively collecting the same data, because someone else ignored the opt-out and republished the content on a site that has no opt-out of its own.
replies(2): >>44539370 #>>44539981 #
1. lllllm ◴[] No.44539981[source]
Yes, this is an interesting question. In our arXiv paper [1] we studied this for news articles, and also removed duplicates of those articles found elsewhere (decontamination). In the case of news data, we did not observe an impact on the downstream accuracy of the LLM.

[1] https://arxiv.org/abs/2504.06219
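
To make the two steps discussed in this thread concrete (honoring crawl opt-outs, then also dropping copies of opted-out text that show up on other sites), here is a minimal Python sketch. It is not the pipeline from the paper: the function names, the `ExampleBot` user agent, and the exact-match fingerprint are all assumptions for illustration, and a real decontamination step would likely use near-duplicate detection rather than a plain hash.

```python
import hashlib
import re
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def fingerprint(text: str) -> str:
    """Hash of lowercased, whitespace-collapsed text (catches exact copies only)."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def filter_corpus(docs, robots_by_host, user_agent="ExampleBot"):
    """docs: list of (host, url, text) tuples. Returns the docs to keep.

    Hypothetical helper: a doc is dropped if its host opts out, or if its
    text matches the text of any opted-out doc (a copy hosted elsewhere).
    """
    blocked = {
        fingerprint(text)
        for host, url, text in docs
        # Hosts with no known robots.txt are treated as having no opt-out.
        if not allowed(robots_by_host.get(host, ""), user_agent, url)
    }
    return [d for d in docs if fingerprint(d[2]) not in blocked]
```

With this sketch, a mirror that republishes an opted-out article is filtered out along with the original, which is the scenario raised in the parent comment.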