514 points | andy99 | 1 comment
k__ ◴[] No.44536047[source]
"respecting web crawling opt-outs during data acquisition produces virtually no performance degradation"

Great to read that!

replies(3): >>44536377 #>>44538760 #>>44539036 #
stephen_cagle ◴[] No.44539036[source]
I wonder if the reason for these results is that any data on the internet has already been copied to other locations by actors who ignore crawling opt-outs. So even a crawler that respects every opt-out still effectively collects the data, because it picks up a copy from a third party that ignored the original opt-out and hosts the copy without one.
replies(2): >>44539370 #>>44539981 #
1. conradkay ◴[] No.44539370[source]
My guess is that respecting opt-outs doesn't remove that much of the data, and the post-training data (which isn't just randomly scraped from the web) probably matters more.