
646 points blendergeek | 1 comment
quchen ◴[] No.42725651[source]
Unless this concept becomes a mass phenomenon with many implementations, isn’t this pretty easy to filter out? And furthermore, since it antagonizes billion-dollar companies that can spin up teams doing nothing but browsing GitHub and HN for software like this to keep it from polluting their data lakes, I wonder whether this is a very effective approach.
replies(9): >>42725708 #>>42725957 #>>42725983 #>>42726183 #>>42726352 #>>42726426 #>>42727567 #>>42728923 #>>42730108 #
grajaganDev ◴[] No.42725708[source]
I am not sure. How would crawlers filter this?
replies(2): >>42725835 #>>42726294 #
marginalia_nu ◴[] No.42726294[source]
You limit the crawl time or number of requests per domain for all domains, and set the limit proportional to how important the domain is.

There are a ton of these kinds of things online; you can't, e.g., exhaustively crawl every Wikipedia mirror someone has put online.
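
A minimal sketch of that kind of per-domain budgeting (all names and numbers here are assumptions for illustration, not marginalia's actual implementation): each domain gets a request budget proportional to an importance score, so a low-importance tarpit can only burn its own small budget before the crawler moves on.

```python
# Sketch: per-domain crawl budget proportional to domain importance.
# Hypothetical names and values throughout; not any real crawler's API.
from collections import defaultdict
from urllib.parse import urlparse

BASE_BUDGET = 100  # assumed floor for unknown domains


class CrawlBudget:
    def __init__(self, importance_scores):
        # importance_scores: domain -> relative importance
        # (e.g. derived from a link graph or traffic rank)
        self.importance = importance_scores
        self.requests_made = defaultdict(int)

    def budget_for(self, domain):
        # Budget scales with importance; unknown domains get the base floor.
        return int(BASE_BUDGET * self.importance.get(domain, 1.0))

    def allow(self, url):
        domain = urlparse(url).netloc
        if self.requests_made[domain] >= self.budget_for(domain):
            return False  # budget exhausted: skip further URLs from this domain
        self.requests_made[domain] += 1
        return True


# Usage: the tarpit domain is cut off after ~100 requests,
# while an important domain can be crawled far deeper.
budget = CrawlBudget({"en.wikipedia.org": 500.0, "tarpit.example": 1.0})
```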