
646 points blendergeek | 3 comments
quchen ◴[] No.42725651[source]
Unless this concept becomes a mass phenomenon with many implementations, isn’t this pretty easy to filter out? And since it antagonizes billion-dollar companies that can spin up teams doing nothing but browsing GitHub and HN for software like this to keep it from polluting their data lakes, I wonder whether this is an efficient approach.
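A minimal sketch of the kind of crawler-side defense being alluded to here, assuming a hypothetical `should_fetch` hook and an illustrative per-domain cap (neither is taken from any real crawler): a crawler that budgets requests per domain never gets lost in an infinite link maze.

```python
# Sketch: cap pages fetched per domain so a link-maze tarpit only
# costs a bounded number of requests. Threshold is illustrative.
from collections import Counter
from urllib.parse import urlparse

MAX_PAGES_PER_DOMAIN = 500  # assumption: a plausible, not real-world, budget

pages_fetched = Counter()

def should_fetch(url: str) -> bool:
    """Return False once a domain has used up its page budget."""
    domain = urlparse(url).netloc
    if pages_fetched[domain] >= MAX_PAGES_PER_DOMAIN:
        return False
    pages_fetched[domain] += 1
    return True
```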
replies(9): >>42725708 #>>42725957 #>>42725983 #>>42726183 #>>42726352 #>>42726426 #>>42727567 #>>42728923 #>>42730108 #
1. Blackthorn ◴[] No.42725957[source]
If it keeps your own content safe when you deploy it in a corner of your website: mission accomplished!
replies(2): >>42726400 #>>42727416 #
2. gruez ◴[] No.42726400[source]
>If it keeps your own content safe

Not really? As mentioned by others, such tarpits are easily mitigated by using a priority queue. For instance, crawlers can prioritize external links over internal links, which means if your blog post makes it to HN, it'll get crawled ahead of the tarpit. If it's discoverable and readable by actual humans, AI bots will be able to scrape it.
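A minimal sketch of the priority-queue idea described above, assuming hypothetical `fetch` and `extract_links` helpers: external links are enqueued ahead of same-site links, so a tarpit's internal maze can never starve the rest of the frontier.

```python
import heapq
from urllib.parse import urlparse

def crawl(seed_urls, fetch, extract_links, max_pages=10_000):
    """Crawl that always pops the lowest priority number first, so
    cross-domain (external) links are visited before same-domain ones."""
    frontier = [(0, url) for url in seed_urls]  # (priority, url)
    heapq.heapify(frontier)
    seen = set(seed_urls)
    pages = 0
    while frontier and pages < max_pages:
        _, url = heapq.heappop(frontier)
        html = fetch(url)  # assumed helper: returns the page body
        pages += 1
        src_domain = urlparse(url).netloc
        for link in extract_links(html, base=url):  # assumed helper
            if link in seen:
                continue
            seen.add(link)
            # External links get priority 0, internal ones priority 1,
            # so a same-site link maze sits at the back of the queue.
            priority = 0 if urlparse(link).netloc != src_domain else 1
            heapq.heappush(frontier, (priority, link))
```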

3. Blackthorn ◴[] No.42728175[source]
You've got to be seriously AI-drunk to equate letting your site be crawled by commercial scrapers with "contributing to humanity".

Maybe you don't want your stuff to get thrown into the latest Silicon Valley commercial operation without getting paid for it. That seems like a valid position to take. Or maybe you just don't want Claude's ridiculously badly behaved scraper to chew through your entire budget.

Regardless, scrapers that don't follow rules like robots.txt will pretty quickly discover why those rules exist in the first place, as they receive increasing amounts of garbage.
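For what following robots.txt actually takes, here is a minimal sketch using Python's standard-library parser; the user agent and URLs below are placeholders, not from the thread.

```python
# Sketch: check robots.txt before fetching a page. A crawler that does
# this never wanders into a path the site owner has disallowed, which
# is typically where tarpits are parked.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()  # fetches and parses the robots.txt file

if rp.can_fetch("MyCrawler/1.0", "https://example.com/some/page"):
    pass  # allowed: go ahead and fetch the page
else:
    pass  # disallowed: respect the rule and skip it
```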