Nepenthes is a tarpit to catch AI web crawlers

(zadzmo.org)

714 points blendergeek | 5 comments | 16 Jan 25 13:57 UTC

hartator ◴[16 Jan 25 14:52 UTC] No.42725964[source]▶

There are already “infinite” websites like these on the Internet.

Crawlers (both AI and regular search) have a set number of pages they want to crawl per domain. This number is usually determined by the popularity of the domain.

Unknown websites will get very few crawls per day, whereas popular sites get millions.

Source: I am the CEO of SerpApi.
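
A per-domain crawl budget of the kind described above can be sketched as follows. The comment gives no formula or numbers, so the class name `CrawlBudget`, the limits, and the example domains are all hypothetical; this only illustrates why an "infinite" site would stop receiving requests once its daily allowance is spent.

```python
class CrawlBudget:
    """Tracks how many pages a crawler may still fetch per domain today.

    Hypothetical sketch: real crawlers derive limits from popularity
    signals that are not described in the thread.
    """

    def __init__(self, daily_limits, default_limit=10):
        # daily_limits maps a domain to its allowed fetches per day;
        # unknown domains fall back to a small default.
        self.daily_limits = daily_limits
        self.default_limit = default_limit
        self.used = {}

    def try_fetch(self, domain):
        """Return True and consume one unit if the domain has budget left."""
        limit = self.daily_limits.get(domain, self.default_limit)
        used = self.used.get(domain, 0)
        if used >= limit:
            return False
        self.used[domain] = used + 1
        return True
```

With a default limit of 3, the fourth request to an unknown domain is refused no matter how many links the site generates.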

replies(9): >>42726093 #>>42726258 #>>42726572 #>>42727553 #>>42727737 #>>42727760 #>>42728210 #>>42728522 #>>42742537 #

dawnerd ◴[16 Jan 25 16:36 UTC] No.42727553[source]▶

>>42725964 #

Looking at the logs for all of my sites, this isn't a global truth. I see multiple AI crawlers hammering away, requesting the same pages many, many times. Perplexity and Facebook are basically nonstop.
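
One way to check a claim like this against your own logs is to count how often each bot fetches the same path. The sketch below assumes access logs in the common "combined" format; the marker substrings in `BOT_MARKERS` and the function name `repeated_fetches` are illustrative choices, not something taken from the thread.

```python
import re
from collections import Counter

# Combined Log Format:
# ip ident user [time] "METHOD path PROTO" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?:\S+) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Illustrative substrings to look for in the User-Agent header (assumption).
BOT_MARKERS = ("PerplexityBot", "facebookexternalhit", "GPTBot")


def repeated_fetches(lines, min_hits=2):
    """Return {(bot, path): hits} for paths a bot requested at least min_hits times."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        user_agent = match.group("ua")
        for marker in BOT_MARKERS:
            if marker in user_agent:
                counts[(marker, match.group("path"))] += 1
                break
    return {key: hits for key, hits in counts.items() if hits >= min_hits}
```

Calling `repeated_fetches(open("access.log"))` would list the bot and path pairs that were fetched repeatedly; the file name here is a placeholder.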

replies(2): >>42727843 #>>42728930 #

1. jonatron ◴[16 Jan 25 17:02 UTC] No.42727843[source]▶

>>42727553 #

I just looked at the logs for a site, and I saw PerplexityBot fetching robots.txt and then ignoring it. They don't provide a list of IPs to verify whether it is actually them. Anyway, anyone with PerplexityBot in their user agent can get increasingly bad responses until the abuse stops.
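
A minimal sketch of "increasingly bad responses", assuming the server already extracts the bot's product token (for example `PerplexityBot`) from the User-Agent header. The comment does not say what the escalation looks like, so the thresholds, status codes, and delays in `respond` are assumptions; only `urllib.robotparser` is standard library behavior.

```python
from collections import defaultdict
from urllib.robotparser import RobotFileParser

# Example robots.txt that disallows one bot entirely (assumption).
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# Number of disallowed requests seen so far, per bot token.
violations = defaultdict(int)


def respond(bot_token, path):
    """Return (status, delay_seconds) for a request.

    bot_token should be the product token, not the full User-Agent string,
    because RobotFileParser only looks at the text before the first "/".
    """
    if rules.can_fetch(bot_token, path):
        return 200, 0.0
    violations[bot_token] += 1
    count = violations[bot_token]
    if count <= 3:
        return 403, 0.0  # plain refusal for the first few violations
    # After that, slow the client down with a growing delay, capped at 60 s.
    return 429, min(2.0 ** (count - 3), 60.0)
```

Because the user agent can be spoofed and there is no published IP list to check against, this penalizes whoever sends that token, which may or may not be the real crawler.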

replies(1): >>42728835 #

2. dawnerd ◴[16 Jan 25 18:18 UTC] No.42728835[source]▶

>>42727843 (TP) #

Perplexity is exceptionally bad because they say they respect robots.txt but clearly don't. When pressed on it, they basically shrug and say too bad, don't put stuff in public if you don't want it crawled. They got a UA block in Cloudflare, and it seems like that did the trick.

replies(2): >>42729201 #>>42732307 #

3. Dwedit ◴[16 Jan 25 18:45 UTC] No.42729201[source]▶

>>42728835 #

User Agent block just means they'd spoof their user agent.

replies(1): >>42747247 #

4. TeMPOraL ◴[16 Jan 25 23:32 UTC] No.42732307[source]▶

>>42728835 #

Interesting. Now they seem to claim that not only do they follow robots.txt for crawling, but that they also broke under pressure and made the unfortunate decision to have user requests follow robots.txt too.

https://www.perplexity.ai/de/hub/technical-faq/how-does-perp...

5. marginalia_nu ◴[18 Jan 25 10:11 UTC] No.42747247{3}[source]▶

>>42729201 #

That generally gives you even more trouble with Cloudflare. Behaving in any way inconsistent with your UA string is one of the easiest methods of identifying bots.

Yeah, you can use headless browsers, but then you're also using orders of magnitude more compute, and that's not really something that scales.

The best way to avoid ending up in captcha-land is to say who you are, and respect robots.txt.
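
To illustrate what "inconsistent with your UA string" can mean, here is a toy heuristic: a client that claims to be a regular browser but omits headers that mainstream browsers normally send. This is not Cloudflare's actual method, which the comment does not describe; the function name `ua_inconsistencies` and the header list are assumptions chosen for the example.

```python
def ua_inconsistencies(headers):
    """Return a list of reasons the request looks unlike the browser it claims to be.

    headers is a dict of HTTP request headers. Hypothetical heuristic only:
    real bot detection uses many more signals than this.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "")
    problems = []

    # Self-identified bots are not held to browser expectations.
    claims_browser = user_agent.startswith("Mozilla/") and "bot" not in user_agent.lower()
    if claims_browser:
        # Mainstream browsers normally send all of these on page requests.
        for name in ("accept", "accept-language", "accept-encoding"):
            if name not in normalized:
                problems.append(f"missing {name}")
    return problems
```

A crawler that honestly sends something like `ExampleBot/1.0` (a made-up token) returns no problems here, which matches the advice to say who you are.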
