
646 points blendergeek | 3 comments
hartator ◴[] No.42725964[source]
There are already “infinite” websites like these on the Internet.

Crawlers (both AI and regular search) have a set number of pages they want to crawl per domain. This number is usually determined by the popularity of the domain.

Unknown websites will get very few crawls per day, whereas popular sites get millions.

Source: I am the CEO of SerpApi.
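
For illustration, a minimal sketch of what a popularity-weighted per-domain crawl budget could look like. The thresholds, numbers, and function names here are assumptions for demonstration, not SerpApi's or any search engine's actual logic:

```python
# Illustrative sketch only: assign a daily page budget per domain
# from a popularity score (e.g. inbound links or traffic rank).
# All cutoffs and budgets below are made-up assumptions.

def daily_crawl_budget(popularity_score: float) -> int:
    """Map a 0..1 popularity score to a per-domain daily page budget."""
    if popularity_score < 0.1:
        return 10          # unknown site: a handful of pages per day
    if popularity_score < 0.7:
        return 10_000      # mid-tier site
    return 5_000_000       # very popular site: effectively millions


def should_fetch(pages_fetched_today: int, popularity_score: float) -> bool:
    """Stop fetching from a domain once today's budget is spent,
    no matter how many more links (or infinite pages) it exposes."""
    return pages_fetched_today < daily_crawl_budget(popularity_score)
```

With a cap like this, an "infinite" site only costs the crawler its fixed budget for that domain rather than unbounded requests.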

replies(9): >>42726093 #>>42726258 #>>42726572 #>>42727553 #>>42727737 #>>42727760 #>>42728210 #>>42728522 #>>42742537 #
1. angoragoats ◴[] No.42728522[source]
This may be true for large, established crawlers like those of Google, Bing, et al. I don’t see how you can make this a blanket statement about all crawlers, and my own personal experience tells me it isn’t correct.
replies(1): >>42731636 #
2. marginalia_nu ◴[] No.42731636[source]
These things are so common that having some way of dealing with them is basically mandatory if you plan on doing any sort of large-scale crawling.

That said, crawlers are fairly bug-prone, so misbehaving crawlers are also a relatively common sight. It's genuinely difficult to test a crawler properly, and nearly useless to build one from the specs alone: the realities of the web are so far off the charted territory that any test you write exercises something far removed from what you'll actually encounter. With real web data, the corner cases have corner cases, and the HTTP and HTML specs are but vague suggestions.
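
As a sketch of what "some way of dealing with them" can look like in practice, one common defense is a hard per-domain page cap plus limits on URL depth and length, which trap pages tend to blow past. The specific limits below are illustrative assumptions, not any particular crawler's settings:

```python
from collections import defaultdict
from urllib.parse import urlparse

# Illustrative trap defenses for a crawl frontier: cap pages per domain
# and reject suspiciously deep or long URLs. All limits are assumptions.
MAX_PAGES_PER_DOMAIN = 10_000
MAX_PATH_SEGMENTS = 12
MAX_URL_LENGTH = 2048

pages_seen: defaultdict[str, int] = defaultdict(int)

def accept_url(url: str) -> bool:
    """Return True if the URL should be enqueued for crawling."""
    parsed = urlparse(url)
    domain = parsed.netloc.lower()
    if pages_seen[domain] >= MAX_PAGES_PER_DOMAIN:
        return False                      # domain budget exhausted
    if len(url) > MAX_URL_LENGTH:
        return False                      # likely generated junk
    if len([s for s in parsed.path.split("/") if s]) > MAX_PATH_SEGMENTS:
        return False                      # suspiciously deep path
    pages_seen[domain] += 1
    return True
```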

replies(1): >>42732129 #
3. angoragoats ◴[] No.42732129[source]
I am aware of all of the things you mention (I've built crawlers before).

My point was only that there are plenty of crawlers that don't operate in the way the parent post described. If you want to call them buggy, that's fine.