
646 points blendergeek | 3 comments
hartator ◴[] No.42725964[source]
There are already “infinite” websites like these on the Internet.

Crawlers (both AI and regular search) have a set number of pages they want to crawl per domain. This number is usually determined by the popularity of the domain.

Unknown websites will get very few crawls per day, whereas popular sites get millions.

Source: I am the CEO of SerpApi.
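
For illustration, a minimal sketch of what a popularity-weighted per-domain crawl budget could look like. The thresholds, numbers, and function names here are assumptions for demonstration, not SerpApi's or any search engine's actual logic:

```python
# Illustrative sketch only: assign a daily page budget per domain
# from a popularity score (e.g. inbound links or traffic rank).
# All cutoffs and budgets below are made-up assumptions.

def daily_crawl_budget(popularity_score: float) -> int:
    """Map a 0..1 popularity score to a per-domain daily page budget."""
    if popularity_score < 0.1:
        return 10          # unknown site: a handful of pages per day
    if popularity_score < 0.7:
        return 10_000      # mid-tier site
    return 5_000_000       # very popular site: effectively millions


def should_fetch(pages_fetched_today: int, popularity_score: float) -> bool:
    """Stop fetching from a domain once today's budget is spent,
    no matter how many more links (or infinite pages) it exposes."""
    return pages_fetched_today < daily_crawl_budget(popularity_score)
```

With a cap like this, an "infinite" site only costs the crawler its fixed budget for that domain rather than unbounded requests.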

replies(9): >>42726093 #>>42726258 #>>42726572 #>>42727553 #>>42727737 #>>42727760 #>>42728210 #>>42728522 #>>42742537 #
1. angoragoats ◴[] No.42728522[source]
This may be true for large, established crawlers like those of Google, Bing, et al. I don’t see how you can make this a blanket statement about all crawlers, and my own personal experience tells me it isn’t correct.
replies(1): >>42731636 #
2. marginalia_nu ◴[] No.42731636[source]
These things are so common that having some way of dealing with them is basically mandatory if you plan on doing any sort of large-scale crawling.

That said, crawlers are fairly bug-prone, so misbehaving crawlers are also a relatively common sight. It's genuinely difficult to test a crawler properly, and nearly useless to build one from the specs alone: the realities of the web are so far off the charted territory that any test you write exercises something far removed from what you'll actually encounter. With real web data, the corner cases have corner cases, and the HTTP and HTML specs are but vague suggestions.
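
As a sketch of what "some way of dealing with them" can look like in practice, one common defense is a hard per-domain page cap plus limits on URL depth and length, which trap pages tend to blow past. The specific limits below are illustrative assumptions, not any particular crawler's settings:

```python
from collections import defaultdict
from urllib.parse import urlparse

# Illustrative trap defenses for a crawl frontier: cap pages per domain
# and reject suspiciously deep or long URLs. All limits are assumptions.
MAX_PAGES_PER_DOMAIN = 10_000
MAX_PATH_SEGMENTS = 12
MAX_URL_LENGTH = 2048

pages_seen: defaultdict[str, int] = defaultdict(int)

def accept_url(url: str) -> bool:
    """Return True if the URL should be enqueued for crawling."""
    parsed = urlparse(url)
    domain = parsed.netloc.lower()
    if pages_seen[domain] >= MAX_PAGES_PER_DOMAIN:
        return False                      # domain budget exhausted
    if len(url) > MAX_URL_LENGTH:
        return False                      # likely generated junk
    if len([s for s in parsed.path.split("/") if s]) > MAX_PATH_SEGMENTS:
        return False                      # suspiciously deep path
    pages_seen[domain] += 1
    return True
```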

replies(1): >>42732129 #
3. angoragoats ◴[] No.42732129[source]
I am aware of all of the things you mention (I've built crawlers before).

My point was only that there are plenty of crawlers that don't operate in the way the parent post described. If you want to call them buggy, that's fine.