
211 points CrankyBear | 1 comments | | HN request time: 0.199s | source
k310 ◴[] No.45105673[source]
> Cloud services company Fastly agrees. It reports that 80% of all AI bot traffic comes from AI data fetcher bots.

No kidding. An increasing number of sites are putting up CAPTCHAs.

The problem? CAPTCHAs are annoying, they amount to a fifty-times-a-day eye exam, and

> Google's reCAPTCHA is not only useless, it's also basically spyware [0]

> reCAPTCHA v3's checkbox test doesn't stop bots and tracks user data

[0] https://www.techspot.com/news/106717-google-recaptcha-not-on...

1. ccgreg ◴[] No.45106701[source]
The Fastly report[1] has a couple of great quotes that mention Common Crawl's CCBot:

> Our observations also highlight the vital role of open data initiatives like Common Crawl. Unlike commercial crawlers, Common Crawl makes its data freely available to the public, helping create a more inclusive ecosystem for AI research and development. With coverage across 63% of the unique websites crawled by AI bots, substantially higher than most commercial alternatives, it plays a pivotal role in democratizing access to large-scale web data. This open-access model empowers a broader community of researchers and developers to train and improve AI models, fostering more diverse and widespread innovation in the field.

...

> What’s notable is that the top four crawlers (Meta, Google, OpenAI and Claude) seem to prefer Commerce websites. Common Crawl’s CCBot, whose open data set is widely used, has a balanced preference for Commerce, Media & Entertainment and High Tech sectors. Its commercial equivalents Timpibot and Diffbot seem to have a high preference for Media & Entertainment, perhaps to complement what’s available through Common Crawl.

And there's one final number that isn't in the Fastly report but does appear in the El Reg article[2]:

> The Common Crawl Project, which slurps websites to include in a free public dataset designed to prevent duplication of effort and traffic multiplication at the heart of the crawler problem, was a surprisingly-low 0.21 percent.
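That 0.21 percent figure is the kind of thing you can estimate from your own access logs. Here's a minimal sketch of tallying a crawler's share of requests by matching a User-Agent token; "CCBot" is Common Crawl's documented user-agent token, but the sample log lines below are purely illustrative:

```python
from collections import Counter

def crawler_share(user_agents, token="CCBot"):
    """Return the fraction of requests whose User-Agent contains `token`."""
    counts = Counter("match" if token in ua else "other" for ua in user_agents)
    total = sum(counts.values())
    return counts["match"] / total if total else 0.0

# Illustrative user-agent strings, not real log data.
uas = [
    "CCBot/2.0 (https://commoncrawl.org/faq/)",
    "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
]
print(crawler_share(uas))  # one CCBot request out of three
```

In practice you'd parse the User-Agent field out of your server's access log rather than hard-coding a list, and substring matching is only a rough heuristic since user-agents can be spoofed.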

1: https://learn.fastly.com/rs/025-XKO-469/images/Fastly-Threat...

2: https://www.theregister.com/2025/08/21/ai_crawler_traffic/