
255 points ColinWright | 5 comments
bakql No.45775259
>These were scrapers, and they were most likely trying to non-consensually collect content for training LLMs.

"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.

Yes, I know about weev. That was a travesty.

replies(15): >>45775283 #>>45775392 #>>45775754 #>>45775912 #>>45775998 #>>45776008 #>>45776055 #>>45776210 #>>45776222 #>>45776270 #>>45776765 #>>45776932 #>>45777727 #>>45777934 #>>45778166 #
jraph No.45776210
When I open an HTTP server to the public web, I expect and welcome GET requests in general.

However,

(1) there's a difference between (a) a regular user browsing my websites and (b) robots DDoSing them. It was never okay to hammer a webserver. This is not new; it's for this reason that curl has long had options to throttle repeated requests to servers (a rough sketch of polite fetching is at the end of this comment). In real life, many things are offered for free, and it's usually not okay to take all of it. Yes, that would be abuse. And no, the correct answer to such a situation is not "but it was free, don't offer it for free if you don't want it to be taken for free". Same thing here.

(2) there's a difference between (a) a regular user reading my website, or even copying and redistributing my content as long as the work's license and fair use or related laws are respected, and (b) a robot counterfeiting it (yeah, I agree with another commenter: theft is not the right word, so let's call a spade a spade).

(3) well-behaved robots are expected to respect robots.txt. This is not the law; it's about being respectful. It is only fair that badly behaved robots get called out (the sketch below also shows a robots.txt check).

Well-behaved robots do not usually use millions of residential IPs obtained through shady apps just to "perform a GET request to an open HTTP server".
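
Rough sketch of what I mean in (1) and (3), using only Python's standard library; the URLs, user agent and delay are made-up placeholders, not a recommendation:

    # Minimal sketch of a polite fetcher: it checks robots.txt and throttles
    # its own requests. Every name and value here is illustrative.
    import time
    import urllib.robotparser
    import urllib.request

    USER_AGENT = "ExampleBot/0.1 (+https://example.org/bot)"  # hypothetical bot identity
    DELAY_SECONDS = 5  # arbitrary politeness delay between requests

    robots = urllib.robotparser.RobotFileParser()
    robots.set_url("https://example.org/robots.txt")
    robots.read()

    def polite_get(url):
        # Skip anything robots.txt disallows for our user agent.
        if not robots.can_fetch(USER_AGENT, url):
            return None
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
        time.sleep(DELAY_SECONDS)  # don't hammer the server
        return body

    for page in ["https://example.org/", "https://example.org/about"]:
        polite_get(page)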

replies(3): >>45776825 #>>45778554 #>>45779457 #
1. Aloisius No.45778554
> Well behaved robots do not usually use millions of residential IPs

Some antivirus and parental control software will scan links sent to someone, with the request coming from that person's machine (or from their access point/router).

Even some antivirus services will fetch links from residential IPs in order to detect malware on sites configured to serve it only to residential IPs.

Actually, I'm not entirely sure how one would tell the difference, just from access logs, between user software scanning links to detect adult content/malware/etc., randos crawling the web for personal information/vulnerable sites/etc., and these supposed "AI crawlers".

While I'm certainly not going to dismiss the idea that these are poorly configured crawlers at some major AI company, I haven't seen much in the way of evidence that that's the case.

replies(1): >>45778597 #
2. kijin No.45778597
Occasionally fetching a link will probably go unnoticed.

If your antivirus software hammers the same website several times a second for hours on end, in a way that is indistinguishable from an "AI crawler", then maybe it's really misbehaving and should be stopped from doing so.
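
For what it's worth, the difference usually shows up as volume in the access logs. A rough sketch of how one might spot it (the log path, line format and threshold here are illustrative assumptions):

    # Rough sketch: count requests per client IP per minute in a
    # common/combined-format access log, to separate an occasional link check
    # from sustained hammering. Path, regex and threshold are illustrative.
    import re
    from collections import Counter

    LOG_PATH = "access.log"   # hypothetical log location
    THRESHOLD = 120           # ~2 requests/second sustained over a minute

    # e.g. 203.0.113.7 - - [02/Nov/2025:10:14:03 +0000] "GET / HTTP/1.1" 200 ...
    LINE = re.compile(r'^(\S+) \S+ \S+ \[(\d+/\w+/\d+:\d+:\d+):\d+ ')

    counts = Counter()
    with open(LOG_PATH) as log:
        for line in log:
            m = LINE.match(line)
            if m:
                ip, minute = m.groups()
                counts[(ip, minute)] += 1

    for (ip, minute), n in counts.most_common():
        if n >= THRESHOLD:
            print(ip, "made", n, "requests in minute", minute)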

replies(1): >>45778846 #
3. Aloisius No.45778846
Legitimate software that scans links is often well behaved in isolation. It's when that software is installed on millions of computers that, in aggregate, it can behave poorly. This isn't particularly new, though. RSS readers used to blow up small websites that couldn't handle them. Now, with some browsers speculatively loading links, you can be hammered simply because you're linked from a popular site, even if no one actually clicks the link.

Personally, I'm skeptical of blaming everything on AI scrapers. Everything people are complaining about has been happening for decades: mostly by people searching for website vulnerabilities/sensitive info who don't care if they're misbehaving, sometimes by random individuals who want to archive a site or are playing with a crawler and don't see why they should slow it down.

Even the techniques for poisoning aggressive or impolite crawlers are at least 30 years old.

replies(1): >>45779273 #
4. kijin No.45779273
Yes, and sysadmins have been quietly banning those misbehaving programs for the last 30 years.

The only thing that seems to have changed is that today's thread is full of people who think they have some sort of human right to access any website by any means possible, including their sloppy vibe-coded crawler. In the past, IIRC, people used to be a little more apologetic about consuming other people's resources and did their best to fly below the radar.

It's my website. I have every right to block anyone at any time for any reason whatsoever. Whether or not your use case is "legitimate" is beside the point.

replies(1): >>45783426 #
5. ToucanLoucan No.45783426
The entitlement of so many modern vibe coders (or, as we used to call them, script kiddies) is absolutely off the charts. Just because there isn't a rule or law expressly against what you're doing doesn't mean it's perfectly fine to do. Websites are hosted and funded by people, and if your shitty scraper racks up a ton of traffic on one of my sites, I may end up on the hook for it. I am perfectly within both my rights and my ethical boundaries to block your IP(s).

And just so it's not left merely implied: I don't give a rat's ass if that slows down your "innovation." Go away.