AI scrapers request commented scripts

(cryptography.dog)

255 points ColinWright | 3 comments | 31 Oct 25 15:44 UTC

bakql ◴[31 Oct 25 18:41 UTC] No.45775259[source]▶

>These were scrapers, and they were most likely trying to non-consensually collect content for training LLMs.

"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.

Yes, I know about weev. That was a travesty.

replies(15): >>45775283 #>>45775392 #>>45775754 #>>45775912 #>>45775998 #>>45776008 #>>45776055 #>>45776210 #>>45776222 #>>45776270 #>>45776765 #>>45776932 #>>45777727 #>>45777934 #>>45778166 #

grayhatter ◴[31 Oct 25 23:10 UTC] No.45777727[source]▶

>>45775259 #

If you're lying in the requests you send, to trick my server into returning the content you want, instead of what I would want to return to webscrapers, that's non-consensual.

You don't need my permission to send a GET request, I completely agree. In fact, by having a publicly accessible webserver, there's implied consent that I'm willing to accept reasonable and valid GET requests.

But I have configured my server to spend server resources the way I want. You don't like how my server works, so you configure your bot to lie. If you get what you want only because you're willing to lie, where's the implied consent?

replies(2): >>45778691 #>>45780948 #

wqaatwt ◴[01 Nov 25 11:52 UTC] No.45780948[source]▶

>>45777727 #

So somebody concealing or obfuscating information that a browser would normally send, for privacy or other reasons, is also “lying” by that standard? Or someone using a VPN?

replies(1): >>45782152 #

grayhatter ◴[01 Nov 25 15:00 UTC] No.45782152[source]▶

>>45780948 #

Someone using a VPN is not lying. The intent of a user agent is to identify the software sending the request. The IP address isn't sent by the browser, and isn't part of the HTTP request. It's part of the routing information required to deliver the packet back to the client. If a client sent its "real" IP address as an HTTP header, and I tried to respond to that IP instead of the IP address from the TCP packet, the response would never arrive.

There's a difference between sending no data and sending false data. I don't block requests without HTTP referrers for that very reason.
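To make the distinction above concrete, here is a minimal sketch of where each piece of information comes from on the server side. The function names `parse_headers` and `describe_client` are hypothetical, and `peer_addr` is assumed to be the address the server got from the accepted connection (for example from `socket.accept()`), not anything the client wrote into the request.

```python
from typing import Optional


def parse_headers(raw_request: bytes) -> dict[str, str]:
    """Parse the header lines of a raw HTTP/1.1 request into a dict with lowercase keys."""
    head = raw_request.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    headers: dict[str, str] = {}
    for line in head.split("\r\n")[1:]:  # skip the request line ("GET / HTTP/1.1")
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def describe_client(raw_request: bytes, peer_addr: tuple[str, int]) -> dict[str, Optional[str]]:
    """Hypothetical helper: collect what the server knows about a client."""
    headers = parse_headers(raw_request)
    return {
        # Comes from the connection itself, not from the HTTP request.
        "ip": peer_addr[0],
        # Comes from the request; this is whatever the client chose to say.
        "user_agent": headers.get("user-agent"),
        # None means the header was simply not sent, which is different
        # from a header that was sent with a false value.
        "referer": headers.get("referer"),
    }
```

A request with no `Referer` header yields `None` for that field, so a server can tell "not sent" apart from "sent", although it cannot tell from the header alone whether a value that was sent is truthful.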

replies(1): >>45782227 #

wqaatwt ◴[01 Nov 25 15:08 UTC] No.45782227[source]▶

>>45782152 #

IIRC Firefox (and I assume other browsers), when using privacy/no-tracking mode, does send fake data.

replies(1): >>45782332 #

1. grayhatter ◴[01 Nov 25 15:19 UTC] No.45782332[source]▶

>>45782227 #

You're incorrect. I've never seen any browser, on its own, lie about its user agent. (I can set a custom string and lie with it, but that's not the agent doing it.)

Do you have a specific, concrete example in mind? Or are you thinking of a feature from something other than a mainstream browser?

replies(1): >>45785754 #

2. gkbrk ◴[01 Nov 25 21:54 UTC] No.45785754[source]▶

>>45782332 (TP) #

Firefox sends an incorrect version and operating system in its User-Agent when the privacy settings are turned on.

IIRC it defaults to a Windows user agent even when you use it on other operating systems.

replies(1): >>45788228 #

3. grayhatter ◴[02 Nov 25 06:30 UTC] No.45788228[source]▶

>>45785754 #

You're incorrect. I have Firefox configured with the most strict privacy settings, and it returns `Mozilla/5.0 (X11; Linux x86_64; rv:142.0) Gecko/20100101 Firefox/142.0`. With the exception of it being Wayland instead of X11, it's entirely accurate. Would love to see whatever gaslit you about something so easy to test and validate.
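For anyone who wants to check a claim like this against their own browser, here is a rough sketch that pulls the platform and version out of a Firefox-style User-Agent string so they can be compared with the machine actually in use. The pattern and the name `parse_firefox_ua` are illustrative assumptions, and the regex only covers the desktop Firefox layout shown in the string above.

```python
import re
from typing import Optional

# Assumed layout: Mozilla/5.0 (<platform>) Gecko/<date> Firefox/<version>
UA_PATTERN = re.compile(
    r"^Mozilla/5\.0 \((?P<platform>[^)]*)\) Gecko/\d+ Firefox/(?P<version>[\d.]+)"
)


def parse_firefox_ua(user_agent: str) -> Optional[dict[str, str]]:
    """Return the platform and version fields, or None if the string does not match."""
    match = UA_PATTERN.match(user_agent)
    if match is None:
        return None
    return {"platform": match.group("platform"), "version": match.group("version")}


ua = "Mozilla/5.0 (X11; Linux x86_64; rv:142.0) Gecko/20100101 Firefox/142.0"
fields = parse_firefox_ua(ua)
print(fields)
```

Comparing the `platform` field with the real operating system, and `version` with the installed browser version, shows whether a given configuration reports accurate values.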
