←back to thread

255 points ColinWright | 3 comments | | HN request time: 0s | source
Show context
bakql ◴[] No.45775259[source]
>These were scrapers, and they were most likely trying to non-consensually collect content for training LLMs.

"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.

Yes, I know about weev. That was a travesty.

replies(15): >>45775283 #>>45775392 #>>45775754 #>>45775912 #>>45775998 #>>45776008 #>>45776055 #>>45776210 #>>45776222 #>>45776270 #>>45776765 #>>45776932 #>>45777727 #>>45777934 #>>45778166 #
grayhatter ◴[] No.45777727[source]
If you're lying in the requests you send to trick my server into returning the content you want, instead of what I would want to return to web scrapers, that's non-consensual.

You don't need my permission to send a GET request, I completely agree. In fact, by having a publicly accessible webserver, I've given implied consent that I'm willing to accept reasonable, valid GET requests.

But I have configured my server to spend its resources the way I want. You don't like how my server works, so you configure your bot to lie. If you get what you want only because you're willing to lie, where's the implied consent?

replies(2): >>45778691 #>>45780948 #
batch12 ◴[] No.45778691[source]
Browser user agents have a history of being lies from the earliest days of usage. Official browsers lied about what they were, and still do.
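
For a concrete illustration, here is a representative modern Chrome user agent string (version numbers vary by release); nearly every token in it is a compatibility shim rather than the truth, claiming Mozilla, WebKit, KHTML/Gecko, and Safari while actually being Chrome:

    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
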
replies(2): >>45779898 #>>45782316 #
jraph ◴[] No.45779898[source]
Lies in user agent strings were for bypassing bugs, poor workarounds, and assumptions that became wrong; they are nothing like what we are talking about.
replies(2): >>45781047 #>>45782304 #
gkbrk ◴[] No.45781047[source]
A server returning HTML for Chrome but not cURL seems like a bug, no?

This is why there are so many libraries to make requests that look like they came from a browser, to work around buggy servers or server operators with wrong assumptions.
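
At its simplest, the trick those libraries perform is just setting headers. A minimal sketch in Python's requests library (the URL and version numbers are placeholders, and real scraping libraries go further than this):

    import requests

    url = "https://example.com/"  # placeholder URL

    # Honest request: the library announces itself (User-Agent: python-requests/x.y)
    honest = requests.get(url)

    # Disguised request: the same GET, but claiming to be desktop Chrome
    browser_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/124.0.0.0 Safari/537.36")
    disguised = requests.get(url, headers={"User-Agent": browser_ua})

    print(honest.status_code, disguised.status_code)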

replies(1): >>45782242 #
1. grayhatter ◴[] No.45782242[source]
> A server returning HTML for Chrome but not cURL seems like a bug, no?

tell me you've never heard of https://wttr.in/ without telling me. :P

It would absolutely be a bug iff this site returned HTML to curl.
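
For anyone who hasn't tried it, the behaviour is easy to see from a terminal (exact output varies, but the point is it's plain text, not HTML):

    curl https://wttr.in/London
    # => a plain-text weather report, because the User-Agent is curl;
    #    the same URL opened in a browser is served an HTML page instead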

> This is why there are so many libraries to make requests that look like they came from a browser, to work around buggy servers or server operators with wrong assumptions.

This is a shallow take; the best counterexample is how Googlebot has no problem identifying itself both in and out of the user agent. Do note that user agent packing is distinctly different from a fake user agent selected randomly from a list of the most common ones.
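
(The "out of the user agent" part is forward-confirmed reverse DNS, which Google documents as the way to verify Googlebot: reverse-resolve the requesting IP, check the hostname is a Google crawler domain, then forward-resolve it and confirm it points back at the same IP. A rough Python sketch, error handling trimmed:)

    import socket

    def is_verified_googlebot(ip):
        """Forward-confirmed reverse DNS check for a claimed Googlebot IP."""
        try:
            host = socket.gethostbyaddr(ip)[0]          # reverse lookup
        except OSError:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            return ip in socket.gethostbyname_ex(host)[2]  # forward-confirm
        except OSError:
            return False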

The existence of so many libraries intended to conceal the truth about a request doesn't feel like proof that this is what everyone should be doing. It feels more like proof that most people only want to serve traffic to browsers and real users. And it's the bots and scripts that are the fuckups.

replies(1): >>45782359 #
2. batch12 ◴[] No.45782359[source]
Googlebot has no problem identifying itself because Google knows that you want it to index your site if you want visitors. It doesn't identify itself to give you the option to block it. It identifies itself so you don't.
replies(1): >>45782479 #
3. grayhatter ◴[] No.45782479[source]
I care much less about being indexed by Google than you might think.

Googlebot doesn't get blocked from my server primarily because it's a *very* well behaved bot. It sends a lot of requests, but it's very kind, and has never acted in a way that could overload my server. It respects robots.txt, and it identifies itself in multiple ways.
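
(For readers who haven't written one: robots.txt is just a plain-text file served at the site root that well-behaved crawlers fetch and obey. A typical example, with hypothetical paths:)

    # served at https://example.com/robots.txt
    User-agent: *
    Disallow: /drafts/
    Crawl-delay: 10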

Googlebot doesn't get blocked because it's a well behaved bot that eagerly follows the rules. I wouldn't underestimate how far that goes toward why it doesn't get blocked; it matters much more than the power gained by being Google Search.