AI scrapers request commented scripts

(cryptography.dog)

255 points ColinWright | 1 comments | 31 Oct 25 15:44 UTC


bakql ◴[31 Oct 25 18:41 UTC] No.45775259[source]▶

>These were scrapers, and they were most likely trying to non-consensually collect content for training LLMs.

"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.

Yes, I know about weev. That was a travesty.

replies(15): >>45775283 #>>45775392 #>>45775754 #>>45775912 #>>45775998 #>>45776008 #>>45776055 #>>45776210 #>>45776222 #>>45776270 #>>45776765 #>>45776932 #>>45777727 #>>45777934 #>>45778166 #

grayhatter ◴[31 Oct 25 23:10 UTC] No.45777727[source]▶

>>45775259 #

If you're lying in the requests you send, to trick my server into returning the content you want, instead of what I would want to return to webscrapers, that's non-consensual.

You don't need my permission to send a GET request, I completely agree. In fact, by having a publicly accessible webserver, there's implied consent that I'm willing to accept reasonable and valid GET requests.

But I have configured my server to spend server resources the way I want. You don't like how my server works, so you configure your bot to lie. If you get what you want only because you're willing to lie, where's the implied consent?

replies(2): >>45778691 #>>45780948 #

batch12 ◴[01 Nov 25 02:05 UTC] No.45778691[source]▶

>>45777727 #

Browser user agents have a history of being lies from the earliest days of usage. Official browsers lied about what they were, and still do.

replies(2): >>45779898 #>>45782316 #

jraph ◴[01 Nov 25 07:44 UTC] No.45779898[source]▶

>>45778691 #

Lies in user agent strings were for bypassing bugs, poor workarounds, and assumptions that became wrong. They are nothing like what we are talking about.

replies(2): >>45781047 #>>45782304 #

batch12 ◴[01 Nov 25 15:15 UTC] No.45782304[source]▶

>>45779898 #

Yes, the client wanted the server to deliver content it had intended for a different client, regardless of what the service operator wanted, so it lied using its user agent. Exact same thing we are talking about. The difference is that people don't want companies to profit off of their content. That's fair. In this case, they should maybe consider some form of real authentication, or if the bot is abusive, some kind of rate limiting control.
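For the rate limiting suggestion, a token bucket keyed per client is one common approach. The following is only a minimal sketch: the `TokenBucket` class, its parameters, and the choice of client key (for example an IP address) are assumptions for illustration, not anything described in the article.

```python
import time


class TokenBucket:
    """Per-client token bucket (hypothetical helper, not from the article)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock        # injectable so the behaviour can be tested
        self.buckets = {}         # client key -> (tokens, last timestamp)

    def allow(self, key):
        """Return True if the client identified by `key` may make a request now."""
        now = self.clock()
        tokens, last = self.buckets.get(key, (self.capacity, now))
        # Refill in proportion to elapsed time, never above capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[key] = (tokens - 1, now)
            return True
        self.buckets[key] = (tokens, now)
        return False
```

A server would call `allow(client_key)` before doing any expensive work and answer with HTTP 429 when it returns False. Keying on something the client cannot cheaply change matters here, since a bot that already lies about its user agent can rotate that string freely.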

replies(2): >>45782398 #>>45783429 #

1. grayhatter ◴[01 Nov 25 15:26 UTC] No.45782398[source]▶

>>45782304 #

> Yes, the client wanted the server to deliver content it had intended for a different client, regardless of what the service operator wanted, so it lied using its user agent.

I would actually argue it's not nearly the same type of misconfiguration. The reason scripts that have never been a browser omit their real identity is to evade bot detection. The reason browsers pack their UA with so much legacy data is misconfigured servers. The server owner wants to send data to users and their browsers but, through incompetence, has made a mistake. Browsers adapted by including extra strings in the UA to account for the expectations of incorrectly configured servers. Extra strings is the critical part: Googlebot's UA is an example of this being done correctly.
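To make the "extra strings" point concrete, here is a rough sketch that pulls the `name/version` product tokens out of two User-Agent strings. The Googlebot string is the widely published legacy form, and the Chrome string is a typical desktop one with an illustrative version number. The regular expression and helper names are assumptions made for this example, not a complete User-Agent parser.

```python
import re

# Widely published legacy Googlebot string: compatibility token plus real identity.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
# Typical desktop Chrome string; the version number is illustrative.
CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

# Simplified pattern for "name/version" tokens (an assumption, not the full grammar).
PRODUCT_RE = re.compile(r"([A-Za-z][A-Za-z0-9_.-]*)/([0-9][0-9A-Za-z.]*)")


def product_tokens(ua):
    """Return (name, version) pairs found anywhere in the User-Agent string."""
    return PRODUCT_RE.findall(ua)


def declares(ua, name):
    """True if the string carries a product token with the given name."""
    return any(n.lower() == name.lower() for n, _ in product_tokens(ua))


for ua in (GOOGLEBOT_UA, CHROME_UA):
    print([n for n, _ in product_tokens(ua)])
```

Both strings carry the legacy `Mozilla` token for compatibility, yet each still names what it really is (`Googlebot`, `Chrome`). A scraper that sends only a copied browser string adds nothing and simply drops its own identity.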
