255 points ColinWright | 10 comments

bakql ◴[] No.45775259[source]
>These were scrapers, and they were most likely trying to non-consensually collect content for training LLMs.

"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.

Yes, I know about weev. That was a travesty.

replies(15): >>45775283 #>>45775392 #>>45775754 #>>45775912 #>>45775998 #>>45776008 #>>45776055 #>>45776210 #>>45776222 #>>45776270 #>>45776765 #>>45776932 #>>45777727 #>>45777934 #>>45778166 #
grayhatter ◴[] No.45777727[source]
If you're lying in the requests you send to trick my server into returning the content you want, instead of what I would want to return to web scrapers, that's non-consensual.

You don't need my permission to send a GET request, I completely agree. In fact, by having a publicly accessible webserver, there's implied consent that I'm willing to accept reasonable and valid GET requests.

But I have configured my server to spend its resources the way I want; you don't like how my server works, so you configure your bot to lie. If you only get what you want because you're willing to lie, where's the implied consent?

replies(2): >>45778691 #>>45780948 #
1. batch12 ◴[] No.45778691[source]
Browser user agents have a history of being lies from the earliest days of the web. Official browsers lied about what they were, and they still do.
replies(2): >>45779898 #>>45782316 #
2. jraph ◴[] No.45779898[source]
Lies in user agent strings were for bypassing bugs, poor workarounds, and assumptions that became wrong; they are nothing like what we are talking about.
replies(2): >>45781047 #>>45782304 #
3. gkbrk ◴[] No.45781047[source]
A server returning HTML for Chrome but not cURL seems like a bug, no?

This is why there are so many libraries to make requests that look like they came from a browser: to work around buggy servers or server operators with wrong assumptions.
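
A minimal sketch of what those libraries do, assuming a UA-sniffing server; the URL and the UA string here are placeholders:

    import urllib.request

    # urllib's default UA is "Python-urllib/3.x"; present a
    # browser-style string instead so UA-sniffing servers
    # return their "browser" response.
    req = urllib.request.Request(
        "https://example.com/",
        headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/124.0.0.0 Safari/537.36"
        },
    )
    with urllib.request.urlopen(req) as resp:
        html = resp.read().decode("utf-8", errors="replace")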

replies(1): >>45782242 #
4. grayhatter ◴[] No.45782242{3}[source]
> A server returning HTML for Chrome but not cURL seems like a bug, no?

tell me you've never heard of https://wttr.in/ without telling me. :P

It would absolutely be a bug iff this site returned HTML to curl.
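
For anyone who hasn't seen it: wttr.in deliberately keys its response format off the user agent, so console clients get plain text and browsers get HTML. A toy sketch of that negotiation (my reconstruction of the idea, not its actual code):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    CONSOLE_AGENTS = ("curl", "wget", "httpie")  # illustrative list

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            ua = self.headers.get("User-Agent", "").lower()
            if ua.startswith(CONSOLE_AGENTS):
                body, ctype = b"overcast, +12 C\n", "text/plain"
            else:
                body, ctype = b"<html><body>overcast, +12 C</body></html>", "text/html"
            self.send_response(200)
            self.send_header("Content-Type", ctype)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()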

> This is why there are so many libraries to make requests that look like they came from a browser: to work around buggy servers or server operators with wrong assumptions.

This is a shallow take; the best counterexample is how googlebot has no problem identifying itself both in and out of the user agent. Do note that user agent packing is distinctly different from a fake user agent selected randomly from a list of the most common ones.

The existence of many libraries with the intent to help conceal the truth about a request doesn't feel like proof that's what everyone should be doing. It feels more like proof that most people only want to serve traffic to browsers and real users. And it's the bots and scripts that are the fuckups.
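
On "out of the user agent": googlebot can be verified independently of any header, via forward-confirmed reverse DNS. A rough sketch of that documented check (error handling kept minimal):

    import socket

    def is_verified_googlebot(ip: str) -> bool:
        # 1) reverse-resolve the IP, 2) require a Google-owned name,
        # 3) forward-resolve the name and confirm it maps back to the IP.
        try:
            host = socket.gethostbyaddr(ip)[0]
            if not host.endswith((".googlebot.com", ".google.com")):
                return False
            return ip in socket.gethostbyname_ex(host)[2]
        except OSError:
            return False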

replies(1): >>45782359 #
5. batch12 ◴[] No.45782304[source]
Yes, the client wanted the server to deliver content it had intended for a different client, regardless of what the service operator wanted, so it lied using its user agent. That's the exact same thing we are talking about. The difference is that people don't want companies to profit off of their content. That's fair. In this case, they should maybe consider some form of real authentication or, if the bot is abusive, some kind of rate-limiting control.
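
A sketch of that rate-limiting option, as a per-IP token bucket (the thresholds are invented for illustration):

    import time
    from collections import defaultdict

    RATE, BURST = 5.0, 20.0  # tokens/second and bucket size (made up)
    buckets = defaultdict(lambda: [BURST, time.monotonic()])

    def allow(ip: str) -> bool:
        tokens, last = buckets[ip]
        now = time.monotonic()
        tokens = min(BURST, tokens + (now - last) * RATE)
        allowed = tokens >= 1.0
        buckets[ip] = [tokens - 1.0 if allowed else tokens, now]
        return allowed
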
replies(2): >>45782398 #>>45783429 #
6. grayhatter ◴[] No.45782316[source]
Can you give a single example of a browser with a user agent that lies about its real origin?

The best I can come up with is the TOR browser, which will reduce the number of bits of information it returns, but I don't consider that to be misleading. It's a custom build of firefox that discloses it is firefox, and otherwise behaves exactly as I would expect firefox to behave.

7. batch12 ◴[] No.45782359{4}[source]
Googlebot has no problem identifying itself because Google knows that you want it to index your site if you want visitors. It doesn't identify itself to give you the option to block it. It identifies itself so you don't.
replies(1): >>45782479 #
8. grayhatter ◴[] No.45782398{3}[source]
> Yes, the client wanted the server to deliver content it had intended for a different client, regardless of what the service operator wanted, so it lied using its user agent.

I would actually argue it's not nearly the same type of misconfiguration. The reason scripts, which have never been a browser, omit their real identity is to evade bot detection. The reason browsers pack their UA with so much legacy data is misconfigured servers: the server owner wants to send data to users and their browsers, but through incompetence they've made a mistake. Browsers adapted by including extra strings in the UA to account for the expectations of incorrectly configured servers. Extra strings being the critical part; Google bot's UA is an example of this being done correctly.
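
To make the "packing" concrete, compare a current Chrome UA with googlebot's; both carry the historical compatibility tokens, but googlebot adds an honest identity on top:

    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
        (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36

    Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)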

9. grayhatter ◴[] No.45782479{5}[source]
I care much less about being indexed by Google than you might think.

Google bot doesn't get blocked from my server primarily because it's a *very* well behaved bot. It sends a lot of requests, but it's very kind, and has never acted in a way that could overload my server. It respects robots.txt, and identifies itself multiple times.

Google bot doesn't get blocked because it's a well behaved bot that eagerly follows the rules. I wouldn't underestimate how far that goes towards the reason it doesn't get blocked; much more than the power gained by being Google search.
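
For reference, the robots.txt it respects is just a plain-text policy file; an illustrative two-line example:

    User-agent: Googlebot
    Disallow: /private/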

10. jraph ◴[] No.45783429{3}[source]
Add "assumptions that became wrong" to "intended" and the perspective radically changes, to the point that omitting this part from my comment changes everything.

I would even add:

> the client wanted the server to deliver content it had intended for a different client

In most cases, the webmaster intended their work to look good, not really to send different content to different clients. That latter part is a technical means, a workaround. The intent of bringing the OK version to the end user was respected… even better with the user agent lies!

> The difference is that people don't want companies to profit off of their content.

Indeed¹, and also they don't want terrible bots to bring down their servers.

1: well, my open source work explicitly allows people to profit off of it - as long as the license is respected (attribution, copyleft, etc)