AI companies cause most of the traffic on forums

1. mentalgear ◴[30 Dec 24 17:42 UTC] No.42551541[source]▶

Noteworthy from the article (as some commenters suggested blocking them):

"If you try to rate-limit them, they’ll just switch to other IPs all the time. If you try to block them by User Agent string, they’ll just switch to a non-bot UA string (no, really). This is literally a DDoS on the entire internet."

replies(5): >>42551717 #>>42551976 #>>42552122 #>>42552700 #>>42552885 #

2. optimalsolver ◴[30 Dec 24 17:56 UTC] No.42551717[source]▶

>>42551541 (TP) #

Ban evasion for me, but not for thee.

3. IanKerr ◴[30 Dec 24 18:18 UTC] No.42551976[source]▶

>>42551541 (TP) #

This is the beginning of the end of the public internet, imo. Websites that aren't able to manage the bandwidth consumption of AI scrapers and the endless spam that will take over from LLMs writing comments on forums are going to go under. The only things left after AI has its way will be walled gardens with whitelisted entrants or communities on large websites like Facebook. Niche, public sites are going to become unsustainable.

replies(2): >>42552793 #>>42553743 #

4. loeg ◴[30 Dec 24 18:32 UTC] No.42552122[source]▶

>>42551541 (TP) #

I'd kind of like to see that claim substantiated a little more. Is it all crawlers that switch to a non-bot UA, or how are they determining it's the same bot? What non-bot UA do they claim?

replies(3): >>42552172 #>>42552177 #>>42555570 #

5. denschub ◴[30 Dec 24 18:37 UTC] No.42552172[source]▶

>>42552122 #

> Is it all crawlers that switch to a non-bot UA

I've observed only one of them do this with high confidence.

> how are they determining it's the same bot?

it's fairly easy to determine that it's the same bot, because as soon as I blocked the "official" one, a bunch of AWS IPs started crawling the same URL patterns - in this case, mediawiki's diff view (`/wiki/index.php?title=[page]&diff=[new-id]&oldid=[old-id]`), that absolutely no bot ever crawled before.
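For anyone wanting to run a similar check on their own logs, here is a minimal Python sketch. Only the diff-view URL shape comes from the comment above; the log tuples, helper names, IPs (reserved documentation ranges), and UA strings are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs


def looks_like_diff_view(url: str) -> bool:
    """True if the URL has the MediaWiki diff-view shape:
    index.php with title, diff, and oldid query parameters."""
    parts = urlsplit(url)
    if not parts.path.endswith("/index.php"):
        return False
    query = parse_qs(parts.query)
    return all(key in query for key in ("title", "diff", "oldid"))


def diff_view_hits_by_client(records):
    """Count diff-view requests per (ip, user_agent) pair."""
    hits = Counter()
    for ip, user_agent, url in records:
        if looks_like_diff_view(url):
            hits[(ip, user_agent)] += 1
    return hits


# Hypothetical log entries, not real traffic.
SAMPLE = [
    ("203.0.113.7", "ExampleBot/1.0", "/wiki/index.php?title=Main&diff=20&oldid=10"),
    ("198.51.100.4", "Mozilla/5.0 Chrome", "/wiki/index.php?title=Main&diff=21&oldid=20"),
    ("198.51.100.4", "Mozilla/5.0 Chrome", "/wiki/index.php?title=FAQ&diff=8&oldid=7"),
    ("192.0.2.9", "Mozilla/5.0 Firefox", "/wiki/Main"),
]

hits = diff_view_hits_by_client(SAMPLE)
```

If a URL pattern that nothing crawled before suddenly shows up from new IPs right after a block, that grouping makes it visible. It is circumstantial evidence, not proof of identity.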

> What non-bot UA do they claim?

Latest Chrome on Windows.

replies(1): >>42553872 #

6. untitaker_ ◴[30 Dec 24 18:38 UTC] No.42552177[source]▶

>>42552122 #

Presumably they switch UA to Mozilla/something but tell on themselves by still using the same IP range or ASN. Unfortunately this has become common practice for feed readers as well.
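A small Python sketch of the IP-range half of that idea, using only the standard library. The networks below are reserved documentation ranges standing in for ranges you have already tied to a crawler, and `is_suspicious` is a hypothetical helper. An ASN check would need an external IP-to-ASN dataset, which this does not cover.

```python
import ipaddress

# Placeholder networks (reserved for documentation), standing in for
# ranges previously observed from a self-identified crawler.
KNOWN_CRAWLER_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def from_known_crawler_range(ip: str) -> bool:
    """True if the address falls inside any of the known networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in KNOWN_CRAWLER_NETWORKS)


def is_suspicious(ip: str, user_agent: str) -> bool:
    """Browser-looking UA arriving from a range already tied to a crawler."""
    claims_browser = user_agent.startswith("Mozilla/")
    return claims_browser and from_known_crawler_range(ip)
```

The `Mozilla/` prefix test is deliberately crude; real browser UAs and many bots both start with it, so the range match is doing the actual work here.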

7. aaroninsf ◴[30 Dec 24 19:44 UTC] No.42552700[source]▶

>>42551541 (TP) #

I instituted `user-agent`-based rate limiting for exactly this reason, exactly this case.

These bots were crushing our search infrastructure (which is tightly coupled to our front end).
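What `user-agent`-based rate limiting can look like, as a minimal Python sketch: one token bucket per UA string. This is an assumed design, not the setup described above; the class name, parameters, and UA strings are hypothetical.

```python
import time
from collections import defaultdict


class UserAgentRateLimiter:
    """Token bucket keyed by User-Agent string."""

    def __init__(self, rate_per_sec: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_sec      # tokens refilled per second
        self.burst = burst            # maximum tokens a bucket can hold
        self.clock = clock            # injectable for testing
        self._tokens = defaultdict(lambda: float(burst))
        self._last_seen = {}

    def allow(self, user_agent: str) -> bool:
        """Refill the bucket for this UA, then spend one token if available."""
        now = self.clock()
        last = self._last_seen.get(user_agent, now)
        self._last_seen[user_agent] = now
        tokens = min(self.burst, self._tokens[user_agent] + (now - last) * self.rate)
        if tokens >= 1:
            self._tokens[user_agent] = tokens - 1
            return True
        self._tokens[user_agent] = tokens
        return False
```

Usage, with a fake clock so the behavior is deterministic:

```python
t = [0.0]
limiter = UserAgentRateLimiter(rate_per_sec=1, burst=2, clock=lambda: t[0])
limiter.allow("ExampleBot/1.0")  # first request from this UA
```

A crawler that rotates its UA string gets a fresh bucket each time, so this only helps against clients that keep a stable UA.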

8. raphman ◴[30 Dec 24 19:56 UTC] No.42552793[source]▶

>>42551976 #

Yeah. Our research group has a wiki with (among other stuff) a list of open, completed, and ongoing bachelor's/master's theses. Until recently, the list was openly available. But AI bots caused significant load by crawling each page hundreds of times, following all links to tags (which are implemented as dynamic searches), prior revisions, etc. For the past few weeks, the pages have been available only to authenticated users.

9. pacifika ◴[30 Dec 24 20:04 UTC] No.42552885[source]▶

>>42551541 (TP) #

So you get all the IPs by rate limiting them?

10. oblio ◴[30 Dec 24 21:26 UTC] No.42553743[source]▶

>>42551976 #

Classic spam all but killed small email hosts, AI spam will kill off the web.

Super sad.

11. loeg ◴[30 Dec 24 21:41 UTC] No.42553872{3}[source]▶

>>42552172 #

Thanks.

12. alphan0n ◴[31 Dec 24 01:47 UTC] No.42555570[source]▶

>>42552122 #

I would take anything the author said with a grain of salt. They straight up lied about the configuration of the robots.txt file.

https://news.ycombinator.com/item?id=42551628

replies(2): >>42563001 #>>42567297 #

13. mplewis ◴[01 Jan 25 00:39 UTC] No.42563001{3}[source]▶

>>42555570 #

What is causing you to be so unnecessarily aggressive?

replies(1): >>42563372 #

14. alphan0n ◴[01 Jan 25 01:57 UTC] No.42563372{4}[source]▶

>>42563001 #

Liars should be called out, necessarily. Intellectual dishonesty is cancer. I could be more aggressive if it were something that really mattered.

replies(1): >>42563585 #

15. nkrisc ◴[01 Jan 25 02:51 UTC] No.42563585{5}[source]▶

>>42563372 #

Lying requires intent to deceive. How have you determined their intent?

replies(2): >>42563774 #>>42563827 #

16. n144q ◴[01 Jan 25 03:34 UTC] No.42563774{6}[source]▶

>>42563585 #

> Lying requires intent to deceive

Since when do we ask people to guess other people's intent when they have better things to show, which is called evidence?

Surely we should talk about things with substantiated matter?

replies(1): >>42565949 #

17. alphan0n ◴[01 Jan 25 03:48 UTC] No.42563827{6}[source]▶

>>42563585 #

When someone says:

> Oh, and of course, they don't just crawl a page once and then move on. Oh, no, they come back every 6 hours because lol why not. They also don't give a single flying fuck about robots.txt, because why should they.

Their self-righteous indignation and specificity of the pretend subject of that indignation precludes any doubt about intent.

This guy made a whole public statement that is verifiably false. And then tried to toddler logic it away when he got called out.

replies(1): >>42565944 #

18. nkrisc ◴[01 Jan 25 13:41 UTC] No.42565944{7}[source]▶

>>42563827 #

That may all be true. That still doesn’t mean they intentionally lied.

replies(1): >>42569607 #

19. nkrisc ◴[01 Jan 25 13:41 UTC] No.42565949{7}[source]▶

>>42563774 #

Because there’s a meaningful difference between being wrong and lying.

There’s evidence the statement was false, no evidence it was a lie.

20. ribadeo ◴[01 Jan 25 17:01 UTC] No.42567297{3}[source]▶

>>42555570 #

How do you know what the contextual configuration of their robots.txt is/was?

Your accusation was directly addressed by the author in a comment on the original post, IIRC.

I find your attitude as expressed here to be problematic in many ways.

replies(1): >>42569521 #

21. alphan0n ◴[01 Jan 25 21:37 UTC] No.42569521{4}[source]▶

>>42567297 #

CommonCrawl archives robots.txt

For convenience, you can view the extracted data here:

https://pastebin.com/VSHMTThJ

You are welcome to verify for yourself by searching for “wiki.diasporafoundation.org/robots.txt” in the CommonCrawl index here:

https://index.commoncrawl.org/

The index contains a file name that you can append to the CommonCrawl url to download the archive and view.
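A rough Python sketch of that step. The base URL and the `filename`, `offset`, and `length` field names reflect my understanding of CommonCrawl's JSON index output and should be checked against the get-started page linked below; the sample record is entirely a placeholder, not a real index entry.

```python
import json

# Assumed base URL for CommonCrawl data downloads.
DATA_BASE_URL = "https://data.commoncrawl.org/"


def range_request_for(index_line: str):
    """Turn one JSON line from the index into (url, headers) for a ranged GET
    that fetches only the bytes of that single archived record."""
    record = json.loads(index_line)
    offset = int(record["offset"])
    length = int(record["length"])
    url = DATA_BASE_URL + record["filename"]
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return url, headers


# Hypothetical index line: filename, offset, and length are placeholders.
EXAMPLE_LINE = json.dumps({
    "url": "https://wiki.diasporafoundation.org/robots.txt",
    "filename": "crawl-data/EXAMPLE/robotstxt/example.warc.gz",
    "offset": "1000",
    "length": "500",
})

url, headers = range_request_for(EXAMPLE_LINE)
```

The bytes returned for such a range should be a gzip-compressed WARC record, which then needs decompressing before the captured headers and body are readable.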

More detailed information on downloading archives here:

https://commoncrawl.org/get-started

From September to December, the robots.txt at wiki.diasporafoundation.org contained this, and only this:

>User-agent: *
>Disallow: /w/
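To see how a standards-following crawler would read those two lines, here is a short Python sketch using the standard library's `urllib.robotparser`. The bot name, page title, and revision ids are placeholders; the `/wiki/index.php?...&diff=...&oldid=...` shape is the diff-view pattern mentioned earlier in the thread. This only shows how Python's parser interprets this exact file content, nothing about what the file contained at other times.

```python
from urllib.robotparser import RobotFileParser

# The two lines quoted above.
ROBOTS_LINES = ["User-agent: *", "Disallow: /w/"]

parser = RobotFileParser()
parser.parse(ROBOTS_LINES)

BASE = "https://wiki.diasporafoundation.org"
# Diff-view URL shape from the thread; title and ids are placeholders.
diff_view = BASE + "/wiki/index.php?title=Example&diff=2&oldid=1"
# Hypothetical path under the disallowed prefix.
under_w = BASE + "/w/index.php?title=Example"

diff_view_allowed = parser.can_fetch("ExampleBot", diff_view)
under_w_allowed = parser.can_fetch("ExampleBot", under_w)
```

With this content, the `/w/` prefix is disallowed for every agent, while paths under `/wiki/` do not match the rule.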

Apologies for my attitude, I find defenders of the dishonest in the face of clear evidence even more problematic.

replies(1): >>42574617 #

22. alphan0n ◴[01 Jan 25 21:51 UTC] No.42569607{8}[source]▶

>>42565944 #

What are the criteria for an intentional lie, then? Admission?

The author responded:

>denschub 2 days ago [–]

>the robots.txt on the wiki is no longer what it was when the bot accessed it. primarily because I clean up my stuff afterwards, and the history is now completely inaccessible to non-authenticated users, so there's no need to maintain my custom robots.txt

Which is verifiably untrue:

HTTP/1.1 200
server: nginx/1.27.2
date: Tue, 10 Dec 2024 13:37:20 GMT
content-type: text/plain
last-modified: Fri, 13 Sep 2024 18:52:00 GMT
etag: W/"1c-62204b7e88e25"
alt-svc: h3=":443", h2=":443"
X-Crawler-content-encoding: gzip
Content-Length: 28

User-agent: *
Disallow: /w/
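A quick consistency check on that capture, as a Python sketch. It assumes the body is the two lines shown, each ending in a single `\n`. It also treats the hex prefix of the etag as a size field, which is an assumption about how the server builds its etags rather than something the capture states.

```python
# Body as quoted above, assuming a "\n" after each line.
ROBOTS_BODY = b"User-agent: *\nDisallow: /w/\n"

# Header values copied from the capture.
HEADERS = {"Content-Length": "28", "etag": 'W/"1c-62204b7e88e25"'}

body_length = len(ROBOTS_BODY)
declared_length = int(HEADERS["Content-Length"])

# Text between the quotes of the etag, up to the first dash, read as hex.
etag_first_field = HEADERS["etag"].split('"')[1].split("-")[0]
etag_first_field_value = int(etag_first_field, 16)
```

Under those assumptions, the two-line body accounts for every byte the `Content-Length` header declares, so there is no room for additional rules in that response.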

replies(1): >>42584946 #

23. shkkmo ◴[02 Jan 25 14:19 UTC] No.42574617{5}[source]▶

>>42569521 #

Your attitude is inappropriate and violates the sitewide guidelines for discussion.

replies(1): >>42583213 #

24. alphan0n ◴[03 Jan 25 06:50 UTC] No.42583213{6}[source]▶

>>42574617 #

There are currently two references to “Mangion-ing” OpenAI board members in this thread, several more from Reddit, based on the falsehoods being perpetrated by the author. Is this really someone you want to conspire with? Is calling this out more egregious than the witch hunt being organized here?

replies(1): >>42587233 #

25. nkrisc ◴[03 Jan 25 12:09 UTC] No.42584946{9}[source]▶

>>42569607 #

> intentional lie

There are no “intentional” lies, because there are no “unintentional” lies.

All lies are intentional. An “unintentional lie” is better known as “being wrong”.

Being wrong isn’t always lying. What’s so hard about this? An example:

My wife once asked me if I had taken the trash out to the curb, and I said I had. This was demonstrably false, anyone could see I had not. Yet for whatever reason, I mistakenly believed that I had done it. I did not lie to her. I really believed I had done it. I was wrong.

replies(1): >>42593069 #

26. shkkmo ◴[03 Jan 25 16:53 UTC] No.42587233{7}[source]▶

>>42583213 #

"conspire" and "witch hunt", are not terms of productive discourse.

If you are legitimately trying to correct misinformation, your attitude, tone and language are counterproductive. You would be much better served by taking that energy and crafting an actually persuasive argument. You come across as unreasonable and unwilling to listen, not someone with a good grasp of the technical specifics.

I don't have a horse in the race. I'm fairly technical, but I did not find your arguments persuasive. This doesn't mean they are wrong, but it does mean that you didn't do a good job of explaining them.

27. alphan0n ◴[04 Jan 25 07:09 UTC] No.42593069{10}[source]▶

>>42584946 #

No worries, I understand. The author admitted to me that he was lying via DM, that he often does this for attention.