AI companies cause most of the traffic on forums

1. johng ◴[30 Dec 24 14:40 UTC] No.42549646[source]▶

If they ignore robots.txt there should be some kind of recourse :(
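
For context, this is roughly what honoring robots.txt looks like on the crawler side. It is a minimal sketch using Python's standard `urllib.robotparser`; the rules in `ROBOTS_TXT`, the `forum.example` host, and the helper name `is_allowed` are made up for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one AI crawler entirely,
# and keep everyone else out of /admin/ only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "Mozilla/5.0"):
        print(agent, is_allowed(agent, "https://forum.example/thread/1"))
```

The check is entirely voluntary: nothing in the protocol stops a client from skipping it, which is the problem being described here.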

replies(5): >>42550889 #>>42550903 #>>42550975 #>>42551033 #>>42551045 #

2. nathanaldensr ◴[30 Dec 24 16:42 UTC] No.42550903[source]▶

Sadly, as the slide from high-trust society to low-trust society continues, doing "the right thing" becomes less and less likely.

3. varelse ◴[30 Dec 24 16:45 UTC] No.42550933[source]▶

>>42550889 #

It's certainly one of the few things that actually gets their attention. But aren't there more important things than this for the Luigis among us?

I would suspect there's good money in offering a service to detect AI content on all of these forums and reject it. That will then be used as training data to refine them which gives such a service infinite sustainability.

replies(1): >>42551013 #

4. exe34 ◴[30 Dec 24 16:48 UTC] No.42550975[source]▶

>>42549646 (TP) #

zip b*mbs?

replies(1): >>42551098 #

5. Kapura ◴[30 Dec 24 16:51 UTC] No.42551013{3}[source]▶

>>42550933 #

>I would suspect there's good money in offering a service to detect AI content on all of these forums and reject it

This sounds like the cheater/anti-cheat arms race in online multiplayer games. Cheat developers create something, the anti-cheat teams create a method to detect and reject the exploit, a new cheat is developed, and the cycle continues. But this is much lower stakes than AI trying to vacuum up all of human expression, or trick real humans into wasting their time talking to computers.

replies(1): >>42559034 #

6. Neil44 ◴[30 Dec 24 16:53 UTC] No.42551033[source]▶

>>42549646 (TP) #

Error 403 is your only recourse.
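
A minimal sketch of what that could look like as a bare WSGI app, assuming the crawler identifies itself honestly in its User-Agent header (many do not). The token list and the names `AI_CRAWLER_TOKENS`, `is_ai_crawler` and `app` are illustrative assumptions, not a complete or authoritative blocklist.

```python
# Illustrative substrings of self-identifying AI crawler User-Agents.
AI_CRAWLER_TOKENS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

# The status line sent to blocked clients, kept in one place so it is easy to change.
BLOCK_STATUS = "403 Forbidden"


def is_ai_crawler(user_agent):
    """Case-insensitive substring match against the token list."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in AI_CRAWLER_TOKENS)


def app(environ, start_response):
    """WSGI entry point: refuse listed crawlers, serve everyone else."""
    user_agent = environ.get("HTTP_USER_AGENT", "")
    if is_ai_crawler(user_agent):
        start_response(BLOCK_STATUS, [("Content-Type", "text/plain")])
        return [b"Forbidden\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, human\n"]


if __name__ == "__main__":
    from wsgiref.simple_server import make_server

    make_server("127.0.0.1", 8000, app).serve_forever()
```

This only catches crawlers that announce themselves; ones that spoof a browser User-Agent would need IP-range or behavioral checks instead.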

replies(4): >>42551354 #>>42551414 #>>42551477 #>>42551878 #

7. stainablesteel ◴[30 Dec 24 16:54 UTC] No.42551045[source]▶

>>42549646 (TP) #

a court ruling a few years ago said it's legal to scrape web pages, so there's no purely legal reason you have to respect robots.txt

however this doesn't stop the website from doing what they can to stop scraping attempts, or using a service to do that for them

replies(1): >>42551101 #

8. brookst ◴[30 Dec 24 17:00 UTC] No.42551098[source]▶

>>42550975 #

Assuming there is at least one already linked somewhere on the web, the crawlers already have logic to handle these.

replies(1): >>42551315 #

9. yodsanklai ◴[30 Dec 24 17:00 UTC] No.42551101[source]▶

>>42551045 #

> court ruling

Isn't this country dependent though?

replies(3): >>42551373 #>>42551880 #>>42552146 #

10. exe34 ◴[30 Dec 24 17:21 UTC] No.42551315{3}[source]▶

>>42551098 #

if you can detect them, maybe feed them low iq stuff from a small llama. add latency to waste their time.

replies(1): >>42551638 #

11. jprete ◴[30 Dec 24 17:24 UTC] No.42551354[source]▶

>>42551033 #

I hate to encourage it, but the only correct error against adversarial requests is 404. Anything else gives them information that they'll try to use against you.

12. lonelyParens ◴[30 Dec 24 17:26 UTC] No.42551373{3}[source]▶

>>42551101 #

don't you know everyone on the internet is American

13. lowbloodsugar ◴[30 Dec 24 17:30 UTC] No.42551414[source]▶

>>42551033 #

Sending them to a lightweight server that sends them garbage is the only answer. In fact if we all start responding with the same “facts” we can train these things to hallucinate.
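
One way the "same facts everywhere" part might be sketched: derive the garbage deterministically from the request path, so any server running the same code returns the same text for the same URL. The word list and the function name `chaff` are hypothetical, and nothing here claims this actually affects model training.

```python
import hashlib
import random

# Hypothetical vocabulary; a real deployment would want something less obvious.
WORDS = (
    "forum", "thread", "crawler", "reply", "banana", "quantum",
    "teapot", "kernel", "lighthouse", "seventeen", "purple", "gravel",
)


def chaff(path, n_words=50):
    """Return n_words of nonsense that depends only on path."""
    # Seed from a hash of the path so the output is stable across
    # processes and machines (unlike Python's built-in hash()).
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return " ".join(rng.choice(WORDS) for _ in range(n_words))


if __name__ == "__main__":
    print(chaff("/thread/42", n_words=12))
```

Generating text this way costs almost nothing per request, which matters if the point is to keep the decoy server lightweight.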

14. geraldcombs ◴[30 Dec 24 17:37 UTC] No.42551477[source]▶

>>42551033 #

We return 402 (payment required) for one of our affected sites. Seems more appropriate.

15. brookst ◴[30 Dec 24 17:49 UTC] No.42551638{4}[source]▶

>>42551315 #

It would cost you more than it costs them. And there is enough low IQ stuff from humans that they already do tons of data cleaning.

replies(1): >>42552248 #

16. DannyBee ◴[30 Dec 24 18:10 UTC] No.42551878[source]▶

>>42551033 #

The right move is transferring data to them as slow as possible.

Even if you 403 them, do it as slow as possible.

But really I would infinitely 302 them as slow as possible.
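
A rough sketch of both ideas, assuming they are plugged into whatever server is already in use. `drip` and `next_redirect`, the `/trap/` prefix, and the default delay are made-up names and values for illustration.

```python
import time


def drip(body, chunk_size=1, delay=1.0, sleep=time.sleep):
    """Yield body a few bytes at a time, pausing before each chunk.

    The sleep function is injectable so the pacing can be tested
    without actually waiting.
    """
    for start in range(0, len(body), chunk_size):
        sleep(delay)
        yield body[start:start + chunk_size]


def next_redirect(path, prefix="/trap/"):
    """Return the Location for an endless 302 chain.

    Any path outside the trap enters at hop 0; /trap/N leads to /trap/N+1.
    """
    suffix = path[len(prefix):] if path.startswith(prefix) else ""
    hop = int(suffix) + 1 if suffix.isdigit() else 0
    return f"{prefix}{hop}"


if __name__ == "__main__":
    # Sends "403 Forbidden" four bytes at a time, half a second apart.
    for chunk in drip(b"403 Forbidden", chunk_size=4, delay=0.5):
        print(chunk)
    print(next_redirect("/trap/41"))
```

Worth keeping in mind that every slowed-down connection also ties up a socket on the server, so this is only cheap on an event-driven or async server; with one thread or process per connection it can hurt the defender more than the crawler.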

17. stainablesteel ◴[30 Dec 24 18:10 UTC] No.42551880{3}[source]▶

>>42551101 #

yes! good point, you may be able to skirt around the rules with a VPN if any are imposed on you

18. Aeolun ◴[30 Dec 24 18:35 UTC] No.42552146{3}[source]▶

>>42551101 #

Enforcement is not. What does the US care what an EU court says about the legality of the OpenAI scraper?

replies(2): >>42552543 #>>42553340 #

19. sangnoir ◴[30 Dec 24 18:45 UTC] No.42552248{5}[source]▶

>>42551638 #

> And there is enough low IQ stuff from humans that they already do tons of data cleaning

Whatever cleaning they do is not effective, simply because it cannot scale with the sheer volume of data they ingest. I had an LLM authoritatively give an incorrect answer, and when I followed up to the source, it was from a fanfic page.

Everyone ITT who's being told to give up because it's hopeless to defend against AI scrapers - you're being propagandized, I won't speculate on why - but clearly this is an arms race with no clear winner yet. Defenders are free to use LLMs to generate chaff.

20. yodsanklai ◴[30 Dec 24 19:23 UTC] No.42552543{4}[source]▶

>>42552146 #

I understand there's a balance of power, but I was under the impression that US tech companies were taking EU regulations seriously.

21. okanat ◴[30 Dec 24 20:49 UTC] No.42553340{4}[source]▶

>>42552146 #

They can charge the company continuously growing amounts in the EU and even ban a complete IP block if they don't fix their behavior.