597 points classichasclass | 13 comments
bob1029 ◴[] No.45011628[source]
I think a lot of really smart people are letting themselves get taken for a ride by the web scraping thing. Unless the bot activity is legitimately hammering your site and causing issues (not saying this isn't happening in some cases), then this mostly amounts to an ideological game of capture the flag. The difference being that you'll never find their flag. The only thing you win by playing is lost time.

The best way to mitigate the load from diffuse, unidentifiable, grey-area participants is to have a fast and well-engineered web product. This is good news, because your actual human customers would really enjoy this too.

replies(7): >>45011652 #>>45011830 #>>45011850 #>>45012424 #>>45012462 #>>45015038 #>>45015451 #
phito ◴[] No.45011652[source]
My friend has a small public gitea instance, only used by him and a few friends. He's getting thousands of requests an hour from bots. I'm sorry, but even if it does not impact his service, at the very least it feels like harassment.
replies(7): >>45011694 #>>45011816 #>>45011999 #>>45013533 #>>45013955 #>>45014807 #>>45025114 #
kiitos ◴[] No.45014807[source]
every single IPv4 address in existence receives constant malicious traffic, from uncountably many malicious actors, on all common service ports (80, 443, 22, etc.) and, for HTTP specifically, to an enormous and growing number of common endpoints (mostly WordPress related, last I checked)

if you put your server up on the public internet then this is just table stakes stuff that you always need to deal with, doesn't really matter whether the traffic is from botnets or crawlers or AI systems or anything else

you're always gonna deal with this stuff well before the requests ever get to your application, with WAFs or reverse proxies or (idk) fail2ban or whatever else

also 1000 req/hour is around 1 request every 4 seconds, which is statistically 0 rps for any endpoint that would ever be publicly accessible
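
(a quick sketch of the arithmetic above, for anyone who wants to plug in their own numbers; the 1000 req/hour figure is the one from this comment, and the function name is just illustrative)

```python
def per_hour_to_rate(requests_per_hour: float) -> tuple[float, float]:
    """Return (requests per second, seconds between requests)."""
    rps = requests_per_hour / 3600  # 3600 seconds in an hour
    return rps, 1 / rps


rps, gap = per_hour_to_rate(1000)
print(f"{rps:.2f} rps, one request every {gap:.1f} s")
# prints "0.28 rps, one request every 3.6 s"
```

so the exact figure is closer to one request every 3.6 seconds, which rounds to the "every 4 seconds" above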

replies(2): >>45015080 #>>45015487 #
1. NegativeK ◴[] No.45015080[source]
I've heard this point raised elsewhere, and I think it's underplaying the magnitude of the issue.

Background scanner noise on the internet is incredibly common, but the AI scraping is not at the same level. Wikipedia has published that their infrastructure costs have notably shot up since LLMs started scraping them. I've seen similar idiotic behavior on a small wiki I run; a single AI company took the data usage from "who gives a crap" to "this is approaching the point where I'm not willing to pay to keep this site up." Businesses can "just" pass the costs onto the customers (which is pretty shit at the end of the day,) but a lot of privately run and open source sites are now having to deal with side crap that isn't relevant to their focus.

The botnets and DDOS groups that are doing mass scanning and testing are targeted by law enforcement and eventually (hopefully) taken down, because what they're doing is acknowledged as bad.

AI companies, however, are trying to make a profit off of this bad behavior and we're expected to be okay with it? At some point impacting my services with your business behavior goes from "it's just the internet being the internet" to willfully malicious.

replies(3): >>45015235 #>>45018955 #>>45045233 #
2. kiitos ◴[] No.45015235[source]
this is a completely fair point, it may be the case that AI scraper bots have recently made the magnitude and/or details of unwanted bot traffic to public IP addresses much worse

but yeah the issue is that as long as you have something accessible to the public, it's ultimately your responsibility to deal with malicious/aggressive traffic

> At some point impacting my services with your business behavior goes from "it's just the internet being the internet" to willfully malicious.

I think maybe the current AI scraper traffic patterns are actually what "the internet being the internet" is from here forward

replies(1): >>45059613 #
3. 0x457 ◴[] No.45018955[source]
So weird to scrape wikipedia when you can just download a db dump from them.
replies(2): >>45019746 #>>45029833 #
4. xp84 ◴[] No.45019746[source]
Really makes you think about the calibre of minds being applied to buzzy problem spaces these days, doesn't it?
replies(1): >>45020373 #
5. socalgal2 ◴[] No.45020373{3}[source]
do we know they didn't download the DB? Maybe the new traffic is the LLM reading the site? (not the training)

I don't know that LLMs read sites. I only know when I use one it tells me it's checking site X, Y, Z, thinking about the results, checking sites A, B, C etc.... I assumed it was actually reading the site on my behalf and not just referring to its internal training knowledge.

Like, how are people training LLMs, and how often does each one scrape? From the outside, it feels like the big ones (ChatGPT, Gemini, Claude, etc.) scrape only a few times a year at most.

replies(1): >>45031480 #
6. nitwit005 ◴[] No.45029833[source]
When you have a pile of funding, and you get told to do things quickly.
replies(1): >>45031767 #
7. xp84 ◴[] No.45031480{4}[source]
I would guess site operators can tell the difference between an exhaustive crawl and the targeted specific traffic I'd expect to see from an LLM checking sources on-demand. For one thing, the latter would have time-based patterns attributable to waking hours in the relevant parts of the world, whereas the exhaustive crawl traffic would probably be pretty constant all day and night.
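
A rough sketch of what that check could look like, assuming you already have request timestamps (Unix seconds) pulled out of your access logs. The function name and the choice of metric are my own illustration, not something any particular tool does: it buckets requests by hour of day (UTC) and reports how uneven the buckets are.

```python
from collections import Counter
from datetime import datetime, timezone
from statistics import mean, pstdev


def hourly_variation(timestamps):
    """Coefficient of variation of request counts across the 24 hours of the day (UTC).

    Near 0 means traffic is flat around the clock; larger values mean it is
    concentrated in some hours, as human-driven traffic tends to be.
    """
    counts = Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).hour for ts in timestamps
    )
    per_hour = [counts.get(h, 0) for h in range(24)]
    avg = mean(per_hour)
    return pstdev(per_hour) / avg if avg else 0.0


# Synthetic data (an assumption for illustration): one request per minute.
flat = range(0, 24 * 3600, 60)             # all day and night
daytime = range(8 * 3600, 20 * 3600, 60)   # only 08:00-20:00 UTC

print(hourly_variation(flat))     # 0.0
print(hourly_variation(daytime))  # 1.0
```

Where to draw the line between "flat" and "diurnal" is a judgment call that depends on the site and its audience's time zones, so treat the number as a hint rather than a classifier.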

Also to be clear I doubt those big guys are doing these crawls. I assume it's small startups who think they're gonna build a big dataset to sell or to train their own model.

8. 0x457 ◴[] No.45031767{3}[source]
But the correct way (getting a sql dump) is faster?
replies(1): >>45033424 #
9. nitwit005 ◴[] No.45033424{4}[source]
Had to get the web scraper working for other websites.
10. BlueTemplar ◴[] No.45045233[source]
From your example (and many others), AI companies are engaging in DDoS too, so why wouldn't law enforcement target them too?
replies(1): >>45059599 #
11. NegativeK ◴[] No.45059599[source]
As a first and very pessimistic guess, the pages getting DoSed are maintained by people or groups with pretty minimal resources. That means time or money available for lawyers isn't there, and the monetary impact per website is small enough that LE may not care.

Also, they might share the common viewpoint of "it's the internet; suck it up."

12. NegativeK ◴[] No.45059613[source]
> I think maybe the current AI scraper traffic patterns are actually what "the internet being the internet" is from here forward

Kinda my point was that it's only the internet being the internet if we tolerate it. If enough people give a crap, the corporations doing it will have to knock it off.

replies(1): >>45069626 #
13. kiitos ◴[] No.45069626{3}[source]
i appreciate the sentiment but no amount of people giving a crap will ever impact the stuff we're talking about here, because the stuff we're talking about here is in no way governed or influenced by popular opinion or anything even remotely adjacent to popular opinion

if you wanna rage against the machine then more power to you but this line of thinking is dead on arrival in terms of outcome