The best way to mitigate the load from diffuse, unidentifiable, grey area participants is to have a fast and well engineered web product. This is good news, because your actual human customers would really enjoy this too.
In that regard, reading my logs has sometimes led me to interesting articles about cyber security. Also, log flooding can cause your journaling service to truncate the log, so you miss something important.
If this is actually impacting perceived QoS then I think a gitea bug report would be justified. Clearly there's been some kind of a performance regression.
Just looking at the logs seems to be an infohazard for many people. I don't see why you'd want to inspect the septic tanks of the internet unless absolutely necessary.
The bonus is my actual customers get the same benefits and don't notice any material loss from my content _not_ being scraped. How you see this as me being secretly taken advantage of is completely beyond me.
That's not much for any modern server so I genuinely don't understand the frustration. I'm pretty certain gitea should be able to handle thousands of read requests per minute (not per hour) without even breaking a sweat.
I wonder what all those people are doing that their server can't handle the traffic. Wouldn't a simple IP-based rate limit be sufficient? I only pay $1 per month for my VPS, and even that piece of trash can handle 1000s of requests per second.
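For what it's worth, a per-IP limit is only a handful of lines in front of most app servers. Here's a rough sketch in Go using golang.org/x/time/rate; the numbers and the IP keying are just illustrative, not anything gitea ships with:

    package main

    import (
        "net"
        "net/http"
        "sync"

        "golang.org/x/time/rate"
    )

    // perIPLimiter hands out one token bucket per client IP.
    type perIPLimiter struct {
        mu      sync.Mutex
        buckets map[string]*rate.Limiter
    }

    func (p *perIPLimiter) get(ip string) *rate.Limiter {
        p.mu.Lock()
        defer p.mu.Unlock()
        l, ok := p.buckets[ip]
        if !ok {
            // ~5 requests/second with a burst of 20 per IP; illustrative numbers.
            l = rate.NewLimiter(5, 20)
            p.buckets[ip] = l
        }
        return l
    }

    // limit rejects over-budget requests with a 429 before they reach the app.
    func (p *perIPLimiter) limit(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            ip, _, err := net.SplitHostPort(r.RemoteAddr)
            if err != nil {
                ip = r.RemoteAddr
            }
            if !p.get(ip).Allow() {
                w.Header().Set("Retry-After", "10")
                http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
                return
            }
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        limiter := &perIPLimiter{buckets: map[string]*rate.Limiter{}}
        // Whatever actually serves the site would be registered on DefaultServeMux.
        http.ListenAndServe(":8080", limiter.limit(http.DefaultServeMux))
    }

In practice you'd put this (or the equivalent reverse-proxy config) in front of the app rather than inside it, but the idea is the same.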
Yeah, this is beyond irresponsible. You know the moment you're pwned, __you__ become the new interesting story?
For everyone else, use a password manager to pick a random password for everything.
Depends on the computational cost per request. If you're serving static content from memory, 10k/s sounds easy. If you constantly have to calculate diffs across ranges of commits, I imagine a couple dozen can bring your box down.
Also: who's your webhost? $1/m sounds like a steal.
plaintextPassword = POST["password"]
ok = bcryptCompare(hashedPassword, plaintextPassword)
// (now throw away POST and plaintextPassword)
if (ok) { ... }
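If it helps, here's roughly the same flow as runnable Go using golang.org/x/crypto/bcrypt; findUser is a made-up lookup helper, and the sketch also folds in the dummy-hash compare from the bonus note just below:

    package login

    import "golang.org/x/crypto/bcrypt"

    // findUser is a stand-in for whatever user lookup the app does; it returns
    // the stored bcrypt hash and whether the user exists.
    func findUser(username string) (hashedPassword string, found bool) {
        return "", false // placeholder
    }

    // dummyHash is what gets compared when the user is unknown, so unknown and
    // known usernames take roughly the same time (partial timing-attack mitigation).
    var dummyHash, _ = bcrypt.GenerateFromPassword([]byte("not-a-real-password"), bcrypt.DefaultCost)

    func checkLogin(username, plaintextPassword string) bool {
        hashed, found := findUser(username)
        if !found {
            // Burn the same bcrypt cost as a real comparison, then reject.
            _ = bcrypt.CompareHashAndPassword(dummyHash, []byte(plaintextPassword))
            return false
        }
        // CompareHashAndPassword returns nil only on a match.
        return bcrypt.CompareHashAndPassword([]byte(hashed), []byte(plaintextPassword)) == nil
    }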
Bonus points: on user lookup, when no user is found, fetch a dummy hashedPassword, compare, and ignore the result. This will partially mitigate username enumeration via timing attacks.

I encountered exactly one actual problem: the temporary folder for zip snapshots filled up the disk, since bots followed all the snapshot links and it seems gitea doesn't delete generated snapshots. I made that directory read-only, deleted its contents, and the problem was solved, at the cost of only breaking zip snapshots.
I experienced no other problems.
I did put some user-agent checks in place a while later, but that was just for fun to see if AI would eventually ingest false information.
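Something along these lines, sketched in Go; the bot substrings and the decoy handler are placeholders for illustration, not what I actually ran:

    package uafilter

    import (
        "net/http"
        "strings"
    )

    // botSubstrings is purely illustrative; real lists change constantly.
    var botSubstrings = []string{"GPTBot", "CCBot", "Bytespider"}

    // withDecoy serves decoy content to matching user agents and the real
    // handler to everyone else.
    func withDecoy(realHandler, decoy http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            for _, s := range botSubstrings {
                if strings.Contains(r.UserAgent(), s) {
                    decoy.ServeHTTP(w, r)
                    return
                }
            }
            realHandler.ServeHTTP(w, r)
        })
    }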
Serving up a page that takes a few dozen db queries is a lot different than serving a static page.
if you put your server up on the public internet then this is just table stakes stuff that you always need to deal with; it doesn't really matter whether the traffic is from botnets or crawlers or AI systems or anything else
you're always gonna deal with this stuff well before the requests ever get to your application, with WAFs or reverse proxies or (idk) fail2ban or whatever else
also 1000 req/hour is around 1 request every 4 seconds, which is statistically 0 rps for any endpoint that would ever be publicly accessible
So unless you're not logging your request path/query string, you're doing something very, very wrong by your own logic :). I can't imagine diagnosing issues with web requests without being given the path + query string. You can diagnose without it, but you're sure not making things easier.
Until AI crawlers chased me off of the web, I ran a couple of fairly popular websites. I just so rarely see anybody including passwords in the URLs anymore that I didn't really consider that as what the commenter was talking about.
There are attackers out there that send SIP/2.0 OPTIONS requests to the GOPHER port, over TCP.
Background scanner noise on the internet is incredibly common, but the AI scraping is not at the same level. Wikipedia has published that their infrastructure costs have notably shot up since LLMs started scraping them. I've seen similar idiotic behavior on a small wiki I run; a single AI company took the data usage from "who gives a crap" to "this is approaching the point where I'm not willing to pay to keep this site up." Businesses can "just" pass the costs onto the customers (which is pretty shit at the end of the day,) but a lot of privately run and open source sites are now having to deal with side crap that isn't relevant to their focus.
The botnets and DDOS groups that are doing mass scanning and testing are targeted by law enforcement and eventually (hopefully) taken down, because what they're doing is acknowledged as bad.
AI companies, however, are trying to make a profit off of this bad behavior and we're expected to be okay with it? At some point impacting my services with your business behavior goes from "it's just the internet being the internet" to willfully malicious.
One time, sure. But unauthenticated requests would surely be cached, authenticated ones skip the cache (just like HN works :) ), as most internet-facing websites end up using this pattern.
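A rough sketch of that pattern in Go, kept deliberately naive (in-memory map, a placeholder "session" cookie name; real setups usually push this into the reverse proxy or CDN):

    package anoncache

    import (
        "net/http"
        "net/http/httptest"
        "sync"
        "time"
    )

    type entry struct {
        body    []byte
        expires time.Time
    }

    var (
        mu    sync.Mutex
        cache = map[string]entry{}
    )

    // cacheAnonymous caches successful GET responses for requests without a
    // session cookie; logged-in users always hit the backend. Response headers
    // are dropped on cache hits to keep the sketch short.
    func cacheAnonymous(next http.Handler, ttl time.Duration) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // "session" is a placeholder cookie name.
            if _, err := r.Cookie("session"); err == nil || r.Method != http.MethodGet {
                next.ServeHTTP(w, r)
                return
            }
            key := r.URL.String()
            mu.Lock()
            e, ok := cache[key]
            mu.Unlock()
            if ok && time.Now().Before(e.expires) {
                w.Write(e.body)
                return
            }
            // Miss: run the real handler against a recorder, then store the body.
            rec := httptest.NewRecorder()
            next.ServeHTTP(rec, r)
            body := rec.Body.Bytes()
            if rec.Code == http.StatusOK {
                mu.Lock()
                cache[key] = entry{body: append([]byte(nil), body...), expires: time.Now().Add(ttl)}
                mu.Unlock()
            }
            w.WriteHeader(rec.Code)
            w.Write(body)
        })
    }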
but yeah the issue is that as long as you have something accessible to the public, it's ultimately your responsibility to deal with malicious/aggressive traffic
> At some point impacting my services with your business behavior goes from "it's just the internet being the internet" to willfully malicious.
I think maybe the current AI scraper traffic patterns are actually what "the internet being the internet" is from here forward
The problem I ran into was that performance was bimodal. We had this one group of users that was lightning fast and the rest were far slower. I chased down a few obvious outliers (that one forum thread with 11,000 replies that some guy leaves open in a browser tab all the time, etc.), but it was still bimodal. Eventually I just changed the application-level code to display known bots as one performance trace and everything else as another.
60% of all requests were known bots. This doesn't even count the random-ass bot that some guy started up at an ISP. Yes, this really happened: we were a paying customer of a company that decided to just conduct a DoS attack on us at 2 PM one afternoon. It took down the website.
Not only that, the bots effectively always got a cached response since they all seemed to love to hammer the same pages. Users never got a cached response, since LRU cache eviction meant the actual discussions with real users were always evicted. There were bots that would just rescrape every page they had ever seen every few minutes. There were bots that would just increase their throughput until the backend app would start to slow down.
There were bots that would run the javascript for whatever insane reason and start emulating users submitting forms, etc.
You probably are thinking "but you got to appear in a search index so it is worth it". Not really. Google's bot was one of the few well behaved ones and would even slow scraping if it saw a spike in the response times. Also we had an employee who was responsible for categorizing our organic search performance. While we had a huge amount of traffic from organic search, it was something like 40% to just one URL.
Retrospectively I'm now aware that a bunch of this was early stage AI companies scraping the internet for data.
It blocks a lot of bots, but I feel like just running on a high port number (10,000+) would likely do better.
Google has invested decades of core research with an army of PhDs into its crawler, particularly around figuring out when to recrawl a page. For example (a bit dated, but you can follow the refs if you're interested):
https://www.niss.org/sites/default/files/Tassone_interface6....
Saying “just cache this” is not sustainable. And this is only one repository; the only reasonable way to deal with this is some sort of traffic mitigation. You cannot just treat the traffic as the happy path.
You're absolutely right. That's my mistake — you are requesting a specific version of WordPress, but I had written a Rails app. I've rewritten the app as a WordPress plugin and deployed it. Let me know if there's anything else I can do for you.
We also had a period where we generated bad URLs for a week or two, and the worst part was I think they were on links marked nofollow. Three years later there was a bot still trying to load those pages.
And if you 429 Google’s bots they will reduce your pagerank. That’s straight up extortion from a company that also sells cloud services.
I don’t agree with you about Google being well behaved. They were following nofollow links, and they’re also terrible if you’re serving content on vanity URLs. Any throttling they do on one domain name just hits two more.
if i'm understanding you correctly, you had an indexable page that contained links with the nofollow attribute on the <a> tags.
It's possible some other mechanism got those URLs into the crawler like a person visiting them? Nofollow on the link won't prevent the URL from being crawled or indexed. If you're returning a 404 for them, you ought to be able to use webmaster tools or whatever it's called now, to request removal.
They were meant to be interactive URLs on search pages. Someone implemented them, I think trying to make a11y work, but the bots were slamming us. We also weren't doing canonical URLs right on the destination page, so they got searched again every scan cycle. So at least three dumb things were going on, but the sorts of mistakes that normal people could make.
I don't know that LLMs read sites. I only know that when I use one, it tells me it's checking sites X, Y, Z, thinking about the results, checking sites A, B, C, etc. I assumed it was actually reading the sites on my behalf and not just referring to its internal training knowledge.
Like, how are people training LLMs, and how often does each one scrape? From the outside, it feels like the big ones (ChatGPT, Gemini, Claude, etc.) scrape only a few times a year at most.
(One other thing is that the "tell me without telling me" thing is an internet trope and the site guidelines ask people to avoid those - they tend to make for unsubstantive comments, plus they're repetitive and we're trying to avoid that here. But I just mention this for completeness - it's secondary to the other point.)
I'd just add one other thing: there's one word in your post here which packs a huge amount of meaning, and that's "seemed" (as in "seemed to be coming from a place [etc.]"). I can't tell you how often it happens that what seems one way to one user—even when the "seems" seems overwhelmingly likely, as in near-impossible that it could be any other way—turns out to simply be mistaken, or at least to seem quite opposite to the other person. It's thousands of times easier to make a mistake in this way than people realize; and unfortunately the cost can be quite high when that happens because the other person often feels indignant ("how dare you assume that I [etc.]").
In the present case, I don't know anything about the experience level of the user who posted https://news.ycombinator.com/item?id=45011628, but https://news.ycombinator.com/item?id=45011442 was definitely posted by someone who has managed heavy-duty web facing services, and that comment says more or less the same thing as the other one.
btw you don't get dropped if you issue temporary 429s, only when it's consistent and/or the site is broken. that is well documented. and wtf else are they supposed to do if you don't allow them to crawl it and it goes stale?
Also to be clear I doubt those big guys are doing these crawls. I assume it's small startups who think they're gonna build a big dataset to sell or to train their own model.
Six of one, .008 of a dozen of the other.
Also, they might share the common viewpoint of "it's the internet; suck it up."
Kinda my point was that it's only the internet being the internet if we tolerate it. If enough people give a crap, the corporations doing it will have to knock it off.
if you wanna rage against the machine then more power to you but this line of thinking is dead on arrival in terms of outcome