
252 points by lgats | 5 comments

I have been struggling with a bot ('Mozilla/5.0 (compatible; crawler)') coming from AWS Singapore that has been sending an absurd number of requests to a domain of mine, averaging over 700 requests/second for several months now. Thankfully, Cloudflare is able to handle the traffic with a simple WAF rule and a 444 response, which keeps the outbound traffic down.
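
(For reference, 444 is the nginx convention for closing the connection without sending any response at all. A rough origin-side sketch of that same "hang up silently" treatment in Python - purely illustrative, not my actual Cloudflare setup:)

    # Illustrative only: mimic nginx's non-standard 444 ("close the connection
    # without responding") for the offending User-Agent.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    BAD_UA = "Mozilla/5.0 (compatible; crawler)"

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.headers.get("User-Agent", "") == BAD_UA:
                # Send no status line, no headers, no body; just hang up.
                self.close_connection = True
                return
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok\n")

    if __name__ == "__main__":
        HTTPServer(("", 8080), Handler).serve_forever()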

I've submitted several complaints to AWS to get this traffic to stop; their typical follow-up is: "We have engaged with our customer, and based on this engagement have determined that the reported activity does not require further action from AWS at this time."

I've tried various 4XX responses to see if the bot will back off, and I've tried 30X redirects (which it follows), all to no avail.

The traffic is hitting numbers that require me to renegotiate my contract with Cloudflare, and it is otherwise a nuisance when reviewing analytics/logs.

I've considered redirecting the entirety of the traffic to the AWS abuse report page, but at this scale it's essentially a small DDoS network, and sending it anywhere could be considered abuse in itself.

Are there others who have had a similar experience?

1. shishcat (No.45613884)
if it follows redirects, redirect it to a 10 GB gzip bomb
replies(2): >>45613936, >>45614278
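
For anyone unfamiliar with the trick: the file is pre-compressed, so it is only about 10 MB on disk but expands to roughly 10 GB when a client that honours Content-Encoding: gzip tries to decompress it. A minimal sketch in Python, assuming a bomb.gz output file and a 10 GB uncompressed target:

    # Sketch of the gzip-bomb idea: stream ~10 GB of zeros through gzip so the
    # resulting bomb.gz stays small, then serve that file with the header
    # "Content-Encoding: gzip" so the client does the decompressing.
    import gzip

    CHUNK = b"\0" * (1 << 20)          # 1 MiB of zeros
    with gzip.open("bomb.gz", "wb", compresslevel=9) as f:
        for _ in range(10 * 1024):     # 10 GiB uncompressed in total
            f.write(CHUNK)

Serving it is then just a matter of pointing the redirect at that file with Content-Encoding: gzip set (and whatever Content-Type the crawler expects), so the expansion happens on the crawler's side.
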
2. nake89 (No.45613936)
I was just going to post the same thing. Happy somebody else thought of it :D
replies(1): >>45614115
3. sixtyj (No.45614115)
You nasty ones ;)
4. cantor_S_drug (No.45614278)
https://zadzmo.org/code/nepenthes/

This is a tarpit intended to catch web crawlers. Specifically, it targets crawlers that scrape data for LLMs - but really, like the plants it is named after, it'll eat just about anything that finds its way inside.

It works by generating an endless sequence of pages, each with dozens of links that simply lead back into the tarpit. Pages are randomly generated, but in a deterministic way, causing them to appear to be flat files that never change. Intentional delay is added to prevent crawlers from bogging down your server, in addition to wasting their time. Lastly, Markov-babble is added to the pages, to give the crawlers something to scrape up and train their LLMs on, hopefully accelerating model collapse.
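
A rough sketch of that core idea - not the actual Nepenthes code, just the mechanism described above: seed a PRNG from the request path so every page is stable across visits, sleep a little, and emit babble plus links that lead deeper into the pit (Flask is assumed here, and plain random words stand in for real Markov babble):

    # Sketch only: deterministic "infinite" pages in the spirit of a tarpit.
    import hashlib
    import random
    import time

    from flask import Flask   # assumed; any web framework would do

    app = Flask(__name__)
    WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "lattice", "crawler", "entropy"]

    @app.route("/tarpit/", defaults={"path": ""})
    @app.route("/tarpit/<path:path>")
    def tarpit(path):
        # Same URL -> same seed -> same page, so it looks like a static file.
        seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
        rng = random.Random(seed)
        time.sleep(rng.uniform(1.0, 5.0))          # the intentional delay
        babble = " ".join(rng.choice(WORDS) for _ in range(200))
        prefix = f"/tarpit/{path}/" if path else "/tarpit/"
        links = "".join(
            f'<a href="{prefix}{rng.randrange(10**6)}">more</a> ' for _ in range(20)
        )
        return f"<html><body><p>{babble}</p>{links}</body></html>"

    if __name__ == "__main__":
        app.run()

(Note that a naive time.sleep in a synchronous worker ties up one of your own threads per stalled request, so at 700 req/s the real thing has to stall crawlers more cheaply than this sketch does.)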

https://news.ycombinator.com/item?id=42725147

Is this a good solution??

replies(1): >>45614552
5. iberator (No.45614552)
Best tarpit ever.