
597 points | classichasclass | 6 comments
bob1029 | No.45011628
I think a lot of really smart people are letting themselves get taken for a ride by the web scraping thing. Unless the bot activity is legitimately hammering your site and causing issues (not saying this isn't happening in some cases), this mostly amounts to an ideological game of capture the flag. The difference being that you'll never find their flag. The only thing you win by playing is lost time.

The best way to mitigate the load from diffuse, unidentifiable, grey area participants is to have a fast and well engineered web product. This is good news, because your actual human customers would really enjoy this too.

replies(7): >>45011652 #>>45011830 #>>45011850 #>>45012424 #>>45012462 #>>45015038 #>>45015451 #
1. threeducks | No.45012424
> The best way to mitigate the load from diffuse, unidentifiable, grey area participants is to have a fast and well engineered web product.

I wonder what all those people are doing that their server can't handle the traffic. Wouldn't a simple IP-based rate limit be sufficient? I only pay $1 per month for my VPS, and even that piece of trash can handle 1000s of requests per second.
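For what it's worth, a minimal sketch of what a per-IP limit can look like, in Go with golang.org/x/time/rate. The 5 req/s rate, the burst of 20, and the lack of eviction for stale IPs are placeholder simplifications, not a recommendation:

    // Minimal per-IP rate limiting middleware sketch.
    // Limits are illustrative; the map grows unbounded (no eviction shown).
    package main

    import (
        "net"
        "net/http"
        "sync"

        "golang.org/x/time/rate"
    )

    type ipLimiter struct {
        mu       sync.Mutex
        limiters map[string]*rate.Limiter
    }

    func newIPLimiter() *ipLimiter {
        return &ipLimiter{limiters: make(map[string]*rate.Limiter)}
    }

    // get returns the limiter for an IP, creating one on first sight.
    func (l *ipLimiter) get(ip string) *rate.Limiter {
        l.mu.Lock()
        defer l.mu.Unlock()
        lim, ok := l.limiters[ip]
        if !ok {
            lim = rate.NewLimiter(5, 20) // 5 requests/second, burst of 20
            l.limiters[ip] = lim
        }
        return lim
    }

    // middleware rejects requests from IPs that exceed their budget.
    func (l *ipLimiter) middleware(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            ip, _, err := net.SplitHostPort(r.RemoteAddr)
            if err != nil {
                ip = r.RemoteAddr
            }
            if !l.get(ip).Allow() {
                http.Error(w, "too many requests", http.StatusTooManyRequests)
                return
            }
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        limiter := newIPLimiter()
        mux := http.NewServeMux()
        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("ok"))
        })
        http.ListenAndServe(":8080", limiter.middleware(mux))
    }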

replies(4): >>45012679 #>>45014295 #>>45020463 #>>45057178 #
2. rollcat | No.45012679
> I only pay $1 per month for my VPS, and even that piece of trash can handle 1000s of requests per second.

Depends on the computational cost per request. If you're serving static content from memory, 10k/s sounds easy. If you constantly have to calculate diffs across ranges of commits, I imagine a couple dozen can bring your box down.
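If the expensive path really is something like a commit-range diff endpoint, one blunt mitigation is capping how many of those run at once, so a burst queues or gets shed instead of flattening the box. A rough sketch in Go; the /diff route, the slot count of 4, and the 503 response are illustrative, not anything a particular forge actually does:

    // Cap concurrent expensive requests with a buffered channel as a semaphore.
    package main

    import "net/http"

    // at most 4 expensive diff computations in flight at once
    var diffSlots = make(chan struct{}, 4)

    func diffHandler(w http.ResponseWriter, r *http.Request) {
        select {
        case diffSlots <- struct{}{}: // grab a slot if one is free
            defer func() { <-diffSlots }()
        default:
            // all slots busy: shed load instead of piling up work
            http.Error(w, "busy, try again later", http.StatusServiceUnavailable)
            return
        }
        // ...the actual commit-range diff would be computed here...
        w.Write([]byte("diff output"))
    }

    func main() {
        http.HandleFunc("/diff", diffHandler)
        http.ListenAndServe(":8080", nil)
    }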

Also: who's your webhost? $1/m sounds like a steal.

replies(1): >>45013992 #
3. immibis | No.45013992
You can sometimes find special (loss-leader) deals in this range on LowEndTalk. Typically you'll have to pay upfront for a block of one or two years.
4. TylerE | No.45014295
The bots start hitting endpoints that do lots of db thrashing, and it's usually ones that are NOT common or recent, so caching won't save you.

Serving up a page that takes a few dozen db queries is a lot different than serving a static page.

5. TZubiri | No.45020463
I'd wager there's a package.json lying somewhere that holds a lot of dependencies.
6. micahdeath | No.45057178
We have some bots that use residential IP blocks (including T-Mobile, AT&T, Verizon, etc.)... When they hit, it's 1 request per IP, but they easily use 1000 IPs. Then we don't see those IPs again for a week or more.