
211 points | CrankyBear | 1 comment
bdefore No.45107084
I created and maintain ProtonDB, a popular Linux gaming resource. I don't do ads, just pay the bills from some Patreon donations.

It's a statically generated React site I deploy on Netlify. About ten days ago I started incurring 30GB of data transfer per day from user agents indicating they're using Prerender. At this pace, that traffic will push me past the 1TB allotted for my plan, so I'm looking at an extra ~$500 USD a month for bandwidth boosters.

I'm gonna try the robots.txt options, but I'm doubtful this will be effective in the long run. Many other options aren't available if I want to continue using a SaaS like Netlify.
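For the robots.txt route, it'd be roughly something like the sketch below; the bot names are just common AI crawlers used as examples, and the real list would come from whatever actually shows up in my logs:

    # robots.txt sketch: block known AI/scraper crawlers, allow everyone else.
    # Bot names are examples only; match against what the logs actually show.
    User-agent: GPTBot
    User-agent: CCBot
    User-agent: ClaudeBot
    User-agent: Bytespider
    User-agent: PerplexityBot
    Disallow: /

    User-agent: *
    Allow: /

Of course that only helps with crawlers that bother to honour robots.txt, which is exactly why I'm doubtful.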

My initial thoughts are to either move to Cloudflare Pages/Workers where bandwidth is unlimited, or make an edge function that parses the user agent and hope it's effective enough. That'd be about $60 in edge function invocations.
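The edge function itself would be small. Here's a rough, untested sketch written against Netlify Edge Functions as I understand them; the exact API, the user-agent patterns, and whether blocked requests still count toward bandwidth are all things I'd need to verify:

    // netlify/edge-functions/block-scrapers.ts -- untested sketch
    // Deny requests whose User-Agent matches known scraper strings and let
    // everything else fall through to the static site. Patterns are examples.
    const BLOCKED_UA = /prerender|gptbot|ccbot|claudebot|bytespider/i;

    export default async (request: Request) => {
      const ua = request.headers.get("user-agent") ?? "";
      if (BLOCKED_UA.test(ua)) {
        return new Response("Forbidden", { status: 403 });
      }
      return undefined; // continue to the normal static response
    };

    // Run on every path.
    export const config = { path: "/*" };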

I've got so many better things to do than play whack-a-mole on user agents and, when failing, pay this scraping ransom.

Can I just say fuck all y'all AI harvesters? This is a popular free service that helps get people off of their Microsoft dependency and live their lives on a libre operating system. You wanna leech on that? Fine, download the data dumps I already offer on an ODbL license instead of making me wonder why I fucking bother.

replies(9): >>45107161 #>>45107210 #>>45107270 #>>45107274 #>>45107832 #>>45107839 #>>45108629 #>>45108648 #>>45134051 #
immibis No.45107210
$500 for exceeding 1TB? The problem here isn't the crawlers, it's your price-gouging, extortionate hosting plan. Pick your favourite $5/month VPS platform: I suggest Hetzner with its 20TB limit (if their KYC process lets you in), or Digital Ocean if not (only 1TB, but overage is just a few bucks extra). Even freaking AWS, known for extremely high prices, is cheaper than that (but still too expensive, so don't use it).
replies(4): >>45107316 #>>45107326 #>>45107393 #>>45107812 #
snerbles No.45107812
> The problem here isn't the crawlers, it's your price-gouging, extortionate hosting plan.

No, it's both.

The crawlers are lazy, apparently have no caching, and there is no immediately obvious way to instruct/force them to grab pages in a bandwidth-efficient manner. That said, I would not be surprised if someone here smugly contradicts me with instructions on how to do just that.

In the near term, if I were hosting such a site, I'd be looking into slimming down every byte I could manage, using fingerprinting to serve slim pages to the bots, and exploring alternative hosting/CDN options.
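For the "slim pages to the bots" part, the same kind of user-agent fingerprinting could short-circuit crawlers with a tiny static page instead of the full app shell. A rough sketch, again assuming a Netlify-style edge function, with placeholder patterns and markup:

    // Untested sketch: crawlers get a minimal HTML page, humans get the real site.
    // The regex is deliberately broad here and would need tuning (it would also
    // catch legitimate search engine bots).
    const CRAWLER_UA = /prerender|bot|crawler|spider/i;

    const SLIM_PAGE = [
      "<!doctype html>",
      "<html><head><title>ProtonDB</title></head>",
      "<body><p>Crawler-facing page. Data dumps are available under the ODbL license.</p></body></html>",
    ].join("\n");

    export default async (request: Request) => {
      const ua = request.headers.get("user-agent") ?? "";
      if (!CRAWLER_UA.test(ua)) return undefined; // normal visitors: untouched

      return new Response(SLIM_PAGE, {
        status: 200,
        headers: { "content-type": "text/html; charset=utf-8" },
      });
    };

    export const config = { path: "/*" };

The savings come from crawlers never touching the JS bundle or the heavier page payloads at all.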