
211 points | CrankyBear
thaumaturgy No.45107225
People outside a fairly small sysadmin niche really don't grasp the scale of this problem.

I run a small-but-growing boutique hosting infrastructure for agency clients. The AI bot crawler problem recently got severe enough that I couldn't just ignore it anymore.

I'm stuck between, on one end, crawlers from companies that absolutely have the engineering talent and resources to do things right but still aren't, and on the other end, resource-heavy WordPress installations where the client was told it was a build-it-and-forget-it kind of thing. I can't police their robots.txt files. Meanwhile, each page load can take a full 1 s round trip (most of that spent in MySQL), there are about six different pretty aggressive AI bots, and occasionally one gets stuck on some site's product-variant or category pages and starts hitting it at a rate of 1 request/second.

There's an invisible caching layer that does a pretty nice job with images and the like, so it's not really a bandwidth problem. The bots aren't even requesting images and other page resources very often; they're just doing tons and tons of page requests, and each of those is tying up a DB somewhere.
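A common mitigation for exactly this DB-bound page-request pattern is a short-lived full-page "microcache" in front of PHP, so repeated anonymous requests are answered from disk instead of each one reaching MySQL. A minimal sketch, assuming nginx in front of PHP-FPM (paths, zone name, and timings are illustrative assumptions, not the commenter's actual setup):

```nginx
# Cache anonymous WordPress page output for a short window.
fastcgi_cache_path /var/cache/nginx/wp levels=1:2 keys_zone=WPCACHE:32m
                   inactive=10m max_size=256m;

server {
    # ...
    location ~ \.php$ {
        fastcgi_cache       WPCACHE;
        fastcgi_cache_key   "$scheme$request_method$host$request_uri";
        fastcgi_cache_valid 200 60s;  # even 60s absorbs a 1 req/s crawler loop
        # Never serve cached pages to logged-in users.
        fastcgi_cache_bypass $cookie_wordpress_logged_in;
        fastcgi_no_cache     $cookie_wordpress_logged_in;
        fastcgi_pass unix:/run/php/php-fpm.sock;
        include fastcgi_params;
    }
}
```

With a 60-second TTL, a bot looping over the same category pages costs at most one DB-backed render per page per minute.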

Cumulatively, it is close to having a site get Slashdotted every single day.

I finally started filtering out most bot and crawler traffic at nginx, before it gets passed off to a WP container. I spent a fair bit of time sampling traffic from logs, and at a rough guess, I'd say maybe 5% of web traffic is currently coming from actual humans. It's insane.
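The filtering approach described here can be sketched with nginx's `map` directive: classify known crawler user agents and reject them before the request is proxied to the WordPress container. The bot names below are examples of commonly seen AI crawlers; the backend name is a placeholder:

```nginx
# Mark requests from known AI crawler user agents.
map $http_user_agent $is_ai_bot {
    default          0;
    ~*GPTBot         1;
    ~*ClaudeBot      1;
    ~*CCBot          1;
    ~*Bytespider     1;
    ~*PerplexityBot  1;
}

server {
    # ...
    location / {
        if ($is_ai_bot) {
            return 403;
        }
        proxy_pass http://wp_backend;  # placeholder upstream
    }
}
```

User-agent matching only catches bots that identify themselves honestly, which is why the commenter's next step is IP-level intelligence.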

I've just wrapped up the first round of work on this problem, but that's only buying a little time. Next, I've gotta put together an IP intelligence system, because clearly these companies aren't gonna take "403" for an answer.
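A first step toward that kind of IP intelligence is the log sampling mentioned above: count page requests per client IP and flag heavy hitters. A hypothetical sketch in Python, assuming nginx "combined"-format access logs; the threshold and asset-extension list are assumptions:

```python
import re
from collections import Counter

# Match the client IP and request path from a combined-format log line.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)')

STATIC_EXTS = {"css", "js", "png", "jpg", "gif", "svg", "ico", "woff2"}

def heavy_hitters(log_lines, threshold=100):
    """Return IPs with more than `threshold` page (non-asset) requests."""
    counts = Counter()
    for line in log_lines:
        m = LOG_RE.match(line)
        if not m:
            continue
        ip, path = m.groups()
        # Skip static assets; the problem traffic is page requests.
        if path.rsplit(".", 1)[-1].lower() in STATIC_EXTS:
            continue
        counts[ip] += 1
    return {ip: n for ip, n in counts.items() if n > threshold}

if __name__ == "__main__":
    sample = ['198.51.100.7 - - [01/Sep/2025:12:00:00 +0000] '
              '"GET /product/%d HTTP/1.1" 200 5120' % i for i in range(150)]
    sample.append('203.0.113.9 - - [01/Sep/2025:12:00:01 +0000] '
                  '"GET /style.css HTTP/1.1" 200 900')
    print(heavy_hitters(sample))  # → {'198.51.100.7': 150}
```

Flagged IPs could then be fed into an nginx `deny` list or a firewall set; checking them against published cloud-provider ranges helps separate datacenter crawlers from residential users.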

replies(5): >>45107483 #>>45107586 #>>45108498 #>>45109192 #>>45110318 #
gjsman-1000 No.45107483
I might write a blog post on this, but I seriously believe we collectively need to rethink The Cathedral and the Bazaar.

The Cathedral won. Full stop. Everyone, more or less, is just a stonecutter, competing to sell the best stone (i.e. content, libraries, source code, tooling) for building the cathedrals with. If the world is a farmers' market, we're shocked that the farmers' market isn't defeating Walmart; it never will.

People want Cathedrals, not Bazaars. Being a Bazaar vendor is a race to the bottom. This is not the Cathedral exploiting a "tragedy of the commons"; it's intrinsic to decentralization as a whole. The Bazaar feeds the Cathedral, just as the farmers feed Walmart, just as independent websites feed Claude: a food chain, not an aberration.

replies(2): >>45107893 #>>45109253 #
AnthonyMouse No.45109253
> The Bazaar feeds the Cathedral

Isn't this the licensing problem? Berkeley releases BSD so that everyone can use it, people do years of work to make it passable, Apple takes it to build macOS and iOS because the license allows that, and then Apple has both the community's work and its own, so everyone uses that.

The Linux kernel is GPLv2, not GPLv3, so vendors ship binary-blob drivers and firmware with their hardware. Then the hardware becomes unusable as soon as they stop publishing new versions, because continuing to use it means staying on an old kernel with known security vulnerabilities; or they lock the boot loader outright, since v2 lacks the anti-Tivoization clause of v3.

If you use a license that lets the cathedral close off the community's work then you lose, but what if you don't do that?