    211 points CrankyBear | 13 comments
    thaumaturgy ◴[] No.45107225[source]
    People outside of a really small sysadmin niche really don't grasp the scale of this problem.

    I run a small-but-growing boutique hosting infrastructure for agency clients. The AI bot crawler problem recently got severe enough that I couldn't just ignore it anymore.

    I'm stuck between, on one end, crawlers from companies that absolutely have the engineering talent and resources to do things right but still don't, and on the other end, resource-heavy WordPress installations where the client was told it was a build-it-and-forget-it kind of thing. I can't police their robots.txt files; meanwhile, each page load can take a full 1s round trip (most of that spent in MySQL), there are about 6 different pretty aggressive AI bots, and occasionally they'll get stuck on some site's product variants or category pages and start hitting it at a rate of 1 request per second.

    There's an invisible caching layer that does a pretty nice job with images and the like, so it's not really a bandwidth problem. The bots aren't even requesting images and other page resources very often; they're just doing tons and tons of page requests, and each of those is tying up a DB somewhere.

    Cumulatively, it is close to having a site get Slashdotted every single day.

    I finally started filtering out most bot and crawler traffic at nginx, before it gets passed off to a WP container. I spent a fair bit of time sampling traffic from logs, and at a rough guess, I'd say maybe 5% of web traffic is currently coming from actual humans. It's insane.
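The comment does not say how the nginx filter decides what to drop. One common first pass is matching the User-Agent header against a deny list, sketched here in Python as an assumption rather than the commenter's actual setup. The token list and the function name `status_for_user_agent` are illustrative; a real list would come from sampling logs as described above.

```python
# Illustrative crawler tokens only (an assumption, not the commenter's list).
BLOCKED_UA_TOKENS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider")


def status_for_user_agent(user_agent: str) -> int:
    """Return 403 if the User-Agent contains a denied token, else 200."""
    ua = user_agent.lower()
    if any(token.lower() in ua for token in BLOCKED_UA_TOKENS):
        return 403  # refuse before the request reaches the application
    return 200  # pass through to the backend


if __name__ == "__main__":
    print(status_for_user_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"))
    print(status_for_user_agent("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"))
```

A check like this only catches crawlers that identify themselves honestly, which is why it buys time rather than solving the problem.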

    I've just wrapped up the first round of work for this problem, but that's just buying a little time. Now, I've gotta put together an IP intelligence system, because clearly these companies aren't gonna take "403" for an answer.

    replies(5): >>45107483 #>>45107586 #>>45108498 #>>45109192 #>>45110318 #
    1. jazzyjackson ◴[] No.45107586[source]
    Couldn't it be addressed in front of the application with a fail2ban rule, some kind of 429 Too Many Requests quota on a per session basis? Or are the crawlers anonymizing themselves / coming from different IP addresses?
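A minimal sketch of the kind of quota being asked about, assuming a sliding window keyed by some client identifier (an IP address or session ID). The class name `SlidingWindowLimiter` and its parameters are hypothetical, and this is not how fail2ban itself is configured.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._hits = defaultdict(deque)  # client key -> request timestamps

    def status_for(self, client_key):
        """Return 200 if the request fits the quota, else 429."""
        now = self.clock()
        hits = self._hits[client_key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return 429  # Too Many Requests
        hits.append(now)
        return 200


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=2, window_seconds=10)
    print([limiter.status_for("203.0.113.5") for _ in range(3)])
```

The quota is only as good as the key: if each request arrives from a different address, every client stays under the limit.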
    replies(3): >>45107681 #>>45107705 #>>45107786 #
    2. sc68cal ◴[] No.45107681[source]
    They are spreading themselves across lots of different IP blocks
    3. thaumaturgy ◴[] No.45107705[source]
    Yeah, that's where IP intelligence comes in. They're using pretty big IP pools, so, either you're manually adding individual IPs to a list all day (and updating that list as ASNs get continuously shuffled around), or you've got a process in the background that essentially does whois lookups (and caches them, so you aren't also being abusive), parses the metadata returned, and decides whether that request is "okay" or not.

    The classic 80/20 rule applies. You can catch about 80% of lazy crawler activity pretty easily with something like this, but the remaining 20% will require a lot more effort. You start encountering edge cases, like crawlers that use AWS for their crawling activity, but also one of your customers somewhere is syncing their WooCommerce orders to their in-house ERP system via a process that also runs on AWS.
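The background process described above can be sketched roughly as follows. This assumes the whois lookup is wrapped in a function that returns an organization name for an IP; the lookup is passed in rather than implemented, and `IpIntel`, `lookup_org`, `blocked_orgs`, and `allowed_networks` are all hypothetical names. The allowlist stands in for edge cases like the customer's ERP sync running on the same cloud as a crawler.

```python
import ipaddress


class IpIntel:
    """Cache org lookups per IP and decide whether a request is okay."""

    def __init__(self, lookup_org, blocked_orgs, allowed_networks):
        self._lookup_org = lookup_org  # e.g. a whois wrapper (assumed)
        self._cache = {}  # ip string -> organization name
        self._blocked = {org.lower() for org in blocked_orgs}
        self._allowed = [ipaddress.ip_network(n) for n in allowed_networks]

    def org_for(self, ip):
        """Look up the owning organization once, then serve from cache."""
        if ip not in self._cache:
            self._cache[ip] = self._lookup_org(ip)
        return self._cache[ip]

    def is_allowed(self, ip):
        addr = ipaddress.ip_address(ip)
        # Explicit exceptions win, and skip the lookup entirely.
        if any(addr in net for net in self._allowed):
            return True
        return self.org_for(ip).lower() not in self._blocked


if __name__ == "__main__":
    def fake_lookup(ip):  # stand-in for a real whois query
        return "ExampleCrawlerCo" if ip.startswith("198.51.100.") else "Some ISP"

    intel = IpIntel(fake_lookup, ["ExampleCrawlerCo"], ["198.51.100.7/32"])
    print(intel.is_allowed("198.51.100.8"), intel.is_allowed("198.51.100.7"))
```

A production version would also need cache expiry, since the comment notes that ASN assignments keep shifting.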

    replies(1): >>45114283 #
    4. loloquwowndueo ◴[] No.45107786[source]
    It's called Anubis.
    replies(2): >>45108592 #>>45121081 #
    5. dylan604 ◴[] No.45108592[source]
    Isn't that the one that shows anime characters? Or is Anubis the "professional" version that doesn't show anime chars?
    replies(1): >>45109183 #
    6. greazy ◴[] No.45109183{3}[source]
    Yes, that's Anubis. And yes, you pay to not show the anime cat girl.
    replies(1): >>45113221 #
    7. tempaccount420 ◴[] No.45113221{4}[source]
    That's genius.
    replies(1): >>45114023 #
    8. krapp ◴[] No.45114023{5}[source]
    Honestly the more Anubis' anime mascot annoys people the more I like it.
    replies(1): >>45116582 #
    9. asddubs ◴[] No.45114283[source]
    I've had crawlers get stuck in a loop before on a search page where you could basically just keep adding things, even if there were no results. I filtered out requests that were definitely bots (requests specified long past the point of returning any results). They came from over a million unique IPs, most of which made only 1 or 2 requests each, spread across many different IP blocks.
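One way to express that kind of filter, as a sketch under assumptions: count how many times a filter parameter appears in the query string and reject anything past a threshold no real visitor would reach. The parameter name `filter`, the threshold, and the function name are hypothetical; the original site's URL scheme is not given.

```python
from urllib.parse import parse_qsl, urlsplit


def is_overspecified_search(url, max_filters=5, filter_param="filter"):
    """Return True if the URL stacks more filters than max_filters."""
    query = urlsplit(url).query
    count = sum(
        1
        for key, _ in parse_qsl(query, keep_blank_values=True)
        if key == filter_param
    )
    return count > max_filters


if __name__ == "__main__":
    url = "https://example.com/search?" + "&".join(["filter=x"] * 8)
    print(is_overspecified_search(url))
```

Because the check looks only at the request itself, it works even when every request comes from a different IP.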
    10. dylan604 ◴[] No.45116582{6}[source]
    The point of this is to make things difficult for bots, not to annoy visitors of the site. I respect that it is the dev's choice to do what they want with the software they create and make available for free. Anime is a polarizing format for reasons beyond the scope of this discussion. It definitely says a lot about the dev.
    replies(2): >>45116790 #>>45127239 #
    11. krapp ◴[] No.45116790{7}[source]
    Anime is only "polarizing" for an extreme subset of people. Most people won't care. No one should care, it's just a cute mascot image.
    12. croemer ◴[] No.45121081[source]
    Anubis blocks all phones with odd processor counts, many Pixel phones for example.
    13. linotype ◴[] No.45127239{7}[source]
    It says a lot more about the pearl clutching of the people complaining about it than it does the dev.