    Anubis Works

    (xeiaso.net)
    319 points evacchi | 13 comments
    1. raggi ◴[] No.43668972[source]
    It's amusing that Xe managed to turn what was historically mostly a joke/shitpost into an actually useful product. They did always say timing was everything.

    I am kind of surprised how many sites seem to want/need this. I get the slow git pages problem for some of the git servers that are super deep, lack caches, serve off slow disks, etc.

    UNESCO surprised me some. The sub-site in question is pretty big, with thousands of documents of content, but the content is static - this should be trivial to serve, so what's going on? Well, it looks like it's a poorly deployed WordPress on top of Apache, with no caching enabled, no content compression, and no HTTP/2 or HTTP/3. It would likely be fairly easy to get this serving super cheap on a very small machine, but of course doing so requires some expertise, and expertise still isn't cheap.
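
    For context, a rough sketch of the kind of Apache tuning being described here, assuming the stock mod_http2, mod_deflate, mod_expires, and mod_cache modules are available; the directive values are illustrative, not taken from the actual UNESCO deployment:

        # Illustrative Apache vhost fragment (example values, not UNESCO's config).

        # Serve over HTTP/2 where the client supports it (mod_http2).
        Protocols h2 http/1.1

        # Compress text responses on the way out (mod_deflate).
        AddOutputFilterByType DEFLATE text/html text/css application/javascript application/json

        # Let browsers cache static assets for a while (mod_expires).
        <IfModule mod_expires.c>
            ExpiresActive On
            ExpiresByType text/css "access plus 7 days"
            ExpiresByType application/javascript "access plus 7 days"
            ExpiresByType image/png "access plus 30 days"
        </IfModule>

        # Cache rendered WordPress pages server-side (mod_cache + mod_cache_disk).
        <IfModule mod_cache.c>
            CacheQuickHandler on
            CacheEnable disk /
            CacheDefaultExpire 300
        </IfModule>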

    Sure you could ask an LLM, but they still aren't good at helping when you have no clue what to ask - if you don't even really know the site is slower than it should be, why would you even ask? You'd just hear about things getting crushed and reach for the furry defender.

    replies(4): >>43669199 #>>43671284 #>>43671404 #>>43671687 #
    2. jtbayly ◴[] No.43669199[source]
    My site that I’d like this for has a lot of posts, but there are links to a faceted search system based on tags that produces an infinite number of possible combinations and pages for each one. There is no way to cache this, and the bots don’t respect the robots file, so they just constantly request URLs, getting the posts over and over in different numbers and combinations. It’s a pain.
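
    For reference, this is roughly what robots.txt rules for such a faceted search would look like (the paths and tag parameter are hypothetical placeholders); the point above is that these scrapers ignore them anyway:

        # Hypothetical robots.txt for a faceted tag search; real paths will differ.
        User-agent: *
        # Block the faceted/tag combination pages but leave the posts crawlable.
        Disallow: /search
        Disallow: /*?tag=
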
    3. cedws ◴[] No.43671284[source]
    PoW anti-bot/scraping/DDoS protection was already being done a decade ago; I'm not sure why it's only catching on now. I even recall a project that tried to make the PoW useful.
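
    For the curious, a minimal Go sketch of the general proof-of-work idea (generic, not Anubis's actual protocol): the server hands out a challenge and a difficulty, the client burns CPU finding a nonce whose hash has enough leading zero bits, and the server verifies the result with a single hash:

        // Generic proof-of-work sketch, not Anubis's actual protocol.
        package main

        import (
            "crypto/sha256"
            "fmt"
            "math/bits"
            "strconv"
        )

        // leadingZeroBits counts how many leading zero bits a SHA-256 digest has.
        func leadingZeroBits(sum [32]byte) int {
            n := 0
            for _, b := range sum {
                if b == 0 {
                    n += 8
                    continue
                }
                n += bits.LeadingZeros8(b)
                break
            }
            return n
        }

        // verify is the cheap server-side check: one hash per submission.
        func verify(challenge string, nonce uint64, difficulty int) bool {
            sum := sha256.Sum256([]byte(challenge + strconv.FormatUint(nonce, 10)))
            return leadingZeroBits(sum) >= difficulty
        }

        // solve is the expensive client-side search for a valid nonce.
        func solve(challenge string, difficulty int) uint64 {
            for nonce := uint64(0); ; nonce++ {
                if verify(challenge, nonce, difficulty) {
                    return nonce
                }
            }
        }

        func main() {
            const challenge, difficulty = "example-challenge", 16 // ~65k hashes on average
            nonce := solve(challenge, difficulty)
            fmt.Println("nonce:", nonce, "valid:", verify(challenge, nonce, difficulty))
        }
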
    replies(1): >>43671553 #
    4. adrian17 ◴[] No.43671404[source]
    > but of course doing so requires some expertise, and expertise still isn't cheap

    Sure, but at the same time, the number of people with the expertise to set up Anubis (not that it's particularly hard, but I mean: even being aware that it exists) is surely even lower than that of people with WordPress administration experience, so I'm still surprised.

    If I were to guess, the reasons for not touching WordPress were unrelated, like not wanting to touch a brittle instance, or organizational permissions, or maybe the admins just assumed that WP was already configured well.

    replies(1): >>43677175 #
    5. xena ◴[] No.43671553[source]
    Xe here. If I had to guess in two words: timing and luck. As the G-man said: the right man in the wrong place can make all the difference in the world. I was the right shitposter in the right place at the right time.

    And then the universe blessed me with a natural 20. Never had these problems before. This shit is wild.

    replies(1): >>43671812 #
    6. mrweasel ◴[] No.43671687[source]
    > I am kind of surprised how many sites seem to want/need this.

    The AI scrapers are not only poorly written, they also go out of their way to do cache busting. So far I've seen a few solutions: Cloudflare, requiring a login, Anubis, or just insane amounts of infrastructure. Some sites have reported 60% of their traffic coming from bots now; for smaller sites it's probably much higher.
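
    One mitigation for the query-string style of cache busting is to cache at a reverse proxy and leave the query string out of the cache key; a rough nginx sketch (values illustrative, upstream address is a placeholder, and this is only safe if query strings don't actually change the rendered page):

        # Illustrative nginx fragment: cache upstream pages and ignore query strings
        # in the cache key so random "?cachebust=..." parameters still hit the cache.
        proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=pages:10m max_size=1g inactive=60m;

        server {
            listen 80;
            location / {
                proxy_pass http://127.0.0.1:8080;      # placeholder upstream
                proxy_cache pages;
                proxy_cache_key "$scheme$host$uri";    # $args deliberately left out
                proxy_cache_valid 200 5m;
            }
        }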

    replies(1): >>43675613 #
    7. underdeserver ◴[] No.43671812{3}[source]
    Squeeze that lemon as far as it'll go, mate. Godspeed, and may the good luck continue.
    8. MrJohz ◴[] No.43675613[source]
    Fwiw, I run a pretty tiny site and see relatively minimal traffic coming from bots. Most of the bot traffic, when it appears, is vulnerability scanners (the /wp-admin/ requests on a static site), and has little impact on my overall stats.

    My guess is that these tools tend to be targeted at mid-sized sites — the sorts of places that are large enough to have useful content, but small enough that there probably won't be any significant repercussions, and where the ops team is small enough (or plain nonexistent) that there's not going to be much in the way of blocks. That's why a site like SourceHut gets hit quite badly, but smaller blogs stay largely out of the way.

    But that's just a working theory, without much evidence, to explain why I'm hearing so many people talk about struggling with AI bot traffic while not seeing it myself.

    replies(2): >>43675658 #>>43678728 #
    9. nicolapcweek94 ◴[] No.43675658{3}[source]
    Well, we just spun up Anubis in front of a two-user private Forgejo instance (as in publicly accessible, but with almost all content set to private/login-protected) after it started getting hammered (mostly by Amazon IPs presenting as Amazonbot) earlier in the week, resulting in a >90% traffic reduction. From what we've seen (and Xe's own posts), it seems git forges are getting hit harder than most other sites, though, so YMMV I guess.
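
    As a rough illustration of that kind of setup, a docker-compose sketch assuming the ghcr.io/techarohq/anubis image and its documented BIND/TARGET/DIFFICULTY environment variables (verify names, tags, and ports against the current Anubis and Forgejo docs before use):

        # Sketch: Anubis sits in front of Forgejo and proxies verified traffic through.
        services:
          anubis:
            image: ghcr.io/techarohq/anubis:latest    # verify image/tag against the Anubis docs
            environment:
              BIND: ":8080"                   # where Anubis listens
              TARGET: "http://forgejo:3000"   # the service being protected
              DIFFICULTY: "4"                 # proof-of-work difficulty (see Anubis docs)
            ports:
              - "8080:8080"
          forgejo:
            image: codeberg.org/forgejo/forgejo:10    # placeholder tag; pin a real release
            volumes:
              - forgejo-data:/data
        volumes:
          forgejo-data:
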
    10. raggi ◴[] No.43677175[source]
    I have trouble with that theory because the site is brimming with plugins too (you can see them, all disorganized, all over the page source), and failing to keep such a system up to date rapidly ends in tears in that ecosystem.
    11. mrweasel ◴[] No.43678728{3}[source]
    I actually have a theory, based on the last episode of the 2.5 Admins podcast. Try spinning up a MediaWiki site; I have a feeling that wiki installations are being targeted to a much higher degree. You could also do a Git repo of some sort. Either of the two could give the impression that content changes frequently.
    replies(2): >>43678852 #>>43679407 #
    12. MrJohz ◴[] No.43678852{4}[source]
    I could believe that. Plus, because both of those are more dynamic, they're going to have to do more work per request anyway, meaning the effects of scraping are exacerbated.
    13. gyaru ◴[] No.43679407{4}[source]
    Yep, I'm running a pretty sizeable game wiki and it's being scraped to hell with very specific URLs that pretty much guarantee cache busting (usually revision IDs and diffs).