Anubis Works (xeiaso.net)
313 points | evacchi | 6 comments
raggi ◴[] No.43668972[source]
It's amusing that Xe managed to turn what was historically mostly a joke/shitpost into an actually useful product. They did always say timing was everything.

I am kind of surprised how many sites seem to want/need this. I get the slow git pages problem for some of the git servers that are super deep, lack caches, serve off slow disks, etc.

UNESCO surprised me some; the sub-site in question is pretty big, with thousands of documents, but the content is static - this should be trivial to serve, so what's going on? Well, it looks like it's a poorly deployed WordPress on top of Apache, with no caching enabled, no content compression, and no HTTP/2 or HTTP/3. It would likely be fairly easy to get this serving super cheap on a very small machine, but of course doing so requires some expertise, and expertise still isn't cheap.
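
Checking for that kind of misconfiguration takes very little. Something like the sketch below (Python with httpx and its HTTP/2 extra; the URL is a placeholder) prints the negotiated protocol and the relevant response headers:

    # Minimal sketch: fetch a URL and report protocol, compression, and caching
    # headers. Assumes `pip install 'httpx[http2]'`; the URL is a placeholder.
    import httpx

    def check(url: str) -> None:
        with httpx.Client(http2=True, follow_redirects=True) as client:
            r = client.get(url)
            print("protocol:         ", r.http_version)  # e.g. HTTP/2 vs HTTP/1.1
            print("content-encoding: ", r.headers.get("content-encoding", "none"))
            print("cache-control:    ", r.headers.get("cache-control", "none"))
            print("body bytes:       ", len(r.content))

    check("https://www.example.org/")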

Sure you could ask an LLM, but they still aren't good at helping when you have no clue what to ask - if you don't even really know the site is slower than it should be, why would you even ask? You'd just hear about things getting crushed and reach for the furry defender.

replies(4): >>43669199 #>>43671284 #>>43671404 #>>43671687 #
1. mrweasel ◴[] No.43671687[source]
> I am kind of surprised how many sites seem to want/need this.

The AI scrapers are not only poorly written, they also go out of their way to bust caches. So far I've seen a few solutions: Cloudflare, requiring a login, Anubis, or just insane amounts of infrastructure. Some sites have reported 60% of their traffic coming from bots now; for smaller sites it's probably much higher.
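
If you want a rough number for your own site, a quick pass over the access log is enough. A minimal sketch, assuming the standard combined log format; the keyword list is illustrative only and will undercount scrapers that spoof browser user agents:

    # Minimal sketch: estimate what share of requests in an access log come
    # from self-identified bots. Assumes the common "combined" log format.
    import re
    from collections import Counter

    BOT_HINTS = ("bot", "crawl", "spider")  # matches Amazonbot, GPTBot, ClaudeBot, ...
    UA_RE = re.compile(r'"[^"]*" "(?P<ua>[^"]*)"$')  # last two quoted fields: referer, user-agent

    def bot_share(path: str) -> float:
        counts = Counter()
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                m = UA_RE.search(line.rstrip())
                if not m:
                    continue
                ua = m.group("ua").lower()
                counts["bot" if any(h in ua for h in BOT_HINTS) else "other"] += 1
        total = sum(counts.values()) or 1
        return counts["bot"] / total

    print(f"bot share: {bot_share('/var/log/nginx/access.log'):.1%}")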

replies(1): >>43675613 #
2. MrJohz ◴[] No.43675613[source]
Fwiw, I run a pretty tiny site and see relatively minimal traffic coming from bots. Most of the bot traffic, when it appears, is vulnerability scanners (the /wp-admin/ requests on a static site), and has little impact on my overall stats.

My guess is that these tools tend to be targeted at mid-sized sites — the sorts of places that are large enough to have useful content, but small enough that there probably won't be any significant repercussions, and where the ops team is small enough (or plain nonexistent) that there's not going to be much in the way of blocks. That's why a site like SourceHut gets hit quite badly, but smaller blogs stay largely out of the way.

But that's just a working theory, without much evidence, trying to explain why I'm hearing so many people talk about struggling with AI bot traffic while not seeing it myself.

replies(2): >>43675658 #>>43678728 #
3. nicolapcweek94 ◴[] No.43675658[source]
Well, we just spun up Anubis in front of a two-user Forgejo instance (publicly accessible, but with almost all content set to private/login-protected) after it started getting hammered earlier in the week, mostly by Amazon IPs presenting as Amazonbot, and saw a >90% traffic reduction. From what we’ve seen (and Xe’s own posts) it seems git forges are getting hit harder than most other sites, though, so YMMV I guess.
4. mrweasel ◴[] No.43678728[source]
I actually have a theory, based on the last episode of the 2.5 Admins podcast. Try spinning up a MediaWiki site; I have a feeling that wiki installations are being targeted to a much higher degree. You could also do a Git repo of some sort. Either of the two could give the impression that content is changed frequently.
replies(2): >>43678852 #>>43679407 #
5. MrJohz ◴[] No.43678852{3}[source]
I could believe that. Plus, because both of those are more dynamic, they're going to have to do more work per request anyway, meaning the effects of scraping are exacerbated.
6. gyaru ◴[] No.43679407{3}[source]
Yep, I'm running a pretty sizeable game wiki and it's being scraped to hell with very specific URLs (usually revision IDs and diffs) that pretty much guarantee cache busting.
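
For what it's worth, those requests are easy to pick out on a stock MediaWiki install: they're the index.php URLs carrying oldid=, diff=, or action= parameters, which bypass the usual page caches. A rough sketch for flagging them (paths will differ if the wiki uses a non-default entry point):

    # Minimal sketch: flag MediaWiki URLs that bypass the usual page caches
    # (old revisions, diffs, history/raw actions). Assumes the stock
    # index.php entry point; adjust if the wiki uses different paths.
    from urllib.parse import urlparse, parse_qs

    CACHE_BUSTING_PARAMS = {"oldid", "diff", "action", "curid"}

    def busts_cache(url: str) -> bool:
        parts = urlparse(url)
        if "index.php" not in parts.path:
            return False
        return bool(CACHE_BUSTING_PARAMS & parse_qs(parts.query).keys())

    print(busts_cache("https://wiki.example.org/index.php?title=Foo&oldid=12345"))  # True
    print(busts_cache("https://wiki.example.org/wiki/Foo"))                         # False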