    Anubis Works

    (xeiaso.net)
    319 points evacchi | 13 comments
    1. raggi ◴[] No.43668972[source]
    It's amusing that Xe managed to turn what was historically mostly a joke/shitpost into an actually useful product. They did always say timing was everything.

    I am kind of surprised how many sites seem to want/need this. I get the slow git pages problem for some of the git servers that are super deep, lack caches, serve off slow disks, etc.

    UNESCO surprised me some. The sub-site in question is pretty big, with thousands of documents of content, but the content is static - this should be trivial to serve, so what's going on? Well, it looks like it's a poorly deployed WordPress on top of Apache, with no caching enabled, no content compression, and no HTTP/2 or HTTP/3. It would likely be fairly easy to get this serving super cheap on a very small machine, but of course doing so requires some expertise, and expertise still isn't cheap.
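
    For context, a rough sketch of the kind of Apache tuning being described here, assuming the stock mod_http2, mod_deflate, mod_expires, and mod_cache modules are available; the directive values are illustrative, not taken from the actual UNESCO deployment:

        # Illustrative Apache vhost fragment (example values, not UNESCO's config).

        # Serve over HTTP/2 where the client supports it (mod_http2).
        Protocols h2 http/1.1

        # Compress text responses on the way out (mod_deflate).
        AddOutputFilterByType DEFLATE text/html text/css application/javascript application/json

        # Let browsers cache static assets for a while (mod_expires).
        <IfModule mod_expires.c>
            ExpiresActive On
            ExpiresByType text/css "access plus 7 days"
            ExpiresByType application/javascript "access plus 7 days"
            ExpiresByType image/png "access plus 30 days"
        </IfModule>

        # Cache rendered WordPress pages server-side (mod_cache + mod_cache_disk).
        <IfModule mod_cache.c>
            CacheQuickHandler on
            CacheEnable disk /
            CacheDefaultExpire 300
        </IfModule>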

    Sure you could ask an LLM, but they still aren't good at helping when you have no clue what to ask - if you don't even really know the site is slower than it should be, why would you even ask? You'd just hear about things getting crushed and reach for the furry defender.

    replies(4): >>43669199 #>>43671284 #>>43671404 #>>43671687 #
    2. jtbayly ◴[] No.43669199[source]
    My site that I’d like this for has a lot of posts, but there are links to a faceted search system based on tags that produces an infinite number of possible combinations and pages for each one. There is no way to cache this, and the bots don’t respect the robots file, so they just constantly request URLs, getting the posts over and over in different numbers and combinations. It’s a pain.
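
    For reference, this is roughly what robots.txt rules for such a faceted search would look like (the paths and tag parameter are hypothetical placeholders); the point above is that these scrapers ignore them anyway:

        # Hypothetical robots.txt for a faceted tag search; real paths will differ.
        User-agent: *
        # Block the faceted/tag combination pages but leave the posts crawlable.
        Disallow: /search
        Disallow: /*?tag=
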
    3. cedws ◴[] No.43671284[source]
    PoW anti-bot/scraping/DDoS protection was already being done a decade ago; I'm not sure why it's only catching on now. I even recall a project that tried to make the PoW useful.
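
    For the curious, a minimal Go sketch of the general proof-of-work idea (generic, not Anubis's actual protocol): the server hands out a challenge and a difficulty, the client burns CPU finding a nonce whose hash has enough leading zero bits, and the server verifies the result with a single hash:

        // Generic proof-of-work sketch, not Anubis's actual protocol.
        package main

        import (
            "crypto/sha256"
            "fmt"
            "math/bits"
            "strconv"
        )

        // leadingZeroBits counts how many leading zero bits a SHA-256 digest has.
        func leadingZeroBits(sum [32]byte) int {
            n := 0
            for _, b := range sum {
                if b == 0 {
                    n += 8
                    continue
                }
                n += bits.LeadingZeros8(b)
                break
            }
            return n
        }

        // verify is the cheap server-side check: one hash per submission.
        func verify(challenge string, nonce uint64, difficulty int) bool {
            sum := sha256.Sum256([]byte(challenge + strconv.FormatUint(nonce, 10)))
            return leadingZeroBits(sum) >= difficulty
        }

        // solve is the expensive client-side search for a valid nonce.
        func solve(challenge string, difficulty int) uint64 {
            for nonce := uint64(0); ; nonce++ {
                if verify(challenge, nonce, difficulty) {
                    return nonce
                }
            }
        }

        func main() {
            const challenge, difficulty = "example-challenge", 16 // ~65k hashes on average
            nonce := solve(challenge, difficulty)
            fmt.Println("nonce:", nonce, "valid:", verify(challenge, nonce, difficulty))
        }
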
    replies(1): >>43671553 #
    4. adrian17 ◴[] No.43671404[source]
    > but of course doing so requires some expertise, and expertise still isn't cheap

    Sure, but at the same time, the number of people with the expertise to set up Anubis (not that it's particularly hard, but I mean: even being aware that it exists) is surely even lower than that of people with WordPress administration experience, so I'm still surprised.

    If I were to guess, the reasons for not touching WordPress were unrelated, like not wanting to touch a brittle instance, or organizational permissions, or maybe the admins just assumed that WP was already configured well.

    replies(1): >>43677175 #
    5. xena ◴[] No.43671553[source]
    Xe here. If I had to guess in two words: timing and luck. As the G-man said: the right man in the wrong place can make all the difference in the world. I was the right shitposter in the right place at the right time.

    And then the universe blessed me with a natural 20. Never had these problems before. This shit is wild.

    replies(1): >>43671812 #
    6. mrweasel ◴[] No.43671687[source]
    > I am kind of surprised how many sites seem to want/need this.

    The AI scrapers are not only poorly written, they also go out of their way to do cache busting. So far I've seen a few solutions: Cloudflare, requiring a login, Anubis, or just insane amounts of infrastructure. Some sites have reported 60% of their traffic coming from bots now; for smaller sites it's probably much higher.
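
    One mitigation for the query-string style of cache busting is to cache at a reverse proxy and leave the query string out of the cache key; a rough nginx sketch (values illustrative, upstream address is a placeholder, and this is only safe if query strings don't actually change the rendered page):

        # Illustrative nginx fragment: cache upstream pages and ignore query strings
        # in the cache key so random "?cachebust=..." parameters still hit the cache.
        proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=pages:10m max_size=1g inactive=60m;

        server {
            listen 80;
            location / {
                proxy_pass http://127.0.0.1:8080;      # placeholder upstream
                proxy_cache pages;
                proxy_cache_key "$scheme$host$uri";    # $args deliberately left out
                proxy_cache_valid 200 5m;
            }
        }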

    replies(1): >>43675613 #
    7. underdeserver ◴[] No.43671812{3}[source]
    Squeeze that lemon as far as it'll go, mate. Godspeed, and may the good luck continue.
    8. MrJohz ◴[] No.43675613[source]
    Fwiw, I run a pretty tiny site and see relatively minimal traffic coming from bots. Most of the bot traffic, when it appears, is vulnerability scanners (the /wp-admin/ requests on a static site), and has little impact on my overall stats.

    My guess is that these tools tend to be targeted at mid-sized sites — the sorts of places that are large enough to have useful content, but small enough that there probably won't be any significant repercussions, and where the ops team is small enough (or plain nonexistent) that there's not going to be much in the way of blocks. That's why a site like SourceHut gets hit quite badly, but smaller blogs stay largely out of the way.

    But that's just a working theory, without much evidence, to explain why I'm hearing so many people talk about struggling with AI bot traffic while not seeing it myself.

    replies(2): >>43675658 #>>43678728 #
    9. nicolapcweek94 ◴[] No.43675658{3}[source]
    Well, we just spun up Anubis in front of a two-user private Forgejo instance (as in publicly accessible, but with almost all content set to private/login-protected) after it started getting hammered (mostly by Amazon IPs presenting as Amazonbot) earlier in the week, resulting in a >90% traffic reduction. From what we've seen (and Xe's own posts), it seems git forges are getting hit harder than most other sites, though, so YMMV I guess.
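
    As a rough illustration of that kind of setup, a docker-compose sketch assuming the ghcr.io/techarohq/anubis image and its documented BIND/TARGET/DIFFICULTY environment variables (verify names, tags, and ports against the current Anubis and Forgejo docs before use):

        # Sketch: Anubis sits in front of Forgejo and proxies verified traffic through.
        services:
          anubis:
            image: ghcr.io/techarohq/anubis:latest    # verify image/tag against the Anubis docs
            environment:
              BIND: ":8080"                   # where Anubis listens
              TARGET: "http://forgejo:3000"   # the service being protected
              DIFFICULTY: "4"                 # proof-of-work difficulty (see Anubis docs)
            ports:
              - "8080:8080"
          forgejo:
            image: codeberg.org/forgejo/forgejo:10    # placeholder tag; pin a real release
            volumes:
              - forgejo-data:/data
        volumes:
          forgejo-data:
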
    10. raggi ◴[] No.43677175[source]
    I have trouble with that theory because the site is brimming with plugins too (you can see them, all disorganized, all over the page source), and failing to keep such a system up to date rapidly ends in tears in that ecosystem.
    11. mrweasel ◴[] No.43678728{3}[source]
    I actually have a theory, based on the last episode of the 2.5 Admins podcast. Try spinning up a MediaWiki site; I have a feeling that wiki installations are being targeted to a much higher degree. You could also do a Git repo of some sort. Either of the two could give the impression that content changes frequently.
    replies(2): >>43678852 #>>43679407 #
    12. MrJohz ◴[] No.43678852{4}[source]
    I could believe that. Plus, because both of those are more dynamic, they're going to have to do more work per request anyway, meaning the effects of scraping are exacerbated.
    13. gyaru ◴[] No.43679407{4}[source]
    Yep, I'm running a pretty sizeable game wiki and it's being scraped to hell with very specific URLs that pretty much guarantee cache busting (usually revision IDs and diffs).