Does the inferred "topic" of the domain match the topic of the individual pages? If not -> manual review. And there are many more indicators.
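As a rough sketch of that topic check (an assumed heuristic, not anyone's actual pipeline), you could compare each page's term distribution against a domain-wide profile and flag outliers for review; a real system would use a proper topic model rather than bag-of-words cosine similarity.

# Sketch of a domain-vs-page topic consistency check (assumed heuristic,
# not any particular crawler's implementation). Bag-of-words cosine
# similarity stands in for a real topic model.
import math
import re
from collections import Counter

def term_vector(text: str) -> Counter:
    """Lowercased word counts, ignoring very short tokens."""
    return Counter(re.findall(r"[a-z]{3,}", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def flag_for_review(domain_pages: dict[str, str], threshold: float = 0.15) -> list[str]:
    """Return URLs whose content diverges from the domain-wide profile."""
    domain_profile = Counter()
    for text in domain_pages.values():
        domain_profile.update(term_vector(text))
    return [
        url for url, text in domain_pages.items()
        if cosine(domain_profile, term_vector(text)) < threshold
    ]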
Hire a bunch of student jobbers, have them search github for tarpits, and let them write middleware to detect those.
If you are doing broad crawling, you already need to do this kind of thing anyway.
There's a ton of these types of things online; you can't, e.g., exhaustively crawl every Wikipedia mirror someone's put online.
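One hypothetical shape for that middleware: fingerprint response bodies and stop spending crawl budget on domains that mostly serve content already seen elsewhere, which catches mirrors and many generated tarpits. This is a stdlib sketch under those assumptions, not code from any crawler discussed here; a real system would use shingling or simhash rather than exact hashes.

# Hypothetical crawl middleware: track content fingerprints per domain and
# stop fetching from domains that mostly serve pages already seen elsewhere
# (typical of Wikipedia mirrors and generated tarpits).
import hashlib
from collections import defaultdict

class DuplicateDomainFilter:
    def __init__(self, min_pages: int = 50, max_dup_ratio: float = 0.8):
        self.seen_hashes: set[str] = set()          # fingerprints seen anywhere
        self.stats = defaultdict(lambda: [0, 0])    # domain -> [pages, duplicates]
        self.min_pages = min_pages
        self.max_dup_ratio = max_dup_ratio

    def should_keep_crawling(self, domain: str, body: bytes) -> bool:
        fingerprint = hashlib.sha256(body).hexdigest()
        counts = self.stats[domain]
        counts[0] += 1
        if fingerprint in self.seen_hashes:
            counts[1] += 1
        else:
            self.seen_hashes.add(fingerprint)
        pages, dups = counts
        if pages < self.min_pages:
            return True                              # not enough evidence yet
        return (dups / pages) < self.max_dup_ratio   # drop mostly-duplicate domains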
Not really? As mentioned by others, such tarpits are easily mitigated by using a priority queue. For instance, crawlers can prioritize external links over internal links, which means if your blog post makes it to HN, it'll get crawled ahead of the tarpit. If it's discoverable and readable by actual humans, AI bots will be able to scrape it.
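A minimal sketch of that priority-queue idea (hypothetical, not any specific crawler's frontier): links discovered on a different host outrank links found on the same host, so a tarpit's self-links can't starve externally referenced pages. The URLs in the usage lines are placeholders.

# Minimal crawl frontier sketch: links discovered from a different host get
# higher priority than internal links, so a tarpit's self-links never starve
# externally referenced pages (e.g. a blog post linked from HN).
import heapq
import itertools
from urllib.parse import urlparse

EXTERNAL, INTERNAL = 0, 1   # lower number = higher priority

class Frontier:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-breaker keeps FIFO order within a priority

    def add(self, url, found_on=None):
        same_host = found_on and urlparse(found_on).netloc == urlparse(url).netloc
        priority = INTERNAL if same_host else EXTERNAL
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

frontier = Frontier()
frontier.add("https://example.com/blog/post")   # discovered externally
frontier.add("https://tarpit.example.net/maze/1", found_on="https://tarpit.example.net/")  # self-link
print(frontier.pop())   # the externally discovered page comes out first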
My tool does have a second component - linkmaze - which generates a bunch of nonsense text with a Markov generator, and serves infinite links (like Nepenthes does), but I generally only throw incorrigible bots at it (and, as others have noted in-thread, most crawlers already set some kind of limit on how many requests they'll send to a given site, especially a small site). I do use it for PHP-exploit crawlers as well, though I've seen no evidence those fall into the maze -- I think they mostly just look for some string indicating a successful exploit and move on if whatever they're looking for isn't present.
But, for my use case, I don't really care if someone fingerprints content generated by my tool and avoids it. That's the point: I've set robots.txt to tell these people not to crawl my site.
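For illustration only (not the actual Quixotic or Nepenthes code), here is a toy sketch of the linkmaze idea: a word-level Markov chain over an arbitrary corpus (corpus.txt is a placeholder filename), served by a handler that always emits a handful of fresh in-maze links.

# Toy version of the "linkmaze" idea described above: a word-level Markov
# chain generates nonsense text, and every page links to more randomly
# named maze pages.
import random
from collections import defaultdict
from http.server import BaseHTTPRequestHandler, HTTPServer

SOURCE = open("corpus.txt").read().split()   # any text to seed the chain (placeholder path)

chain = defaultdict(list)
for a, b in zip(SOURCE, SOURCE[1:]):
    chain[a].append(b)

def babble(n_words=200):
    word = random.choice(SOURCE)
    out = [word]
    for _ in range(n_words):
        word = random.choice(chain[word]) if chain[word] else random.choice(SOURCE)
        out.append(word)
    return " ".join(out)

class Maze(BaseHTTPRequestHandler):
    def do_GET(self):
        links = "".join(f'<a href="/maze/{random.getrandbits(32):x}">more</a> ' for _ in range(5))
        body = f"<html><body><p>{babble()}</p>{links}</body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), Maze).serve_forever()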
In addition to Quixotic (my tool) and Nepenthes, I know of:
* https://github.com/Fingel/django-llm-poison
* https://codeberg.org/MikeCoats/poison-the-wellms
* https://codeberg.org/timmc/marko/
0 - https://marcusb.org/hacks/quixotic.html
1 - I use the ai.robots.txt user agent list from https://github.com/ai-robots-txt/ai.robots.txt
Maybe you don't want your stuff to get thrown into the latest Silicon Valley commercial operation without getting paid for it. That seems like a valid position to take. Or maybe you just don't want Claude's ridiculously badly behaved scraper to chew through your entire budget.
Regardless, scrapers that don't follow rules like robots.txt will pretty quickly discover why those rules exist in the first place, as they receive increasing amounts of garbage.
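As a rough illustration of that routing (a hypothetical sketch, not anyone's production setup), a small WSGI middleware could match the request's user agent against a blocklist such as the ai.robots.txt list mentioned below and hand matching bots a garbage page instead of the real content. The user-agent substrings and disallowed prefixes here are sample placeholders; the garbage_page callable could be something like the Markov babbler sketched above.

# Hypothetical WSGI middleware: requests from user agents on a blocklist
# that hit paths robots.txt disallows for them get garbage instead of the
# real response.
AI_BOT_SUBSTRINGS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider"]   # sample entries
DISALLOWED_PREFIXES = ["/"]   # whatever your robots.txt disallows for those agents

def poison_middleware(app, garbage_page):
    def wrapped(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        path = environ.get("PATH_INFO", "/")
        is_ai_bot = any(s in ua for s in AI_BOT_SUBSTRINGS)
        is_disallowed = any(path.startswith(p) for p in DISALLOWED_PREFIXES)
        if is_ai_bot and is_disallowed:
            body = garbage_page().encode()   # e.g. the Markov babble sketched earlier
            start_response("200 OK", [("Content-Type", "text/html")])
            return [body]
        return app(environ, start_response)
    return wrapped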
httpunch() {
  local url=$1
  local connections=${2:-${HTTPUNCH_CONNECTIONS:-100}}
  local action=$1
  local keepalive_time=${HTTPUNCH_KEEPALIVE:-60}
  local silent_mode=false

  # Check if "kill" was passed as the first argument
  if [[ $action == "kill" ]]; then
    echo "Killing all curl processes..."
    pkill -f "curl --no-buffer"
    return
  fi

  # Parse optional --silent argument
  for arg in "$@"; do
    if [[ $arg == "--silent" ]]; then
      silent_mode=true
      break
    fi
  done

  # Ensure URL is provided if "kill" is not used
  if [[ -z $url ]]; then
    echo "Usage: httpunch [kill | <url>] [number_of_connections] [--silent]"
    echo "Environment variables: HTTPUNCH_CONNECTIONS (default: 100), HTTPUNCH_KEEPALIVE (default: 60)."
    return 1
  fi

  echo "Starting $connections connections to $url..."
  for ((i = 1; i <= connections; i++)); do
    if $silent_mode; then
      curl --no-buffer --silent --output /dev/null --keepalive-time "$keepalive_time" "$url" &
    else
      curl --no-buffer --keepalive-time "$keepalive_time" "$url" &
    fi
  done

  echo "$connections connections started with a keepalive time of $keepalive_time seconds."
  echo "Use 'httpunch kill' to terminate them."
}
(Generated in a few seconds with the help of an LLM, of course.) Your free speech is also my free speech. LLMs are just a very useful tool, and Llama, for example, is open source and also needs to be trained on data. And I <opinion> just can't stand knee-jerk anti-corporate AI doomers who decide to just create chaos instead of using that same energy to try to steer the progress </opinion>.
https://gist.github.com/pmarreck/970e5d040f9f91fd9bce8a4bcee...
LLMs are an accelerant, like all previous tools... not a replacement, although it seems most people still need to figure that out for themselves, while I already have.