198 points | todsacerdoti | 1 comment
SquareWheel No.45942060
That may work for blocking bad automated crawlers, but an agent acting on behalf of a user wouldn't follow robots.txt. It would run the risk of hitting the bad URL while trying to understand the page.
replies(2): >>45942461 >>45942729
klodolph No.45942461
That sounds like the desired outcome here. Your agent should respect robots.txt, OR it should be designed to not follow links.
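
A minimal sketch of the first option, using only Python's standard library (the site URL and user-agent string here are placeholders):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")  # placeholder site
    rp.read()

    def may_follow(url: str, agent: str = "my-agent/1.0") -> bool:
        # False for anything the site disallows, e.g. a trap URL
        return rp.can_fetch(agent, url)

An agent that calls may_follow() before each navigation gets the robots.txt behavior; one that simply never navigates beyond the page it was given satisfies the second option for free.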
replies(1): >>45942613
varenc No.45942613
An agent acting on my behalf, following my specific and narrowly scoped instructions, should not obey robots.txt because it's not a robot/crawler, just as a single cURL request doesn't consult robots.txt. (It also shouldn't generate any more traffic than a regular browser user would.)

Unfortunately, "mass scraping the internet for training data" and an "LLM-powered user agent" get lumped together too often as "AI crawlers". The user agent shouldn't actually be crawling.
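
To make that distinction concrete, here is a minimal sketch (the function and header names are illustrative, not any particular product's code):

    import urllib.request

    def fetch_for_user(url: str) -> bytes:
        # One request, on the user's behalf, following no links:
        # roughly the traffic a person clicking in a browser generates.
        req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    # A crawler, by contrast, loops: fetch, extract links, enqueue, repeat.
    # That discovery loop is the behavior robots.txt exists to govern.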

replies(5): >>45942682 >>45942689 >>45942743 >>45942744 >>45943011
saurik No.45943011
If your specific and narrowly scoped instructions cause the agent, acting on your behalf, to click a link that clearly isn't going to help it (a link that only the scrapers click, because the scrapers are blindly downloading everything they can find without any real goal), then, frankly, you might as well be blocked too: your narrowly scoped instructions must literally have been something like "scrape this website without paying any attention to what you are doing." An actual agent, just like an actual human, wouldn't find or click that link. (And none of this has anything to do with robots.txt.)
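
For context, the trap under discussion might look something like this on the server side (a sketch only; the path name and the permanent-ban policy are assumptions, not the article's actual implementation):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    BANNED = set()  # IPs that ever requested the trap URL

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            ip = self.client_address[0]
            if self.path == "/trap":   # also listed under Disallow: in robots.txt
                BANNED.add(ip)         # only blind scrapers ever request it
            if ip in BANNED:
                self.send_error(403)
                return
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            # Invisible to humans; an attentive agent has no reason to follow it.
            self.wfile.write(b'<a href="/trap" style="display:none">ignore</a>')

    HTTPServer(("", 8000), Handler).serve_forever()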
If your specific and narrowly scoped instructions cause the agent, acting on your behalf, to click that link that clearly isn't going to help it--a link that is only being clicked by the scrapers because the scrapers are blindly downloading everything they can find without having any real goal--then, frankly, you might as well be blocked also, as your narrowly scoped instructions must literally have been something like "scrape this website without paying any attention to what you are doing", as an actual agent--just like an actual human--wouldn't find our click that link (and that this is true has nothing at all to do with robots.txt).