Unfortunately "mass scraping the internet for training data" and an "LLM powered user agent" get lumped together too much as "AI Crawlers". The user agent shouldn't actually be crawling.
How does this make you any different from the bad-faith LLM actors they are trying to block?
They get lumped together because they're more or less indistinguishable and cause similar problems: server load spikes, increased bandwidth, increased AWS bill ... with no discernible benefit for the server operator such as increased user engagement or ad revenue.
Now all automated requests are considered guilty until proven innocent. If you want your agent to be allowed, it's on you to prove that you're different. Maybe start by slowing down your agent so that it doesn't make requests any faster than the average human visitor would.
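That kind of throttle is only a few lines in Python. A sketch, assuming a simple single-threaded fetcher; `MIN_DELAY` and `throttle` are made-up names, and 3 seconds is just a guess at "average human" pacing:

```python
import time

# Illustrative pacing; tune to whatever matches a human clicking around.
MIN_DELAY = 3.0  # seconds between outbound requests

_last_request = 0.0  # monotonic timestamp of the previous request


def throttle() -> None:
    """Block until at least MIN_DELAY has passed since the last call."""
    global _last_request
    wait = MIN_DELAY - (time.monotonic() - _last_request)
    if wait > 0:
        time.sleep(wait)
    _last_request = time.monotonic()


# Hypothetical usage: call throttle() right before every fetch, e.g.
#   throttle()
#   page = urllib.request.urlopen(url).read()
```

Not a real rate limiter (no jitter, no per-host buckets), but it's enough to stop an agent from hammering one site faster than a person would.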
Sure, a bad site could use this to screw with people, but bad sites have done that since forever in various ways. But if this technique helps against malicious crawlers, I think it's fair. The only downside I can see is that Google might mark you as a malware site. But again, they should be obeying robots.txt.
This is not banning you for following <h1><a>Today's Weather</a></h1>
If a robot is so poorly coded that it follows links it clearly shouldn't, links that are explicitly enumerated as not to be followed, that's a problem. From an operator's perspective, how is this different from the case you described?
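For context, the trap being discussed looks roughly like this; the paths and rules here are made up:

```
# robots.txt -- compliant crawlers never request anything under /trap/
User-agent: *
Disallow: /trap/

<!-- hidden link embedded in pages; no human ever sees or clicks it,
     so any client that fetches /trap/bait can be assumed to be a
     crawler ignoring robots.txt, and gets banned -->
<a href="/trap/bait" style="display:none">do not follow</a>
```

A well-behaved crawler reads the disallow rule and never trips it; a human never renders the link. Only a bot that both crawls and ignores robots.txt ends up in the trap.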
If a googler kicked off the googlebot manually from a session every morning, should they not respect robots.txt either?
The line gets blurrier with things like OAI's Atlas browser. It's just re-skinned Chromium, a regular browser, except you can ask an LLM about the content of the page you just navigated to. The decision to use an LLM on that page is made after the page load. Doing the same thing without rendering the page doesn't seem meaningfully different.
In general, robots.txt is for headless automated crawlers fetching many pages, not for software performing a specific request for a user. If there's a 1:1 mapping between a user's request and a page load, then it's not a robot. An LLM powered user agent (browser) wouldn't follow invisible links, or any links at all, because it's not crawling.
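And for anything that *is* crawling, honoring robots.txt costs almost nothing; Python's stdlib does the parsing. The agent name and rules below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# parse() takes the file's lines directly, so no network fetch is needed
# for this demo; real code would call set_url(...) and read() instead.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /trap/",
])

print(rp.can_fetch("MyAgent/1.0", "https://example.com/trap/bait"))  # False
print(rp.can_fetch("MyAgent/1.0", "https://example.com/weather"))    # True
```

Two lines of setup and one check per URL; there's no excuse for a crawler to skip it.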
The significant difference isn't in whether a robot is doing the actions for you or not, it's whether the robot is a user agent for a human or not.
Just trying to make the point that an LLM powered user agent fetching a single page at my request isn't a robot.