198 points todsacerdoti | 3 comments
SquareWheel ◴[] No.45942060[source]
That may work for blocking bad automated crawlers, but an agent acting on behalf of a user wouldn't follow robots.txt. They'd run the risk of hitting the bad URL when trying to understand the page.
replies(2): >>45942461 #>>45942729 #
klodolph ◴[] No.45942461[source]
That sounds like the desired outcome here. Your agent should respect robots.txt, OR it should be designed to not follow links.
replies(1): >>45942613 #
varenc ◴[] No.45942613[source]
An agent acting on my behalf, following my specific and narrowly scoped instructions, should not obey robots.txt because it's not a robot/crawler. Just like how a single cURL request shouldn't follow robots.txt. (It also shouldn't generate any more traffic than a regular browser user)

Unfortunately "mass scraping the internet for training data" and an "LLM powered user agent" get lumped together too much as "AI Crawlers". The user agent shouldn't actually be crawling.

replies(5): >>45942682 #>>45942689 #>>45942743 #>>45942744 #>>45943011 #
hyperhopper ◴[] No.45942682[source]
Confused as to what you're asking for here. You want a robot acting out of spec, to not be treated as a robot acting out of spec, because you told it to?

How does this make you any different than the bad faith LLM actors they are trying to block?

replies(2): >>45942728 #>>45942925 #
ronsor ◴[] No.45942728{3}[source]
robots.txt is for automated, headless crawlers, NOT user-initiated actions. If a human directly triggers the action, then robots.txt should not be followed.
replies(1): >>45942842 #
hyperhopper ◴[] No.45942842{4}[source]
But what action are you triggering that automatically follows invisible links? Especially ones that are not meant to be followed and carry text saying not to follow them.

This is not banning you for following <h1><a>Today's Weather</a></h1>

If you are a robot that's so poorly coded that it follows links it clearly shouldn't, links that are explicitly enumerated as not to be followed, that's a problem. From an operator's perspective, how is this different from the case you described?

If a googler kicked off the googlebot manually from a session every morning, should they not respect robots.txt either?

replies(1): >>45942863 #
1. varenc ◴[] No.45942863{5}[source]
I was responding to someone earlier saying a user agent should respect robots.txt. An LLM powered user-agent wouldn't follow links, invisible or not, because it's not crawling.
replies(1): >>45942913 #
2. hyperhopper ◴[] No.45942913[source]
It very feasibly could. If I made an LLM agent that clicks on a returned element, and that element was this trap-doored link, that would happen.
replies(1): >>45950078 #
3. varenc ◴[] No.45950078[source]
There's a fuzzy line between an agent analyzing the content of a single page I requested, and one making many page fetches on my behalf. I think it's fair to treat an agent that clicks an invisible link as a robot/crawler since that agent is causing more traffic than a regular user agent (browser).

Just trying to make the point that an LLM powered user agent fetching a single page at my request isn't a robot.
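
For anyone who wants the distinction in this subthread spelled out, here is a minimal Python sketch, not anyone's actual agent. It assumes a hypothetical robots.txt that disallows a `/trap/` path, a made-up user agent string `example-agent`, and two illustrative helper names (`may_fetch`, `should_follow`). The single page the user explicitly asked for is fetched as-is, while any link the agent discovers on its own is checked against robots.txt first, using the standard library's `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; the /trap/ path is an assumption.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def should_follow(robots_txt: str, user_agent: str, url: str,
                  user_requested: bool) -> bool:
    """Policy sketch: a URL the user explicitly asked for is always fetched;
    a link the agent found by itself must pass the robots.txt check."""
    if user_requested:
        return True
    return may_fetch(robots_txt, user_agent, url)


if __name__ == "__main__":
    agent = "example-agent"  # hypothetical user agent string
    print(should_follow(ROBOTS_TXT, agent, "https://example.com/weather", True))
    print(should_follow(ROBOTS_TXT, agent, "https://example.com/trap/x", False))
```

Under this policy, an agent that "clicks" a trap link it found on the page would be stopped by the robots.txt check, while a single user-requested fetch behaves like a cURL call. Whether that is the right line to draw is exactly what the thread is arguing about.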