
198 points todsacerdoti | 1 comment
SquareWheel ◴[] No.45942060[source]
That may work for blocking bad automated crawlers, but an agent acting on behalf of a user wouldn't follow robots.txt. It would run the risk of hitting the bad URL when trying to understand the page.
replies(2): >>45942461 #>>45942729 #
klodolph ◴[] No.45942461[source]
That sounds like the desired outcome here. Your agent should respect robots.txt, OR it should be designed to not follow links.
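For what it's worth, a minimal sketch of the first option (Python stdlib only; the URL and user-agent string below are placeholders): check robots.txt before each fetch and skip anything it disallows.

    import urllib.request
    import urllib.robotparser
    from urllib.parse import urlparse

    AGENT_UA = "example-agent/0.1"            # placeholder user-agent
    url = "https://example.com/some/page"     # placeholder target

    # Fetch and parse the site's robots.txt once.
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()

    # Only request the page if robots.txt allows it for this user-agent.
    if rp.can_fetch(AGENT_UA, url):
        req = urllib.request.Request(url, headers={"User-Agent": AGENT_UA})
        with urllib.request.urlopen(req) as resp:
            html = resp.read()
    else:
        print("robots.txt disallows this URL; skipping")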
replies(1): >>45942613 #
varenc ◴[] No.45942613[source]
An agent acting on my behalf, following my specific and narrowly scoped instructions, should not obey robots.txt because it's not a robot/crawler, just like a single cURL request shouldn't follow robots.txt. (It also shouldn't generate any more traffic than a regular browser user.)

Unfortunately "mass scraping the internet for training data" and an "LLM powered user agent" get lumped together too much as "AI Crawlers". The user agent shouldn't actually be crawling.

replies(5): >>45942682 #>>45942689 #>>45942743 #>>45942744 #>>45943011 #
kijin ◴[] No.45942689[source]
How does a server tell an agent acting on behalf of a real person from the unwashed masses of scrapers? Do agents send a special header or token that other scrapers can't easily copy?

They get lumped together because they're more or less indistinguishable and cause similar problems: server load spikes, increased bandwidth, a bigger AWS bill... with no discernible benefit for the server operator, such as increased user engagement or ad revenue.

Now all automated requests are considered guilty until proven innocent. If you want your agent to be allowed, it's on you to prove that you're different. Maybe start by slowing down your agent so that it doesn't make requests any faster than the average human visitor would.
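
Something like this, for instance (a sketch in Python; the delay, user-agent, and contact URL are assumptions, not any standard): throttle to a human-ish pace and send an identifying User-Agent so the operator can contact or allowlist you.

    import time
    import urllib.request

    MIN_DELAY = 5.0      # seconds between requests -- assumed human-ish pace
    _last_request = 0.0

    def polite_get(url: str) -> bytes:
        """Issue at most one request every MIN_DELAY seconds and identify
        the agent so the site operator knows who to contact."""
        global _last_request
        wait = MIN_DELAY - (time.monotonic() - _last_request)
        if wait > 0:
            time.sleep(wait)
        _last_request = time.monotonic()
        req = urllib.request.Request(
            url,
            headers={"User-Agent": "example-agent/0.1 (+https://example.com/agent-info)"},  # placeholder
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.read()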