198 points todsacerdoti | 1 comment
SquareWheel No.45942060
That may work for blocking bad automated crawlers, but an agent acting on behalf of a user wouldn't follow robots.txt. They'd run the risk of hitting the bad URL when trying to understand the page.
replies(2): >>45942461 #>>45942729 #
klodolph No.45942461
That sounds like the desired outcome here. Your agent should respect robots.txt, OR it should be designed to not follow links.
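As a rough illustration of the first option (checking robots.txt before following a link), here is a minimal Python sketch using the standard library's urllib.robotparser. The helper name may_fetch, the agent name example-agent, the example.com URLs, and the /trap/ rule are all made up for illustration; a real agent would download the site's actual robots.txt rather than use an inline string.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that fences off a trap URL for every agent.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /trap/
"""

if __name__ == "__main__":
    for link in ("https://example.com/index.html", "https://example.com/trap/page"):
        verdict = "follow" if may_fetch(EXAMPLE_ROBOTS, "example-agent", link) else "skip"
        print(link, verdict)
```

An agent that runs a check like this before following each link found on a page would skip the disallowed URL and still read everything else.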
replies(1): >>45942613 #
varenc No.45942613
An agent acting on my behalf, following my specific and narrowly scoped instructions, should not have to obey robots.txt because it's not a robot/crawler, just as a single cURL request shouldn't have to follow robots.txt. (It also shouldn't generate any more traffic than a regular browser user.)

Unfortunately, "mass scraping the internet for training data" and an "LLM-powered user agent" get lumped together too often as "AI crawlers". The user agent shouldn't actually be crawling.

replies(5): >>45942682 #>>45942689 #>>45942743 #>>45942744 #>>45943011 #
mcv No.45942743
If it's a robot, it should follow robots.txt. And if it's following invisible links, it's clearly crawling.

Sure, a bad site could use this to screw with people, but bad sites have done that since forever in various ways. But if this technique helps against malicious crawlers, I think it's fair. The only downside I can see is that Google might mark you as a malware site. But again, they should be obeying robots.txt.

replies(2): >>45942854 #>>45942890 #
1. droopyEyelids No.45942890{3}
Your web browser is a robot, and always has been. Even using netcat to manually type your GET request involves a robot in some sense, as you have a machine translating your ASCII and moving it between computers.

The significant difference isn't in whether a robot is doing the actions for you or not, it's whether the robot is a user agent for a human or not.