Unfortunately "mass scraping the internet for training data" and an "LLM powered user agent" get lumped together too much as "AI Crawlers". The user agent shouldn't actually be crawling.
How does this make you any different from the bad-faith LLM actors they are trying to block?
They get lumped together because they're more or less indistinguishable and cause similar problems: server load spikes, increased bandwidth, increased AWS bill ... with no discernible benefit for the server operator such as increased user engagement or ad revenue.
Now all automated requests are considered guilty until proven innocent. If you want your agent to be allowed, it's on you to prove that you're different. Maybe start by slowing down your agent so that it doesn't make requests any faster than the average human visitor would.
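That kind of throttle is only a few lines in Python. A sketch, assuming a simple single-threaded fetcher; `MIN_DELAY` and `throttle` are made-up names, and 3 seconds is just a guess at "average human" pacing:

```python
import time

# Illustrative pacing; tune to whatever matches a human clicking around.
MIN_DELAY = 3.0  # seconds between outbound requests

_last_request = 0.0  # monotonic timestamp of the previous request


def throttle() -> None:
    """Block until at least MIN_DELAY has passed since the last call."""
    global _last_request
    wait = MIN_DELAY - (time.monotonic() - _last_request)
    if wait > 0:
        time.sleep(wait)
    _last_request = time.monotonic()


# Hypothetical usage: call throttle() right before every fetch, e.g.
#   throttle()
#   page = urllib.request.urlopen(url).read()
```

Not a real rate limiter (no jitter, no per-host buckets), but it's enough to stop an agent from hammering one site faster than a person would.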
Sure, a bad site could use this to screw with people, but bad sites have done that since forever in various ways. But if this technique helps against malicious crawlers, I think it's fair. The only downside I can see is that Google might mark you as a malware site. But again, they should be obeying robots.txt.
This is not banning you for following <h1><a>Today's Weather</a></h1>
If a robot is so poorly coded that it follows links it clearly shouldn't, links that are explicitly enumerated as not to be followed, that's a problem. From an operator's perspective, how is this different from the case you described?
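For context, the trap being discussed looks roughly like this; the paths and rules here are made up:

```
# robots.txt -- compliant crawlers never request anything under /trap/
User-agent: *
Disallow: /trap/

<!-- hidden link embedded in pages; no human ever sees or clicks it,
     so any client that fetches /trap/bait can be assumed to be a
     crawler ignoring robots.txt, and gets banned -->
<a href="/trap/bait" style="display:none">do not follow</a>
```

A well-behaved crawler reads the disallow rule and never trips it; a human never renders the link. Only a bot that both crawls and ignores robots.txt ends up in the trap.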
If a googler kicked off the googlebot manually from a session every morning, should they not respect robots.txt either?
The line gets blurrier with things like OAI's Atlas browser. It's just re-skinned Chromium, a regular browser, except you can ask an LLM about the content of the page you just navigated to. The decision to use an LLM on that page is made after the page load. Doing the same thing without rendering the page doesn't seem meaningfully different.
In general, robots.txt is for headless automated crawlers fetching many pages, not for software performing a specific request for a user. If there's a 1:1 mapping between a user's request and a page load, then it's not a robot. An LLM powered user agent (browser) wouldn't follow invisible links, or any links at all, because it's not crawling.
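And for anything that *is* crawling, honoring robots.txt costs almost nothing; Python's stdlib does the parsing. The agent name and rules below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# parse() takes the file's lines directly, so no network fetch is needed
# for this demo; real code would call set_url(...) and read() instead.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /trap/",
])

print(rp.can_fetch("MyAgent/1.0", "https://example.com/trap/bait"))  # False
print(rp.can_fetch("MyAgent/1.0", "https://example.com/weather"))    # True
```

Two lines of setup and one check per URL; there's no excuse for a crawler to skip it.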
The significant difference isn't in whether a robot is doing the actions for you or not, it's whether the robot is a user agent for a human or not.
Just trying to make the point that an LLM powered user agent fetching a single page at my request isn't a robot.