←back to thread

198 points todsacerdoti | 1 comments | | HN request time: 0.26s | source
Show context
DeepYogurt ◴[] No.45942404[source]
Has anyone done a talk/blog/whatever on how llm crawlers are different than classical crawlers? I'm not up on the difference.
replies(5): >>45942457 #>>45942733 #>>45942771 #>>45942875 #>>45946525 #
1. superkuh ◴[] No.45942771[source]
Recently there have been more crawlers coming from tens to hundreds of IP netblocks from dozens (or more!) of ASN in highly time and URL correlated fashion with spoofed user-agent(s) and no regard for rate or request limiting or robots.txt. These attempt to visit every possible permutation of URLs on the domain and have a lot of bandwidth and established tcp connections available to them. It's not that this didn't happen pre-2023 but it's noticably more common now. If you have a public webserver you've probably experienced it at least once.

Actual LLM involvement as the requesting user-agent is vanishingly small. It's the same problem as ever: corporations, their profit motive during $hypecycle coupled with access to capital for IT resources, and the protection of the abusers via the company's abstraction away of legal liability for their behavior.