Back then, legitimate search engines had no incentive to scrape content that would just pollute their results with garbage data, so by and large they honored robots.txt and didn’t overwhelm upstream servers. Bad actors existed, of course, but they were rarely backed by companies valued in the billions of dollars.
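For reference, here is a minimal sketch of what “honoring robots.txt” looks like for a polite crawler, using Python’s standard library (the domain, path, and user-agent string are placeholders, not anything a specific crawler actually uses):

    # Minimal sketch: consult robots.txt before fetching a page.
    # example.com and "FriendlyCrawler/1.0" are placeholders.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the site's crawl rules

    # Only fetch the page if robots.txt allows this user agent on this path.
    if rp.can_fetch("FriendlyCrawler/1.0", "https://example.com/private/page.html"):
        print("allowed to crawl")
    else:
        print("robots.txt disallows this path; skipping")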
People training foundation models now have no such constraints or qualms: they want as many human-written sentences as possible, regardless of the context they’re extracted from. Couple that with widespread familiarity with ubiquitous residential proxy providers that can tunnel traffic through consumer connections worldwide, and you get an entirely different social contract, one we are still navigating.
edit: ah yes, another person above mentioned VPNs; that’s a good possibility. Another vector is that mobile users can sell the extra data they don’t use to 3rd parties. There are probably many more ways to acquire endpoints.
Of course, if you don’t care about affecting genuine users, it’s much simpler. One could write off the blocked humans as collateral damage and show them a message suggesting they boycott the companies and/or business practices that prompted these measures.
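A rough sketch of that blunt approach, assuming a trivial user-agent substring check stands in for whatever real bot heuristic a site would use (everything below is illustrative, not a recommendation of a specific detection method):

    # Block anything that "looks like a bot" and serve everyone caught,
    # including genuine users, a short explanatory message.
    # The user-agent check is a placeholder for a real heuristic.
    from wsgiref.simple_server import make_server

    BLOCK_MESSAGE = (b"Access limited due to aggressive scraping. "
                     b"Consider boycotting the companies whose crawlers made this necessary.")

    def app(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if "bot" in ua or "python-requests" in ua:  # placeholder heuristic
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [BLOCK_MESSAGE]
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"Normal content here."]

    if __name__ == "__main__":
        make_server("", 8000, app).serve_forever()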