
198 points todsacerdoti | 5 comments
DeepYogurt ◴[] No.45942404[source]
Has anyone done a talk/blog/whatever on how LLM crawlers are different from classical crawlers? I'm not up on the difference.
replies(5): >>45942457 #>>45942733 #>>45942771 #>>45942875 #>>45946525 #
btown ◴[] No.45942875[source]
IMO there was something of a de facto contract, pre-LLMs, that the set of things one would publicly mirror/excerpt/index and the set of things one would scrape were one and the same.

Back then, legitimate search engines wouldn’t want to scrape things that would just make their search results less relevant with garbage data anyways, so by and large they would honor robots.txt and not overwhelm upstream servers. Bad actors existed, of course, but were very rarely backed by companies valued in the billions of dollars.

People training foundation models now have no such constraints or qualms - they need as many human-written sentences as possible, regardless of the context in which they are extracted. That’s coupled with a broader familiarity with ubiquitous residential proxy providers that can tunnel traffic through consumer connections worldwide. That’s an entirely different social contract, one we are still navigating.

replies(3): >>45943046 #>>45943240 #>>45943282 #
cwbriscoe ◴[] No.45943240[source]
I am not well versed in this problem but can't the web servers rate limit by known IP addresses of these crawler/scrapers?
replies(4): >>45943317 #>>45943344 #>>45943570 #>>45943597 #
strogonoff ◴[] No.45943344[source]
You cannot block LLM crawlers by IP address, because some of them use residential proxies. Source: 1) a friend admins a slightly popular site and has decent bot detection heuristics, 2) just Google “residential proxy LLM”, they are not exactly hiding. Strip-mining original intellectual property for commercial usage is big business.
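
To make the point concrete, here is a minimal sketch (not anyone's production setup) of a per-IP sliding-window rate limiter, and of how a crawler that rotates through a proxy pool stays under it. The class `PerIpRateLimiter`, the helper `count_allowed`, the limit of 10 requests per 60 seconds, and the documentation-range IP addresses are all made up for illustration.

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Sliding window: at most `limit` requests per `window` seconds per IP."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        # ip -> timestamps of recently allowed requests
        self._hits = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


def count_allowed(limiter: PerIpRateLimiter, ips: list[str], total_requests: int) -> int:
    """Send one request every 0.01 s, rotating through `ips`; count how many pass."""
    allowed = 0
    for i in range(total_requests):
        if limiter.allow(ips[i % len(ips)], now=i * 0.01):
            allowed += 1
    return allowed


# One crawler hammering from a single address (hypothetical IP).
single_ip = count_allowed(PerIpRateLimiter(limit=10, window=60.0), ["203.0.113.7"], 1000)

# The same 1000 requests spread over 200 hypothetical residential addresses.
proxy_pool = [f"198.51.100.{n}" for n in range(1, 201)]
many_ips = count_allowed(PerIpRateLimiter(limit=10, window=60.0), proxy_pool, 1000)

print(single_ip)  # 10
print(many_ips)   # 1000
```

With a single source address the limiter lets through 10 of 1000 requests. Spread over 200 addresses, each address only sends 5 requests, so all 1000 pass, and each address looks like an ordinary home user.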
replies(2): >>45943398 #>>45943472 #
1. skrebbel ◴[] No.45943398[source]
How does this work? Why would people let randos use their home internet connections? I googled it but the companies selling these services are not exactly forthcoming on how they obtained their "millions of residential IP addresses".

Are these botnets? Are AI companies mass-funding criminal malware companies?

replies(3): >>45943441 #>>45943478 #>>45943557 #
2. stackghost ◴[] No.45943441[source]
>Are these botnets? Are AI companies mass-funding criminal malware companies?

Without a doubt some of them are botnets. AI companies got their initial foothold by violating copyright en masse with pirated textbook dumps for training data, and whatnot. Why should they suddenly develop scruples now?

3. joha4270 ◴[] No.45943478[source]
I have seen it claimed that it's a way of monetizing free phone apps. Just bundle a proxy and get paid for that.
replies(1): >>45943539 #
4. cuu508 ◴[] No.45943539[source]
A recent HN thread about this: https://news.ycombinator.com/item?id=45746156
5. fakwandi_priv ◴[] No.45943557[source]
It used to be Hola VPN, which would let you use someone else’s connection and, in the same way, let someone else use yours, which was communicated transparently. That same Hola client would also route business users. I'm sure many other free VPN clients do the same thing nowadays.