
198 points todsacerdoti | 2 comments
DeepYogurt No.45942404
Has anyone done a talk/blog/whatever on how LLM crawlers are different from classical crawlers? I'm not up on the difference.
replies(5): >>45942457 #>>45942733 #>>45942771 #>>45942875 #>>45946525 #
btown No.45942875
IMO there was something of a de facto contract, pre-LLMs, that the set of things one would publicly mirror/excerpt/index and the set of things one would scrape were one and the same.

Back then, legitimate search engines wouldn't want to scrape things that would just pollute their search results with garbage data anyway, so by and large they honored robots.txt and didn't overwhelm upstream servers. Bad actors existed, of course, but were very rarely backed by companies valued in the billions of dollars.
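For concreteness, honoring robots.txt is a simple check that Python's standard library supports directly. A minimal sketch of the pre-LLM convention, using hypothetical rules and a hypothetical "ExampleBot" user agent:

```python
from urllib import robotparser

# Hypothetical robots.txt rules for illustration: disallow /private/
# and ask crawlers to wait 10 seconds between requests.
rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 10
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# A polite crawler checks before fetching, and respects the delay.
print(rp.can_fetch("ExampleBot", "https://example.com/public/page"))   # True
print(rp.can_fetch("ExampleBot", "https://example.com/private/page"))  # False
print(rp.crawl_delay("ExampleBot"))                                    # 10
```

The whole protocol is advisory: nothing enforces it server-side, which is exactly why it only worked as long as crawler incentives aligned with publishers'.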

People training foundation models now have no such constraints or qualms: they need as many human-written sentences as possible, regardless of the context from which they are extracted. That's coupled with a broader familiarity with ubiquitous residential proxy providers that can tunnel traffic through consumer connections worldwide. It's an entirely different social contract, one we are still navigating.

replies(3): >>45943046 #>>45943240 #>>45943282 #
1. wredcoll No.45943282
For all its sins, Google had a vested interest in the sites it was linking to staying alive. LLMs don't.
replies(1): >>45943575 #
2. eric-burel No.45943575
That's a shortcut: LLM providers are very short-sighted, but not to that extreme. Live websites are needed to produce new data for future training runs. Edit: damn, I've seen this movie before