198 points todsacerdoti | 2 comments
DeepYogurt ◴[] No.45942404[source]
Has anyone done a talk/blog/whatever on how LLM crawlers differ from classical crawlers? I'm not up on the difference.
replies(5): >>45942457 #>>45942733 #>>45942771 #>>45942875 #>>45946525 #
btown ◴[] No.45942875[source]
IMO there was something of a de facto contract, pre-LLMs, that the set of things one would publicly mirror/excerpt/index and the set of things one would scrape were one and the same.

Back then, legitimate search engines wouldn’t want to scrape things that would just make their search results less relevant with garbage data anyways, so by and large they would honor robots.txt and not overwhelm upstream servers. Bad actors existed, of course, but were very rarely backed by companies valued in the billions of dollars.

People training foundation models now have no such constraints or qualms - they need as many human-written sentences as possible, regardless of the context in which they are extracted. That’s coupled with a broader familiarity with ubiquitous residential proxy providers that can tunnel traffic through consumer connections worldwide. That’s an entirely different social contract, one we are still navigating.

replies(3): >>45943046 #>>45943240 #>>45943282 #
cwbriscoe ◴[] No.45943240[source]
I am not well versed in this problem, but can't web servers rate-limit the known IP addresses of these crawlers/scrapers?
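
To make the question concrete, here is a minimal sketch of per-IP rate limiting using a token bucket. Everything in it is an assumption for illustration: the names `TokenBucket` and `PerIPLimiter`, the default rate and burst size, and the in-memory dictionary are hypothetical and not taken from any particular web server.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TokenBucket:
    """Allows `rate` requests per second, with bursts of up to `capacity`."""
    rate: float
    capacity: float
    tokens: float
    updated: float

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerIPLimiter:
    """Keeps one bucket per client IP (hypothetical, in-memory only)."""

    def __init__(self, rate: float = 1.0, capacity: float = 5.0) -> None:
        self.rate = rate
        self.capacity = capacity
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        if now is None:
            now = time.monotonic()
        bucket = self.buckets.get(ip)
        if bucket is None:
            # A previously unseen address starts with a full bucket.
            bucket = TokenBucket(self.rate, self.capacity, self.capacity, now)
            self.buckets[ip] = bucket
        return bucket.allow(now)
```

Note that every new address starts with a full bucket, so a limiter like this only throttles clients that keep reusing the same IP. Traffic spread across many addresses, such as the residential proxies mentioned upthread, would pass largely unthrottled.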
replies(4): >>45943317 #>>45943344 #>>45943570 #>>45943597 #
Yoric ◴[] No.45943570[source]
Not the exact same problem, but a few months ago, I tried to block YouTube traffic from my home by IP (I was writing a parental-control app for my child). After a few hours of trying to collect IPs, I gave up, realizing that YouTube was dynamically load-balanced across millions of IPs, some of which also served traffic from other Google services I didn't want to block.

I wouldn't be surprised if it was the same with LLMs. Millions of workers allocated dynamically on AWS, with varying IPs.

In my specific case, as I was dealing with browser-initiated traffic, I wrote a Firefox add-on instead. No such shortcut for web servers, though.

replies(2): >>45943593 #>>45947250 #
bonsai_spool ◴[] No.45943593[source]
Why not run a local DNS at your router and do the block there? It can even be per-client with AdGuard Home.
replies(1): >>45943661 #
Yoric ◴[] No.45943661[source]
I did that, but my router doesn't offer a documented API (or even SSH access) that I can use to reprogram DNS blocks dynamically. I wanted to stop YouTube only during homework hours, so enabling/disabling it a few times per day quickly became tiresome.
replies(1): >>45944806 #
extra88 ◴[] No.45944806[source]
Your router almost certainly lets you assign a DNS server instead of using whatever your ISP sends down, so you can point it at an internal device running your own DNS.

Your DNS mostly passes lookup requests through, but during homework time, when there's a request for the IP of "www.youtube.com", it returns an IP of your choice instead of the actual one. The domain's TTL is 5 minutes.
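
A rough sketch of that decision logic, independent of any particular DNS server software. The homework window, the sinkhole address, and the names `BLOCKED_NAMES` and `override_for` are all assumptions made up for illustration; a real setup would hook something like this into whatever resolver runs on the internal device.

```python
import datetime as dt
from typing import Optional

# Hypothetical settings, chosen only for this example.
BLOCKED_NAMES = {"www.youtube.com"}
HOMEWORK_START = dt.time(16, 0)
HOMEWORK_END = dt.time(18, 0)
SINKHOLE_IP = "0.0.0.0"  # stands in for "the IP of your choice"


def override_for(name: str, now: dt.datetime) -> Optional[str]:
    """Return a replacement IP for this lookup, or None to forward it upstream."""
    # DNS names are case-insensitive and may carry a trailing root dot.
    normalized = name.rstrip(".").lower()
    in_window = HOMEWORK_START <= now.time() < HOMEWORK_END
    if in_window and normalized in BLOCKED_NAMES:
        return SINKHOLE_IP
    return None
```

This only matches the exact names listed, so other hostnames used by the same service would need their own entries. Clients may also keep using a cached answer until its TTL expires, so the block would not necessarily kick in the instant the window opens.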

Or don't, technical solutions to social problems are of limited value.

replies(1): >>45945344 #
Yoric ◴[] No.45945344[source]
Any solution based on this sounds monstrously more complicated than my browser add-on.

And technical bandaids to hyperactivity, however imperfect, are damn useful.

replies(2): >>45945464 #>>45946643 #
extra88 ◴[] No.45945464[source]
A browser add-on wouldn't do the job. The use case was a parent controlling a child's behavior, not someone controlling their own.
replies(1): >>45945517 #
1. Yoric ◴[] No.45945517[source]
Yes, my kid has ADHD. The browser add-on does the job at slowing down the impulse of going to YouTube (and a few online gaming sites) during homework hours.

I've deployed the same one for myself, but set up for Reddit during work hours.

Both of us know how to get around the add-on. It's not particularly hard. But since Firefox is the primary browser for both of us, it does the trick.

replies(1): >>45946704 #
2. FrinkleFrankle ◴[] No.45946704[source]
For those who don't want to build their own add-on, Cold Turkey Blocker works quite well. It supports multiple browsers and can block apps too.

I'm not affiliated with them, but it has helped me when I really need to focus.

https://getcoldturkey.com/