Back then, legitimate search engines had no reason to scrape pages that would only fill their results with garbage data, so by and large they honored robots.txt and did not overwhelm upstream servers. Bad actors existed, of course, but were very rarely backed by companies valued in the billions of dollars.
People training foundation models now have no such constraints or qualms - they need as many human-written sentences as possible, regardless of the context in which they are extracted. That’s coupled with a broader familiarity with ubiquitous residential proxy providers that can tunnel traffic through consumer connections worldwide. That’s an entirely different social contract, one we are still navigating.
Are these botnets? Are AI companies mass-funding criminal malware companies?
Without a doubt some of them are botnets. AI companies got their initial foothold by violating copyright en masse with pirated textbook dumps for training data, and whatnot. Why should they suddenly develop scruples now?
edit: ah yes, another person above mentioned VPNs, that's a good possibility. Another vector is mobile users selling the data allowance they don't use to third parties. There are probably many more ways to acquire endpoints.
I wouldn't be surprised if it was the same with LLMs. Millions of workers allocated dynamically on AWS, with varying IPs.
In my specific case, as I was dealing with browser-initiated traffic, I wrote a Firefox add-on instead. No such shortcut for web servers, though.
Your DNS server mostly passes lookup requests through, but during homework time, when there's a request for the IP of "www.youtube.com", it returns an IP of your choice instead of the actual one. The domain's TTL is 5 minutes.
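For anyone who wants to try it, here is a rough sketch of just the decision logic, not a working DNS server. The homework window, the sinkhole address, and names like `answer` and `upstream_lookup` are all made up for illustration; you would plug this into whatever resolver you actually run.

```python
from datetime import datetime, time

# Hypothetical settings: none of these values come from the comment above,
# except the 5-minute TTL.
BLOCKED_HOSTS = {"www.youtube.com"}
HOMEWORK_START = time(16, 0)   # assumed start of homework time
HOMEWORK_END = time(18, 0)     # assumed end of homework time
SINKHOLE_IP = "192.0.2.1"      # documentation-only address, stands in for "the ip of your choice"
OVERRIDE_TTL = 300             # 5 minutes, so the override ages out of caches quickly


def in_homework_time(now: datetime) -> bool:
    """True when `now` falls inside the assumed homework window."""
    return HOMEWORK_START <= now.time() < HOMEWORK_END


def answer(hostname: str, now: datetime, upstream_lookup):
    """Return (ip, ttl) for a lookup.

    `upstream_lookup` is a hypothetical callable that performs the real
    resolution and returns (ip, ttl); every non-blocked request goes to it.
    """
    name = hostname.rstrip(".").lower()  # DNS names are case-insensitive
    if name in BLOCKED_HOSTS and in_homework_time(now):
        return SINKHOLE_IP, OVERRIDE_TTL
    return upstream_lookup(name)
```

Passing the clock and the upstream resolver in as arguments keeps the rule easy to test without touching the network. The short TTL matters in both directions: clients stop using the real address a few minutes into homework time and pick it up again a few minutes after.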
Or don't; technical solutions to social problems are of limited value.
I've deployed the same one for myself, but set up for Reddit during work hours.
Both of us know how to get around the add-on. It's not particularly hard. But since Firefox is the primary browser for both of us, it does the trick.
I'm not affiliated with them, but it has helped me when I really need to focus.
In this case, I don't have a server I can conveniently use as DNS. Plus I wanted to also control the launching of some binaries, so that would considerably complicate the architecture.
Maybe next time :)
Of course, if you don’t care about affecting genuine users, then it is much simpler. One could call it collateral damage and show a message suggesting a boycott of the companies and/or business practices that prompted these measures.