Back then, legitimate search engines had no incentive to scrape content that would just pollute their results with garbage data, so by and large they honored robots.txt and didn’t overwhelm upstream servers. Bad actors existed, of course, but they were rarely backed by companies valued in the billions of dollars.
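For reference, here is a minimal sketch of what “honoring robots.txt” looks like for a polite crawler, using Python’s standard library (the domain, path, and user-agent string are placeholders, not anything a specific crawler actually uses):

    # Minimal sketch: consult robots.txt before fetching a page.
    # example.com and "FriendlyCrawler/1.0" are placeholders.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the site's crawl rules

    # Only fetch the page if robots.txt allows this user agent on this path.
    if rp.can_fetch("FriendlyCrawler/1.0", "https://example.com/private/page.html"):
        print("allowed to crawl")
    else:
        print("robots.txt disallows this path; skipping")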
People training foundation models now have no such constraints or qualms: they want as many human-written sentences as possible, regardless of the context they’re extracted from. Couple that with widespread familiarity with ubiquitous residential proxy providers that can tunnel traffic through consumer connections worldwide, and you get an entirely different social contract, one we are still navigating.
edit: ah yes, another person above mentioned VPNs; that’s a good possibility. Another vector is that mobile users can sell the extra data they don’t use to 3rd parties. There are probably many more ways to acquire endpoints.
Of course, if you don’t care about affecting genuine users, it’s much simpler. One could write off the blocked humans as collateral damage and show them a message suggesting they boycott the companies and/or business practices that prompted these measures.
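A rough sketch of that blunt approach, assuming a trivial user-agent substring check stands in for whatever real bot heuristic a site would use (everything below is illustrative, not a recommendation of a specific detection method):

    # Block anything that "looks like a bot" and serve everyone caught,
    # including genuine users, a short explanatory message.
    # The user-agent check is a placeholder for a real heuristic.
    from wsgiref.simple_server import make_server

    BLOCK_MESSAGE = (b"Access limited due to aggressive scraping. "
                     b"Consider boycotting the companies whose crawlers made this necessary.")

    def app(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if "bot" in ua or "python-requests" in ua:  # placeholder heuristic
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [BLOCK_MESSAGE]
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"Normal content here."]

    if __name__ == "__main__":
        make_server("", 8000, app).serve_forever()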