597 points classichasclass | 3 comments
Etheryte ◴[] No.45010574[source]
One starts to wonder at what point it might actually be feasible to do it the other way around, by whitelisting IP ranges. I could see this happening as a community effort, similar to adblocker list curation etc.
replies(9): >>45010597 #>>45010603 #>>45010604 #>>45010611 #>>45010624 #>>45010757 #>>45010872 #>>45010910 #>>45010935 #
1. jampa ◴[] No.45010935[source]
The Pokémon Go company tried that shortly after launch to block scraping. I remember they had three categories of IPs:

- Blacklisted IPs (Google Cloud, AWS, etc.), which were always blocked

- Untrusted IPs (residential IPs), which were given some leeway but quickly hit 429s if they started querying too much

- Whitelisted IPs (IPv4 addresses legitimately shared by many people), essentially anything behind a CGNAT; for example, my current data plan tells me my IP is from 5 states over. (A sketch of this tiering follows the list.)
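
Roughly, that tiering could be expressed at a reverse proxy like this (a minimal nginx sketch, not how they actually implemented it; the ranges are RFC 5737 placeholders, and the zone name and rates are made up):

    # Classify clients into tiers; default (residential etc.) is untrusted.
    geo $ip_tier {
        default          untrusted;
        203.0.113.0/24   blocked;   # cloud/datacenter ranges, always denied
        198.51.100.0/24  trusted;   # CGNAT ranges shared by many people
    }

    # The trusted tier gets an empty key; nginx does not account
    # requests with an empty limit key, so they are never limited.
    map $ip_tier $limit_key {
        trusted    "";
        default    $binary_remote_addr;
    }

    limit_req_zone $limit_key zone=api:10m rate=30r/m;

    server {
        listen 80;
        location /api/ {
            if ($ip_tier = blocked) { return 403; }
            limit_req zone=api burst=10 nodelay;
            limit_req_status 429;   # answer sustained excess with 429
            proxy_pass http://127.0.0.1:8080;
        }
    }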

You can probably guess what happened next. Most scrapers were shut out, but the largest ones just bought modem device farms and ate the cost. They successfully prevented most users from scraping locally, but were quickly beaten by companies profiting from scraping.

I think this was one of many bad decisions Pokémon Go made. Some casual players dropped out because they didn't want to play without a map, while the hardcore players started paying for scraping services, which hammered the servers even more.

replies(2): >>45011393 #>>45011396 #
2. aorth ◴[] No.45011393[source]
I have a similar ad hoc system, made up of three lists of networks: known good, known bad, and data center. These feed a geo map in nginx that rate limits various expensive routes in my application.

The known good list is IPs and ranges I know are good. The known bad list is specific bad actors. The data center networks list is updated periodically based on a list of ASNs belonging to data centers.
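
The shape of that config is roughly this (a sketch; the file paths, zone name, and rates are made up, and each included file holds lines mapping a network to a class, e.g. "203.0.113.0/24 datacenter;"):

    geo $net_class {
        default   unknown;
        include   /etc/nginx/lists/known_good.conf;   # entries -> good
        include   /etc/nginx/lists/known_bad.conf;    # entries -> bad
        include   /etc/nginx/lists/datacenters.conf;  # entries -> datacenter
    }

    # Known-good clients get an empty key and are never limited.
    map $net_class $expensive_key {
        good      "";
        default   $binary_remote_addr;
    }

    limit_req_zone $expensive_key zone=expensive:10m rate=6r/m;

    server {
        location /search {
            if ($net_class = bad) { return 403; }
            limit_req zone=expensive burst=3;
            limit_req_status 429;
            proxy_pass http://127.0.0.1:8080;
        }
    }

Worth noting: when networks overlap, nginx's geo module resolves to the most specific match, which interacts badly with the huge ASN-derived ranges described next.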

There are a lot of problems with using ASNs, even for well-known data center operators. First, their announcements change frequently. Second, they often include massive subnets like /13(!), which can apparently overlap with routes announced by other networks, causing false positives. Third, I had been merging networks (to avoid overlaps causing problems in nginx) with something like https://github.com/projectdiscovery/mapcidr, but found that merging also produced larger ranges that swept in adjacent networks where some legitimate users apparently are. Lastly, I had seen suspicious traffic from operators like CATO Networks Ltd and ZScaler, which are enterprise security products that route client traffic through their clouds; blocking those resulted in some angry users in places I didn't expect...

And none of this accounts for the residential ISP proxies that bots use to appear like legitimate users: https://www.trendmicro.com/vinfo/us/security/news/vulnerabil....

3. gunalx ◴[] No.45011396[source]
This really seems like they did everything they could and still got abused by borderline criminal activity from scrapers. But I do think it had an impact on scraping. It is a matter of attrition: raising the cost so it hurts more to scrape. The problem can never fully go away, because at some point the scrapers can just start paying regular users to collect the data.