253 points akyuu | 37 comments
1. BinaryIgor ◴[] No.45945045[source]
I wonder why we have seen an increase in these automated scrapers and attacks of late (over the last few years); is there better (open-source?) technology that enables it? Is it because hosting infrastructure is cheaper for the attackers too? Both? Something else?

Maybe the long-term solution for such attacks is to hide most of the internet behind some kind of Proof of Work system/network, so that mostly humans get access to our websites, not machines.
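
As a rough illustration of what a proof-of-work gate involves, here is a hashcash-style sketch in Python. The function names, the 8-byte nonce encoding, and the difficulty value are assumptions made for illustration, not a scheme proposed in the thread. The server hands out a challenge, the client spends CPU time finding a nonce, and the server checks the answer with a single hash.

```python
import hashlib
import itertools


def _meets_target(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    # A digest "wins" when its top `difficulty_bits` bits are all zero.
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))


def solve(challenge: bytes, difficulty_bits: int) -> int:
    """Client side: brute-force a nonce. Expected work is about 2**difficulty_bits hashes."""
    for nonce in itertools.count():
        if _meets_target(challenge, nonce, difficulty_bits):
            return nonce


def verify(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Server side: one hash, regardless of how long the client searched."""
    return _meets_target(challenge, nonce, difficulty_bits)
```

Each extra difficulty bit doubles the expected client work while verification stays at one hash. Note that the cost falls on every client, human or automated, so this only raises the price of bulk access rather than telling the two apart.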

replies(6): >>45945393 #>>45945467 #>>45945584 #>>45945643 #>>45945917 #>>45945959 #
2. trenchpilgrim ◴[] No.45945393[source]
Using AI you can write a naive scraper in minutes and there's now a market demand for cleaned up and structured data.
3. marginalia_nu ◴[] No.45945467[source]
What's missing is effective international law enforcement. This is a legal problem first and foremost. As long as it's as easy as it is to get away with this stuff by just routing the traffic through a Russian or Singaporean node, it's going to keep happening. With international diplomacy going the way it has been, odds of that changing aren't fantastic.

The web is really stuck between a rock and a hard place when it comes to this. Proof of work helps website owners, but makes life harder for all discovery tools and search engines.

An independent standard for request signing and building some sort of reputation database for verified crawlers could be part of a solution, though that causes problems with websites feeding crawlers different content than users, and does nothing to fix the Sybil attack problem.
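
To make the request-signing idea concrete, here is a minimal sketch. It uses an HMAC over a shared secret purely because that is available in the Python standard library; a real standard would more plausibly use public-key signatures so that sites never hold a crawler's secret. The message layout, the `crawler_id` field, and the 300-second skew window are all assumptions for illustration.

```python
import hashlib
import hmac


def sign_request(secret: bytes, crawler_id: str, method: str, path: str, timestamp: int) -> str:
    # Bind the signature to who is asking, what they ask for, and when.
    message = f"{crawler_id}\n{method}\n{path}\n{timestamp}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_request(secret: bytes, crawler_id: str, method: str, path: str,
                   timestamp: int, signature: str, now: int, max_skew: int = 300) -> bool:
    # Reject stale or future-dated requests to limit replay.
    if abs(now - timestamp) > max_skew:
        return False
    expected = sign_request(secret, crawler_id, method, path, timestamp)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature)
```

A signature like this only proves that a request came from the holder of a key. It says nothing about whether one operator holds many keys, which is the Sybil problem mentioned above.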

replies(4): >>45945725 #>>45945809 #>>45945986 #>>45946661 #
4. rkagerer ◴[] No.45945584[source]
> long-term solution

How about a reputation system?

Attached to IP address is easiest to grok, but wouldn't work well since addresses lack affinity. OK, so we introduce an identifier that's persistent, and maybe a user can even port it between devices. Now it's bad for privacy. How about a way a client could prove their reputation is above some threshold without leaking any identifying information? And a decentralized way for the rest of the internet to influence their reputation (like when my server feels you're hammering it)?

Do anti-DDoS intermediaries like Cloudflare basically catalog a spectrum of reputation at the ASN level (pushing anti-abuse onus to ISP's)?

This is basically what happened to email/SMTP, for better or worse :-S.

replies(2): >>45945700 #>>45945797 #
5. hnthrowaway0315 ◴[] No.45945643[source]
I guess it is just because 1) They can, and 2) Everyone wants some data. I think it would be interesting if every website out there starts to push out BS pages just for scrapers. Not sure how much extra cost it's going to take if a website puts up say 50% BS pages that only scrapers can reach, or BS material with extremely small fonts hidden in regular pages that ordinary people cannot see.
replies(1): >>45945694 #
6. inerte ◴[] No.45945694[source]
Something like https://blog.cloudflare.com/ai-labyrinth/ ?
replies(1): >>45950847 #
7. JimDabell ◴[] No.45945700[source]
Reputation plus privacy is probably unsolvable; the whole point of reputation is knowing what people are doing elsewhere. You don’t need reputation, you need persistence. You don’t need to know if they are behaving themselves elsewhere on the Internet as long as you can ban them once and not have them come back.

Services need the ability to obtain an identifier that:

- Belongs to exactly one real person.

- That a person cannot own more than one of.

- That is unique per-service.

- That cannot be tied to a real-world identity.

- That can be used by the person to optionally disclose attributes like whether they are an adult or not.
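
Two of the properties in the list above (unique per service, and not linkable across services) can be sketched with a keyed hash. This is only an illustration under assumptions: `person_secret` is a hypothetical per-person secret, and the sketch does nothing to guarantee that a person holds exactly one such secret, which is the hard part and would need some trusted issuer.

```python
import hashlib
import hmac


def service_identifier(person_secret: bytes, service: str) -> str:
    """Derive a stable pseudonym for one person at one service.

    The same person always gets the same identifier at a given service,
    so a ban sticks. Without the secret, identifiers at two different
    services cannot be matched to each other.
    """
    return hmac.new(person_secret, service.encode(), hashlib.sha256).hexdigest()
```

For example, `service_identifier(secret, "forum.example")` and `service_identifier(secret, "shop.example")` give unrelated-looking values, while repeated calls for the same service give the same one.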

Services generally don’t care about knowing your exact identity but being able to ban a person and not have them simply register a new account, and being able to stop people from registering thousands of accounts would go a long way towards wiping out inauthentic and abusive behaviour.

The ability to “reset” your identity is the underlying hole that enables a vast amount of abuse. It’s possible to have persistent, pseudonymous access to the Internet without disclosing real-world identity. Being able to permanently ban abusers from a service would have a hugely positive effect on the Internet.

replies(3): >>45945753 #>>45945857 #>>45946182 #
8. luckylion ◴[] No.45945725[source]
It's not necessarily going through a Russian or Singaporean node though, on the sites I'm responsible for, AWS, GCP, Azure are in the top 5 for attackers. It's just that they don't care _at all_ about that happening.

I don't think you need world-wide law-enforcement, it'll be a big step ahead if you make owners & operators liable. You can limit exposure so nobody gets absolutely ruined, but anyone running wordpress 4.2 and getting their VPS abused for attacks currently has 0 incentive to change anything unless their website goes down. Give them a penalty of a few hundred dollars and suddenly they do. To keep things simple, collect from the hosters, they can then charge their customers, and suddenly they'll be interested in it as well, because they don't want to deal with that.

The criminals are not held liable, and neither are their enablers. There's very little chance anything will change that way.

replies(1): >>45946156 #
9. jasonjayr ◴[] No.45945753{3}[source]
A digital "Death penalty" is not a win for society, without considering a fair way to atone for "crimes against your digital identity".

It would be way too easy for the current regime (whoever that happens to be) to criminalize random behaviors (Trans People? Atheists? Random nationality?) to ban their identity, and then they can't apply for jobs, get bus fare, purchase anything online, communicate with their lawyers, etc.

replies(2): >>45945924 #>>45946411 #
10. gmuslera ◴[] No.45945797[source]
It's ironic to use reputation system for this.

20+ years ago there were mail blacklists that basically blocked residential IP blocks, as there should not be servers trying to send normal mail from there. Now you must try the opposite: blacklist blocks where only servers and not end users can come from, as there are potentially badly behaved scrapers in all major clouds and server hosting platforms.

But then there are residential proxies that pay end users to route requests from misbehaved companies, so that door is also a bad mitigation

replies(1): >>45946258 #
11. Aurornis ◴[] No.45945809[source]
> What's missing is effective international law enforcement.

International law enforcement on the Internet would also subject you to the laws of other countries. It goes both ways.

Having to comply with all of the speech laws and restrictions in other countries is not actually something you want.

replies(2): >>45945922 #>>45946229 #
12. hombre_fatal ◴[] No.45945857{3}[source]
If creating an identity has a cost, then why not allow people to own multiple identities? Might help on the privacy front and address the permadeath issue.

Of course everything sounds plausible when speaking at such a high level.

replies(2): >>45946204 #>>45946316 #
13. Vegenoid ◴[] No.45945917[source]
I'm pretty sure it is the commercial demand for data from AI companies. It is certainly the popular conception among sysadmins that it is AI companies who are responsible for the wave of scrapers over the past few years, and I see no compelling alternative.
replies(1): >>45946032 #
14. ocdtrekkie ◴[] No.45945922{3}[source]
This is already kind of true with every global website, the idea of a single global internet is one of those fairy tale fantasy things, that maybe happened for a little bit before enough people used it. In many cases it isn't really ideal today.
15. ◴[] No.45945924{4}[source]
16. EGreg ◴[] No.45945959[source]
Why? It’s because of AI. It enables attacks at scale. It enables more people to attack, who previously couldn’t. And so on.

It’s very explainable. And somehow, like clockwork, there are always comments to say “there is nothing new, the Internet has always been like this since the 80s”.

You know, part of me wants to see AI proliferate into more and more areas, just so these people will finally wake up eventually and understand there is a huge difference when AI does it. When they are relentlessly bombarded with realistic phone calls from random numbers, with friends and family members calling about the latest hoax and deepfake, when their own specific reputation is constantly attacked and destroyed by 1000 cuts not just online but in their own trusted circles, and they have to put out fires and play whack-a-mole with an advanced persistent threat that only grows larger and always comes from new sources, anonymous and not.

And this is all before bot swarms that can coordinate and plan long-term, targeting specific communities and individuals.

And this is all before humanoid robots and drones proliferate.

Just try to fast-forward to when human communities online and offline are constantly infiltrated by bots and drones and sleeper agents, playing nice for a long time and amassing karma / reputation / connections / trust / whatever until finally doing a coordinated attack.

Honestly, people just don’t seem to get it until it’s too late. Same with ecosystem destruction — tons of people keep strawmanning it as mere temperature shifts, even while ecosystems around the world get destroyed. Kelp forests. Rainforests. Coral reefs. Fish. Insects. And they’re like “haha global warming by 3 degrees big deal. Temperature has always changed on the planet.” (Sound familiar?)

Look, I don’t actually want any of this to happen. But if they could somehow experience the movie It’s a Wonderful Life or meet the Ghost of Christmas Yet to Come, I’d wholeheartedly want every denier to have that experience. (In fact, a dedicated attacker can already give them a taste of this with current technology. I am sure it will become a decentralized service soon :-( )

replies(1): >>45946120 #
17. armchairhacker ◴[] No.45945986[source]
I don’t think this can be solved legally without compromising anonymity. You can block unrecognized clients and punish the owners of clients that behave badly, but then, for example, an oppressive government can (physically) take over a subversive website and punish everyone who accesses it.

Maybe pseudo-anonymity and “punishment” via reputation could work. Then an oppressive government with access to a subversive website (ignoring bad security, coordination with other hijacked sites, etc.) can only poison its clients’ reputations, and (if reputation is tied to sites, who have their own reputations) only temporarily.

replies(1): >>45946200 #
18. embedding-shape ◴[] No.45946032[source]
> and I see no compelling alternative.

Another potential cause: It's way easier for pretty much any person connected to the internet to "create" their own automation software by using LLMs. I could wager even the less smart LLMs could handle "Create a program that checks this website every second for any product updates on all pages" and give enough instructions for the average computer user to be able to run it without thinking or considering much.

Multiply this by every person with access to an LLM who wants to "do X with website Y" and you'll get an order-of-magnitude increase in traffic across the internet. This has been possible since what, 2023 sometime? Not sure if the patterns would line up, but just another guess for the cause(s).

19. hshdhdhj4444 ◴[] No.45946120[source]
Our tech overlords understand AI, especially any form of AGI, will basically be the end of humanity. That’s why they’re entirely focused on being the first and amassing as much wealth in the meanwhile, giving up on any sort of consideration whether they’re doing good for people or not.
20. mrweasel ◴[] No.45946156{3}[source]
The big cloud providers need to step up and take responsibility. I understand that it can't be too easy to do, but we really do need a way to contact e.g. AWS and tell them to shut off a customer. I have no problem with someone scraping our websites, but I do care when they don't do so responsibly: slow down when we start responding slower, and don't assume that you can just go full throttle, crash our site, wait, and then do it again once we start responding again.
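
The "slow down when we start responding slower" behaviour could look something like the sketch below on the crawler side. The class name, thresholds, and multipliers are assumptions chosen for illustration; the idea is simply to back off quickly when the site looks stressed and to recover slowly.

```python
class PoliteDelay:
    """Tracks how long a crawler should wait between requests to one site."""

    def __init__(self, base_delay: float = 1.0, max_delay: float = 60.0,
                 slow_threshold: float = 2.0):
        self.base_delay = base_delay          # never go faster than this
        self.max_delay = max_delay            # cap on the back-off
        self.slow_threshold = slow_threshold  # response time that counts as "struggling"
        self.delay = base_delay

    def update(self, response_seconds: float, failed: bool = False) -> float:
        if failed or response_seconds > self.slow_threshold:
            # The site is slow or erroring: double the wait, up to the cap.
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            # The site looks healthy: ease back toward the base delay gradually.
            self.delay = max(self.delay * 0.9, self.base_delay)
        return self.delay
```

A crawler would call `update()` after each response and sleep for the returned number of seconds before the next request, instead of resuming at full speed the moment the site recovers.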

You're absolutely right: AWS, GCP, Azure and others, they do not care and especially AWS and GCP are massive enablers.

replies(1): >>45946560 #
21. lifty ◴[] No.45946182{3}[source]
Zero knowledge proof constructs have the potential to solve these kinds of privacy/reputation tradeoffs.
22. ajuc ◴[] No.45946200{3}[source]
> but then, for example, an oppressive government can (physically) take over a subversive website and punish everyone who accesses it.

Already happens. Oppressive governments already punish people for visiting "wrong" websites. They already censor internet.

There are no technological solutions to coordination problems. Ultimately, no matter what you invent, it's politics that will decide how it's used and by whom.

23. rkagerer ◴[] No.45946204{4}[source]
I agree and think the ability to spin up new identities is crucial to any sort of successful reputation system (and reflects the realities of how both good and bad actors would use it). Think back to early internet when you wanted an identity in one community (e.g. forums about games you play) that was separate from another (e.g. banking). But it means those reputation identities need to take some investment (e.g. of time / contribution / whatever) to build, and can't become usefully trusted until reaching some threshold.
replies(1): >>45948848 #
24. marginalia_nu ◴[] No.45946229{3}[source]
We have historically solved this via treaties.

If you want to trade with me, a country that exports software, let's agree to both criminalize software piracy.

No reason why this can't be extended to DDoS attacks.

replies(1): >>45947147 #
25. rkagerer ◴[] No.45946258{3}[source]
It's interesting that along another axis, the inertia of the internet moved from a decentralized structure back toward something that resembles mainframes. I don't think those axes are orthogonal.
26. TylerE ◴[] No.45946316{4}[source]
Because of course what this world needs is for the wealthy to have even more advantages over the normies. (Hint: If you're reading this, and think you're one of the wealthy ones, you aren't)
27. JimDabell ◴[] No.45946411{4}[source]
Describing “I don’t want to provide service to you and I should have the means of doing so” as a “digital death penalty” is a tad hyperbolic, don’t you think?

> It would be way too easy for the current regime (whoever that happens to be) to criminalize random behaviors (Trans People? Atheists? Random nationality?) to ban their identity, and then they can't apply for jobs, get bus fare, purchase anything online, communicate with their lawyers, etc.

Authoritarian regimes can already do that.

I think perhaps you might’ve missed the fact that what I was suggesting was individual to each service:

> Reputation plus privacy is probably unsolvable; the whole point of reputation is knowing what people are doing elsewhere. You don’t need reputation, you need persistence. You don’t need to know if they are behaving themselves elsewhere on the Internet as long as you can ban them once and not have them come back.

I was saying don’t care about what people are doing elsewhere on the Internet. Just ban locally – but persistently.

28. ctoth ◴[] No.45946560{4}[source]
> we really do need a way to contact e.g. AWS and tell them to shut off a customer.

You realize you just described the infrastructure for far worse abuse than a misconfigured scraper, right?

replies(1): >>45947001 #
29. BinaryIgor ◴[] No.45946661[source]
Good points; I would definitely vouch for an independent standard for request signing + some kind of decentralized reputation system. With international law enforcement, I think there could be too many political issues for it not to become corrupt.
30. mrweasel ◴[] No.45947001{5}[source]
I'm very aware of that, yes. There needs to be a good process; the current situation, where AWS simply does not care or doesn't know, also isn't particularly good. One solution could be for victims to notify AWS that a number of specified IPs are generating an excessive amount of traffic. An operator could then verify with AWS traffic logs, notify the customer that they are causing issues, and only after a failure to respond could the customer be shut down.
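
The victim's side of that process, working out which IPs to report, might be sketched as follows. The log format (a list of `(ip, timestamp)` pairs) and the threshold are assumptions for illustration, and the addresses in the usage example come from a range reserved for documentation.

```python
from collections import Counter


def heavy_hitters(log_entries, threshold):
    """Return {ip: request_count} for every IP with more than `threshold` requests.

    `log_entries` is an iterable of (ip, timestamp) pairs that has already
    been restricted to the time window of interest.
    """
    counts = Counter(ip for ip, _timestamp in log_entries)
    return {ip: n for ip, n in counts.items() if n > threshold}


# Usage: one address makes 500 requests in the window, another makes 3.
entries = [("203.0.113.7", t) for t in range(500)] + [("203.0.113.9", t) for t in range(3)]
report = heavy_hitters(entries, threshold=100)
```

The provider would still have to check such a report against its own traffic logs, as described above, since a list like this is trivial to fabricate.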

You're not wrong that abuse would be a massive issue, but I'm on the other side of this and need Amazon to do something, anything.

31. beeflet ◴[] No.45947147{4}[source]
I don't want governments to have this level of control over the internet. It seems like you are paving over a technological problem with the way the internet is designed by giving some institution a ton of power over the internet.
replies(1): >>45948248 #
32. marginalia_nu ◴[] No.45948248{5}[source]
The alternative to governments stopping misbehavior is every website hiding behind Cloudflare or a small number of competitors, which is a situation that is far more susceptible to abuse than having a law that says you can't DDoS people even if you live in Singapore.

It really can not be overstated how unsustainable the status quo is.

replies(1): >>45949871 #
33. nucleardog ◴[] No.45948848{5}[source]
Yep, this is basically how I'd implement it if I needed to. Just tackle the problem in reverse here: Don't assume users are good and try and track which are bad, assume users are bad and track which are good.

Look at the HN karma system--you start with limited features, and as you show yourself a good user, you get more features (and also trust/standing with the community). "Resetting" your identity only ever loses you something.

Apply the same thing to a git host getting hammered or something--by default, users can't view the history online or something (can still clone), but as your identity establishes reputation (through positive interactions, or even just browsing in a non-bot-like manner), your reputation increases and you get rate-limited access or something.
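
A minimal sketch of that "assume bad, track good" approach is below. The tier values and names are hypothetical; the point is that an unknown identity starts at the most restricted tier, so discarding an identity never gains anything.

```python
from collections import defaultdict

# Hypothetical tiers: (minimum reputation, requests allowed per minute),
# checked from the most trusted tier down.
TIERS = [(100, 600), (10, 60), (0, 5)]


class ReputationLimiter:
    def __init__(self):
        self.scores = defaultdict(int)

    def record_good_interaction(self, identity: str, points: int = 1) -> None:
        # Reputation is only ever earned through observed good behaviour.
        self.scores[identity] += points

    def requests_per_minute(self, identity: str) -> int:
        # Unknown identities score 0 and land in the lowest tier.
        score = self.scores.get(identity, 0)
        for minimum, limit in TIERS:
            if score >= minimum:
                return limit
        return 0  # only reachable if a score has been pushed below zero
```

What counts as a "good interaction" is left open here, as it is in the comment; that judgment is where most of the real difficulty lies.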

This is essentially where a lot of spam ended up--it used to be that your mail was deliverable until you acted poorly, then your reputation was bad and your deliverability went down. Now it more closely resembles this--your reputation is bad until you send enough good mail and take enough good actions (DKIM/SPF, etc) to show yourself as good.

The issues really all stem from "resetting your identity gets you back in good standing". Once you take that out of the mix, you no longer need to worry much about limiting identities, tying them to the real world, ensuring they're persistent, or many of the other hard problems that come up.

34. beeflet ◴[] No.45949871{6}[source]
I think the alternative is to recreate the internet with more p2p friendly infrastructure. BitTorrent does not have this same DDoS problem. Mesh networks are designed with sybil resistance in mind
replies(1): >>45952200 #
35. hnthrowaway0315 ◴[] No.45950847{3}[source]
Yeah, something like this. It would be nice if it actually fed bad data that requires a human to double-check, too. Not something seriously wrong but something subtle, like changing a couple of letters in the name of a country, or randomizing the national day. Once a lot of websites start to use it, AI might actually get confused, I think? But humans never read these pages so it should be largely fine -- unless they are reading AI summaries.
36. marginalia_nu ◴[] No.45952200{7}[source]
The internet already is p2p infrastructure.

BitTorrent is just as susceptible to this, it's just there's currently no economic incentive to try to exhaustively scrape it from 50,000 VPS nodes.

replies(1): >>45958930 #
37. beeflet ◴[] No.45958930{8}[source]
>The internet already is p2p infrastructure.

No, it really isn't. Unless you mean like on the BGP level. But it's p2p in the sense where you have to trust every party not to break the system. It's like email or mastodon, it doesn't solve the fundamental sybil problem at hand.

>BitTorrent is just as susceptible to this,

In BitTorrent, things are hosted by ad hoc users whose numbers are roughly proportional to the number of downloaders. It is not unimaginable that you could staple a reputation system on top of it like PTs already do.