
545 points mmh0000 | 8 comments
jchw No.43572243
I'm rooting for Ladybird to gain traction. Currently it uses cURL proper for networking. That will probably bring some challenges (I think cURL is still limited in some ways; e.g. I don't think it can do WebSockets over h2 yet), but on the other hand, having a rising browser engine might eventually close this avenue for fingerprinting, since legitimate traffic will have the same fingerprint as stock cURL.
replies(6): >>43572413 #>>43573011 #>>43574225 #>>43576912 #>>43580376 #>>43583469 #
nonrandomstring No.43574225
When I spoke to these guys [0], we touched on the quirks and foibles that make up a signature (including TCP-stack stuff beyond the control of any userspace app).

I love curl, but I worry that if a component takes on the role of deception in order to "keep up", it accumulates a legacy of hard-to-maintain "compatibility" baggage.

Ideally it should just say... "hey I'm curl, let me in"

The problem of course lies with a server that is picky about dress codes, and that problem in turn is caused by crooks sneaking in under a disguise, so it's rather a circular chicken-and-egg thing.

[0] https://cybershow.uk/episodes.php?id=39

replies(2): >>43574560 #>>43575789 #
immibis No.43574560
What should happen instead is that Chrome stops sending as much of a fingerprint, so that sites can't fingerprint. That won't happen, since it's against Google's interests.
replies(1): >>43574900 #
gruez No.43574900
This is a fundamental misunderstanding of how TLS fingerprinting works. The "fingerprint" isn't Chrome sending a "fingerprint: [random uuid]" attribute in every TLS negotiation; it's derived from various properties of the TLS stack, like which ciphers it can accept. You can't "stop sending as much of a fingerprint" without every browser agreeing on the same TLS stack. It's already minimal as it is: there's basically no aspect of the TLS stack that users can configure, and Chrome bundles its own, so you'd expect every Chrome user to have the same TLS fingerprint. It's only really useful for distinguishing "fake" Chrome users (e.g. curl with custom headers set, or Firefox users with a user-agent spoofer) from "real" Chrome users.
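
To make that concrete: a JA3-style fingerprint is basically a hash over what the client offers in its ClientHello. A minimal sketch in Python (the numeric values are made up for illustration; the real JA3 spec pins down the exact fields and formatting):

    import hashlib

    # Rough JA3-style fingerprint: join ClientHello fields into a string, hash it.
    def ja3_like(version, ciphers, extensions, curves, point_formats):
        fields = [
            str(version),
            "-".join(str(c) for c in ciphers),
            "-".join(str(e) for e in extensions),
            "-".join(str(c) for c in curves),
            "-".join(str(p) for p in point_formats),
        ]
        return hashlib.md5(",".join(fields).encode()).hexdigest()

    # Offering the same ciphers in a different order already yields a different hash:
    print(ja3_like(771, [4865, 4866], [0, 23, 65281], [29, 23], [0]))
    print(ja3_like(771, [4866, 4865], [0, 23, 65281], [29, 23], [0]))
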
replies(2): >>43574983 #>>43584170 #
dochtman No.43574983
Part of the fingerprint is stuff like the ordering of extensions, which Chrome could easily randomize but AFAIK doesn't.

(AIUI Google’s Play Store is one of the biggest TLS fingerprinting culprits.)

replies(2): >>43575010 #>>43575074 #
shiomiru No.43575074
Chrome has randomized its ClientHello extension order for two years now.[0]

The companies to blame here are solely the ones employing these fingerprinting techniques, and those relying on services of these companies (which is a worryingly large chunk of the web). For example, after the Chrome change, Cloudflare just switched to a fingerprinter that doesn't check the order.[1]

[0]: https://chromestatus.com/feature/5124606246518784

[1]: https://blog.cloudflare.com/ja4-signals/
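
The general idea of an order-insensitive fingerprint is simply to sort the extension list before hashing, so shuffling no longer changes the value. A rough sketch (illustrative only, not Cloudflare's actual JA4 code):

    import hashlib

    def ext_hash(extensions, ignore_order=False):
        # Sort the extension list first if order should not matter.
        exts = sorted(extensions) if ignore_order else list(extensions)
        return hashlib.sha256("-".join(str(e) for e in exts).encode()).hexdigest()[:12]

    a = [0, 16, 23, 65281]   # one permutation of a client's extensions
    b = [65281, 0, 23, 16]   # the same set, shuffled
    print(ext_hash(a) == ext_hash(b))                                       # False
    print(ext_hash(a, ignore_order=True) == ext_hash(b, ignore_order=True)) # True
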

replies(2): >>43575406 #>>43576104 #
1. nonrandomstring No.43575406
> blame here are solely the ones employing these fingerprinting techniques,

Sure. And it's a tragedy. But when you look at the bot situation and the sheer magnitude of resource abuse out there, you have to see it from the other side.

FWIW, in the conversation mentioned above we acknowledged that and moved on to talking about behavioural fingerprinting, and why it makes sense not to focus on the browser/agent alone but on what gets done with it.

replies(1): >>43576317 #
2. NavinF No.43576317
Last time I saw someone complaining about scrapers, they were talking about 100 GiB/month. That's about 300 kbps: less than $1/month in IP transit and ~$0 in compute. Personally I've never noticed bots show up on a resource graph. As long as you don't block them, they won't bother using more than a few IPs, and they'll back off when they're throttled.
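
Rough arithmetic behind that conversion, assuming a 30-day month:

    # Back-of-the-envelope check of the 100 GiB/month figure.
    gib = 1024 ** 3
    bits = 100 * gib * 8
    seconds = 30 * 24 * 3600
    print(round(bits / seconds / 1000))   # ~331 kbps
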
replies(3): >>43576459 #>>43577683 #>>43579340 #
3. marcusb No.43576459
For some sites, things are a lot worse. See, for example, Jonathan Corbet's report[0].

0 - https://social.kernel.org/notice/AqJkUigsjad3gQc664

replies(1): >>43596528 #
4. lmz No.43577683
How can you say it's $0 in compute without knowing if the data returned required any computation?
replies(1): >>43596554 #
5. nonrandomstring No.43579340
Didn't rachelbythebay post recently that her blog was being swamped? I've heard the same from a few self-hosting bloggers now. And Wikipedia has recently said more than half of its traffic is now bots. Are you claiming this isn't a real problem?
replies(1): >>43596544 #
6. NavinF No.43596528{3}
He provides no info. Req/s? 95th-percentile Mbps? How does he know the requests come from an "AI-scraper" as opposed to a normal L7 DDoS? LWN is a pretty simple site; it should be easy to saturate 10G ports.
7. NavinF No.43596544{3}
How exactly can a blog get swamped? It takes ~0 compute per request. Yes, I'm claiming this is a fake problem.
8. NavinF No.43596554{3}
Look at the sibling replies. All the kvetching comes from blogs and simple websites, not the ones that consume real compute per request.