
770 points | ta988 | 3 comments
buro9 | No.42551470
Their appetite cannot be sated, and there is little to no value in giving them access to the content.

I have data: 7 days from a single platform with about 30 forums on this instance.

4.8M hits from Claude
390k from Amazon
261k from Data For SEO
148k from Chat GPT

That Claude one! Wowser.

Bots that match this (which is also the list I block on some other forums that are fully private by default):

(?i).*(AhrefsBot|AI2Bot|AliyunSecBot|Amazonbot|Applebot|Awario|axios|Baiduspider|barkrowler|bingbot|BitSightBot|BLEXBot|Buck|Bytespider|CCBot|CensysInspect|ChatGPT-User|ClaudeBot|coccocbot|cohere-ai|DataForSeoBot|Diffbot|DotBot|ev-crawler|Expanse|FacebookBot|facebookexternalhit|FriendlyCrawler|Googlebot|GoogleOther|GPTBot|HeadlessChrome|ICC-Crawler|imagesift|img2dataset|InternetMeasurement|ISSCyberRiskCrawler|istellabot|magpie-crawler|Mediatoolkitbot|Meltwater|Meta-External|MJ12bot|moatbot|ModatScanner|MojeekBot|OAI-SearchBot|Odin|omgili|panscient|PanguBot|peer39_crawler|Perplexity|PetalBot|Pinterestbot|PiplBot|Protopage|scoop|Scrapy|Screaming|SeekportBot|Seekr|SemrushBot|SeznamBot|Sidetrade|Sogou|SurdotlyBot|Timpibot|trendictionbot|VelenPublicWebCrawler|WhatsApp|wpbot|xfa1|Yandex|Yeti|YouBot|zgrab|ZoominfoBot).*

I am moving to just blocking them all; it's ridiculous.

Everything on this list got itself there by being abusive (either ignoring robots.txt, or not backing off when latency increased).
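If you're doing this in the application rather than at the nginx/HAProxy layer, the block is just a regex check on the User-Agent header. A rough, untested Go sketch (abbreviated blocklist; the port and handler are placeholders):

    package main

    import (
        "log"
        "net/http"
        "regexp"
    )

    // Abbreviated version of the blocklist above; extend the alternation with
    // the full set of bot names as needed.
    var badBots = regexp.MustCompile(`(?i)(AhrefsBot|Amazonbot|Bytespider|CCBot|ChatGPT-User|ClaudeBot|DataForSeoBot|GPTBot|Meta-External|Perplexity|PetalBot)`)

    // blockBots rejects any request whose User-Agent matches the blocklist
    // before it reaches the rest of the application.
    func blockBots(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if badBots.MatchString(r.Header.Get("User-Agent")) {
                http.Error(w, "Forbidden", http.StatusForbidden)
                return
            }
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("forum content\n"))
        })
        log.Fatal(http.ListenAndServe(":8080", blockBots(mux)))
    }

The same pattern works in any web server or framework that lets you match on the User-Agent before doing expensive work.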

replies(9): >>42551536 #>>42551667 #>>42551719 #>>42551781 #>>42551798 #>>42551877 #>>42552584 #>>42552786 #>>42565241 #
coldpie | No.42551536
You know, at this point, I wonder if an allowlist would work better.
replies(3): >>42551619 #>>42551671 #>>42552608 #
buro9 | No.42551671
I have thought about writing such a thing...

1. A proxy that looks at HTTP Headers and TLS cipher choices

2. An allowlist that records which headers each browser sends and which ciphers it selects

3. A dynamic loading of the allowlist into the proxy at some given interval

New browser versions or OS updates would mean updating the allowlist, but I'm not sure that's all that inconvenient; it could be maintained via GitHub so people could submit new combinations.

I'd rather just say "I trust real browsers" and dump the rest.

Also, I noticed a far simpler block: just reject almost every request whose UA claims to be "compatible" (rough sketch of both ideas below).
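Roughly what I have in mind, as an untested Go sketch: record the cipher suites each connection offers in its ClientHello, combine them with a few request headers into a fingerprint, and only serve requests whose fingerprint is in an allowlist that gets reloaded on an interval. The header set, hashing, file name, port, and certificates are all placeholders (JA3/JA4-style TLS fingerprinting is the polished version of this idea), and it folds in the "compatible" UA block too:

    package main

    import (
        "crypto/sha256"
        "crypto/tls"
        "encoding/hex"
        "fmt"
        "log"
        "net/http"
        "os"
        "strings"
        "sync"
        "time"
    )

    var (
        mu      sync.RWMutex
        hellos  = map[string][]uint16{} // remote addr -> cipher suites offered in the ClientHello
        allowed = map[string]bool{}     // fingerprint -> allowed; reloaded periodically
    )

    // fingerprint hashes the offered cipher suites plus a few telling headers.
    // The header set and hashing scheme are assumptions for this sketch.
    func fingerprint(ciphers []uint16, hdr http.Header) string {
        var b strings.Builder
        for _, c := range ciphers {
            fmt.Fprintf(&b, "%04x,", c)
        }
        for _, h := range []string{"Accept", "Accept-Language", "Accept-Encoding", "User-Agent"} {
            b.WriteString(strings.ToLower(hdr.Get(h)) + "|")
        }
        sum := sha256.Sum256([]byte(b.String()))
        return hex.EncodeToString(sum[:8])
    }

    // reloadAllowlist re-reads a newline-separated fingerprint file on an
    // interval (step 3); the path is a placeholder.
    func reloadAllowlist(path string, every time.Duration) {
        for {
            if data, err := os.ReadFile(path); err == nil {
                next := map[string]bool{}
                for _, line := range strings.Split(string(data), "\n") {
                    if line = strings.TrimSpace(line); line != "" {
                        next[line] = true
                    }
                }
                mu.Lock()
                allowed = next
                mu.Unlock()
            }
            time.Sleep(every)
        }
    }

    func main() {
        go reloadAllowlist("allowlist.txt", time.Minute)

        handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            mu.RLock()
            ciphers := hellos[r.RemoteAddr]
            ok := allowed[fingerprint(ciphers, r.Header)]
            mu.RUnlock()
            // The "far simpler block": reject UAs that claim to be "compatible".
            if !ok || strings.Contains(strings.ToLower(r.Header.Get("User-Agent")), "compatible") {
                http.Error(w, "Forbidden", http.StatusForbidden)
                return
            }
            w.Write([]byte("hello, real browser\n"))
        })

        srv := &http.Server{
            Addr:    ":8443",
            Handler: handler,
            TLSConfig: &tls.Config{
                // Record what each connection offered so the HTTP handler can
                // look it up again by remote address. Entries are never
                // evicted in this sketch.
                GetConfigForClient: func(h *tls.ClientHelloInfo) (*tls.Config, error) {
                    mu.Lock()
                    hellos[h.Conn.RemoteAddr().String()] = h.CipherSuites
                    mu.Unlock()
                    return nil, nil
                },
            },
        }
        log.Fatal(srv.ListenAndServeTLS("cert.pem", "key.pem"))
    }

A real deployment would sit in front of the forums as a reverse proxy (httputil.ReverseProxy) rather than serving content directly, and would evict the per-connection ClientHello records when connections close.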

replies(2): >>42552152 #>>42575828 #
qazxcvbnmlp | No.42552152
Everything in this scheme can be programmatically simulated by a bot with bad intentions. It will be a cat-and-mouse game of finding behaviors that differentiate bots from real users and patching them.

To truly say “I trust real browsers” requires a signal of the integrity of the user and browser, such as cryptographic device attestation of the browser... which has to be centrally verified. Which is also not great.

replies(1): >>42552259 #
coldpie | No.42552259
> Everything in this scheme can be programmatically simulated by a bot with bad intentions. It will be a cat-and-mouse game of finding behaviors that differentiate bots from real users and patching them.

Forcing Facebook & Co to play the adversary role still seems like an improvement over the current situation. They're clearly operating illegitimately if they start spoofing real user agents to get around bot blocking capabilities.

replies(1): >>42552775 #
Terr_ | No.42552775
I'm imagining a quixotic terms of service, where "by continuing" any bot access grants the site-owner a perpetual and irrevocable license to use and relicense all data, works, or other products resulting from any use of the crawled content, including but not limited to cases where that content was used in a statistical text generative model.