
136 points by xnx | 1 comment
toomuchtodo (No.43708221)
This sounds like a good way to take the ML/AI consumption load off Wikimedia infra?
immibis (No.43708252)
The consumption load isn't the problem. You can download a complete dump of Wikipedia, and even if every AI company downloaded the newest dump every time it came out, the server load would be manageable: probably double-digit terabytes per month, which is easy to serve these days. And if even that were a problem, Wikimedia could charge a reasonable amount to ship it on a stack of BD-R discs, or heck, these companies can easily afford a leased line to Wikimedia HQ.

The problem is the non-consumptive load where they just flat-out DDoS the site for no actual reason. They should be criminally charged for that.

Late edit: Individual page loads to answer specific questions aren't a problem either. DDoS is the problem.
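
For context, here is a minimal sketch of fetching the latest English dump instead of crawling the live site. It assumes the usual dumps.wikimedia.org layout and filename; adjust as needed.

    # Sketch: pull the latest English Wikipedia dump instead of scraping pages.
    # Assumes the standard dumps.wikimedia.org layout; the filename may change.
    import requests

    DUMP_URL = ("https://dumps.wikimedia.org/enwiki/latest/"
                "enwiki-latest-pages-articles.xml.bz2")

    def download_dump(dest="enwiki-latest-pages-articles.xml.bz2"):
        # Stream to disk so the multi-gigabyte file never sits in memory.
        with requests.get(DUMP_URL, stream=True, timeout=60) as resp:
            resp.raise_for_status()
            with open(dest, "wb") as out:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    out.write(chunk)

    if __name__ == "__main__":
        download_dump()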

parpfish (No.43708581)
I'd assume the AI companies use the wiki dumps for training, but there are probably tons of bots that query Wikipedia from the web when doing some sort of web search or function call. For those per-question lookups there is already a supported path; a rough sketch of a polite call to the MediaWiki Action API follows (the User-Agent string is a placeholder).
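
    # Sketch of a "polite" per-question lookup via the MediaWiki Action API,
    # the kind of call a web-search/function-call bot would make.
    # Endpoint and parameters are the standard ones; the User-Agent is made up.
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def lookup(title: str) -> str:
        params = {
            "action": "query",
            "prop": "extracts",
            "exintro": 1,
            "explaintext": 1,
            "titles": title,
            "format": "json",
            "maxlag": 5,  # back off when the replica databases are lagged
        }
        headers = {"User-Agent": "example-bot/0.1 (contact@example.org)"}
        resp = requests.get(API, params=params, headers=headers, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        if "error" in data:  # e.g. maxlag asked us to retry later
            return ""
        pages = data["query"]["pages"]
        return next(iter(pages.values())).get("extract", "")

    print(lookup("Wikipedia")[:200])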
mrbungie (No.43708767)
Wikimedia or someone else could offer some kind of MCP service/proxy for real-time data consumption (e.g. for use cases where the dump data is not useful enough), billed by usage.
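
As a rough illustration of that idea, a minimal sketch using the Python MCP SDK's FastMCP helper; the server name, tool, and billing hook are assumptions, not an existing Wikimedia service.

    # Hypothetical sketch of a Wikipedia lookup exposed as an MCP tool.
    # Uses the Python MCP SDK (pip install mcp requests); the metering hook
    # marks where per-call billing could go - it is an assumption.
    import requests
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("wikipedia-proxy")

    @mcp.tool()
    def wiki_summary(title: str) -> str:
        """Return the intro extract of a Wikipedia article."""
        # meter_usage(caller, title)  # hypothetical billing hook
        resp = requests.get(
            "https://en.wikipedia.org/w/api.php",
            params={"action": "query", "prop": "extracts", "exintro": 1,
                    "explaintext": 1, "titles": title, "format": "json"},
            headers={"User-Agent": "wiki-mcp-proxy/0.1 (contact@example.org)"},
            timeout=30,
        )
        pages = resp.json()["query"]["pages"]
        return next(iter(pages.values())).get("extract", "")

    if __name__ == "__main__":
        mcp.run()  # stdio transport by default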
ipaddr (No.43710304)
Does any repo exist with an updated bot list for blocking these website-killing bots?
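
Community-maintained deny lists of AI crawler user agents do exist; as a sketch of how one could be enforced at the application layer, here is a small WSGI middleware (the agent names below are only examples, and the class itself is hypothetical).

    # Sketch: WSGI middleware that rejects requests whose User-Agent matches a
    # deny list. The names are illustrative; in practice you would load and
    # refresh a community-maintained list.
    DENYLIST = ("GPTBot", "CCBot", "Bytespider")

    class BlockBots:
        def __init__(self, app, denylist=DENYLIST):
            self.app = app
            self.denylist = denylist

        def __call__(self, environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            if any(bad.lower() in ua.lower() for bad in self.denylist):
                start_response("403 Forbidden",
                               [("Content-Type", "text/plain")])
                return [b"blocked\n"]
            return self.app(environ, start_response)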