
136 points by xnx | 1 comment
toomuchtodo (No.43708221)
This sounds like a good way to take the ML/AI consumption load off Wikimedia infra?
immibis (No.43708252)
The consumption load isn't the problem. You can download a complete dump of Wikipedia, and even if every AI company downloaded the newest dump every time it came out, the server load would be manageable: probably double-digit terabytes per month, which is easy to serve these days. And if even that were a problem, Wikimedia could charge a reasonable amount to ship it on a stack of BD-R discs, or heck, these companies can easily afford a leased line to Wikimedia HQ.

The problem is the non-consumptive load where they just flat-out DDoS the site for no actual reason. They should be criminally charged for that.

Late edit: Individual page loads to answer specific questions aren't a problem either. DDoS is the problem.
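
For context, here is a minimal sketch of fetching the latest English dump instead of crawling the live site. It assumes the usual dumps.wikimedia.org layout and filename; adjust as needed.

    # Sketch: pull the latest English Wikipedia dump instead of scraping pages.
    # Assumes the standard dumps.wikimedia.org layout; the filename may change.
    import requests

    DUMP_URL = ("https://dumps.wikimedia.org/enwiki/latest/"
                "enwiki-latest-pages-articles.xml.bz2")

    def download_dump(dest="enwiki-latest-pages-articles.xml.bz2"):
        # Stream to disk so the multi-gigabyte file never sits in memory.
        with requests.get(DUMP_URL, stream=True, timeout=60) as resp:
            resp.raise_for_status()
            with open(dest, "wb") as out:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    out.write(chunk)

    if __name__ == "__main__":
        download_dump()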

parpfish (No.43708581)
I'd assume the AI companies use the wiki dumps for training, but there are probably tons of bots that query Wikipedia from the web when doing some sort of web search or function call. For those per-question lookups there is already a supported path; a rough sketch of a polite call to the MediaWiki Action API follows (the User-Agent string is a placeholder).
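
    # Sketch of a "polite" per-question lookup via the MediaWiki Action API,
    # the kind of call a web-search/function-call bot would make.
    # Endpoint and parameters are the standard ones; the User-Agent is made up.
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def lookup(title: str) -> str:
        params = {
            "action": "query",
            "prop": "extracts",
            "exintro": 1,
            "explaintext": 1,
            "titles": title,
            "format": "json",
            "maxlag": 5,  # back off when the replica databases are lagged
        }
        headers = {"User-Agent": "example-bot/0.1 (contact@example.org)"}
        resp = requests.get(API, params=params, headers=headers, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        if "error" in data:  # e.g. maxlag asked us to retry later
            return ""
        pages = data["query"]["pages"]
        return next(iter(pages.values())).get("extract", "")

    print(lookup("Wikipedia")[:200])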
mrbungie (No.43708767)
Wikimedia or someone else could offer some kind of MCP service/proxy for real-time data consumption (e.g. for use cases where the dump data is not useful enough), billed by usage.
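
As a rough illustration of that idea, a minimal sketch using the Python MCP SDK's FastMCP helper; the server name, tool, and billing hook are assumptions, not an existing Wikimedia service.

    # Hypothetical sketch of a Wikipedia lookup exposed as an MCP tool.
    # Uses the Python MCP SDK (pip install mcp requests); the metering hook
    # marks where per-call billing could go - it is an assumption.
    import requests
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("wikipedia-proxy")

    @mcp.tool()
    def wiki_summary(title: str) -> str:
        """Return the intro extract of a Wikipedia article."""
        # meter_usage(caller, title)  # hypothetical billing hook
        resp = requests.get(
            "https://en.wikipedia.org/w/api.php",
            params={"action": "query", "prop": "extracts", "exintro": 1,
                    "explaintext": 1, "titles": title, "format": "json"},
            headers={"User-Agent": "wiki-mcp-proxy/0.1 (contact@example.org)"},
            timeout=30,
        )
        pages = resp.json()["query"]["pages"]
        return next(iter(pages.values())).get("extract", "")

    if __name__ == "__main__":
        mcp.run()  # stdio transport by default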
ipaddr (No.43710304)
Does any repo exist with an updated bot list for blocking these website-killing bots?
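
Community-maintained deny lists of AI crawler user agents do exist; as a sketch of how one could be enforced at the application layer, here is a small WSGI middleware (the agent names below are only examples, and the class itself is hypothetical).

    # Sketch: WSGI middleware that rejects requests whose User-Agent matches a
    # deny list. The names are illustrative; in practice you would load and
    # refresh a community-maintained list.
    DENYLIST = ("GPTBot", "CCBot", "Bytespider")

    class BlockBots:
        def __init__(self, app, denylist=DENYLIST):
            self.app = app
            self.denylist = denylist

        def __call__(self, environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            if any(bad.lower() in ua.lower() for bad in self.denylist):
                start_response("403 Forbidden",
                               [("Content-Type", "text/plain")])
                return [b"blocked\n"]
            return self.app(environ, start_response)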