
136 points | xnx | 1 comment
toomuchtodo ◴[] No.43708221[source]
This sounds like a good way to take the ML/AI consumption load off Wikimedia infra?
replies(1): >>43708252 #
immibis ◴[] No.43708252[source]
The consumption load isn't the problem. You can download a complete dump of Wikipedia, and even if every AI company grabbed the newest dump every time it came out, the server load would be manageable - probably double-digit terabytes per month, which is nothing special these days. And if even that were a problem, Wikimedia could charge a reasonable amount to ship it on a stack of BD-R discs, or heck, these companies can easily afford a leased line to Wikimedia HQ.
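For scale, grabbing the latest English dump is a single streamed HTTP request. A minimal Python sketch, assuming the standard dumps.wikimedia.org layout and the plain (non-multistream) articles file:

    # Stream the latest English Wikipedia article dump to disk instead of crawling.
    # The URL follows the usual dumps.wikimedia.org layout; multistream and
    # per-date variants also exist.
    import requests

    DUMP_URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"

    with requests.get(DUMP_URL, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open("enwiki-latest-pages-articles.xml.bz2", "wb") as out:
            for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
                out.write(chunk)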

The problem is the non-consumptive load where they just flat-out DDoS the site for no actual reason. They should be criminally charged for that.

Late edit: Individual page loads to answer specific questions aren't a problem either. DDoS is the problem.

replies(2): >>43708581 #>>43710312 #
parpfish ◴[] No.43708581[source]
I'd assume that AI companies use the wiki dumps for training, but there are probably tons of bots that query Wikipedia over the web when doing some sort of web-search/function call.
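For those per-query bots, the public Wikimedia REST API already returns compact JSON summaries, which is much lighter than scraping rendered pages. A rough Python sketch; the User-Agent is a placeholder that Wikimedia asks you to replace with real contact info:

    # Fetch a plain-text lead-section summary for a single title via the
    # Wikimedia REST v1 API (the kind of call a function-calling bot makes).
    import requests

    def wiki_summary(title: str) -> str:
        # Titles containing spaces should be underscored or URL-encoded.
        url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
        headers = {"User-Agent": "example-bot/0.1 (contact@example.com)"}  # placeholder
        resp = requests.get(url, headers=headers, timeout=10)
        resp.raise_for_status()
        return resp.json()["extract"]  # short plain-text summary of the article

    print(wiki_summary("Transistor"))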
replies(4): >>43708767 #>>43708888 #>>43708995 #>>43710626 #
philipkglass ◴[] No.43708995[source]
The raw wiki dumps contain "wikitext" markup that is significantly different from the nice readable pages you see while browsing Wikipedia.

Compare:

https://en.wikipedia.org/wiki/Transistor

with the raw markup seen in

https://en.wikipedia.org/w/index.php?title=Transistor&action...

That markup format is very hard to parse/render because it evolved organically to mean "whatever the Wikipedia software does." I haven't found an independent renderer that handles all of its edge cases correctly. The new Kaggle/Wikimedia collaboration seems to solve that problem for many use cases, since the announcement says:

This release is powered by our Snapshot API’s Structured Contents beta, which outputs Wikimedia project data in a developer-friendly, machine-readable format. Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content—making this ideal for training models, building features, and testing NLP pipelines. The dataset upload, as of 15 April 2025, includes high-utility elements such as abstracts, short descriptions, infobox-style key-value data, image links, and clearly segmented article sections (excluding references and other non-prose elements).
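For anyone who still has to work from raw wikitext rather than this structured JSON, a best-effort sketch with the third-party mwparserfromhell parser (my own example, not something the announcement mentions) looks roughly like this; per the caveat above it won't cover every edge case, and it leaves templates unexpanded:

    # Best-effort wikitext handling with mwparserfromhell (pip install mwparserfromhell).
    # It tokenizes the markup but does not expand templates, so it is not a
    # faithful renderer of what Wikipedia actually displays.
    import mwparserfromhell

    raw = "{{Infobox transistor}} The '''transistor''' is a [[semiconductor device]]."
    code = mwparserfromhell.parse(raw)

    plain_text = code.strip_code()       # drops link/formatting markup, keeps visible text
    templates = code.filter_templates()  # e.g. the infobox, left unexpanded

    print(plain_text)
    print([str(t.name).strip() for t in templates])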

replies(1): >>43709036 #
freeone3000 ◴[] No.43709036[source]
Just run your own copy of the Wikipedia software (MediaWiki). It’ll be cheaper than whatever inference you’re doing.
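A sketch of what that looks like in practice, assuming a local MediaWiki install answering at http://localhost:8080/api.php (adjust the path for your setup; templates referenced by the text also need to be imported from the dump for faithful output):

    # Render raw wikitext to HTML through your own MediaWiki's stock parse API.
    import requests

    def render(wikitext: str) -> str:
        resp = requests.post(
            "http://localhost:8080/api.php",  # assumed local install
            data={
                "action": "parse",
                "text": wikitext,
                "contentmodel": "wikitext",
                "format": "json",
                "formatversion": "2",
            },
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["parse"]["text"]  # rendered HTML

    html = render("The '''transistor''' is a [[semiconductor device]].")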
replies(1): >>43711015 #
paulryanrogers ◴[] No.43711015[source]
IDK why this was downvoted. Wikimedia wikitext can be transformed with some regular expressions. Not exactly fast, but likely far easier than playing cat and mouse with bot blockers.
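Something like this, as a rough first cut in Python; nested templates, tables, and parser functions need more care than a handful of regexes:

    # A few regex passes that cover the most common wikitext markup.
    import re

    def strip_wikitext(text: str) -> str:
        text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)  # inline references
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)                        # non-nested templates
        text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)     # [[target|label]] -> label
        text = re.sub(r"'{2,}", "", text)                                 # bold/italic quote markup
        return text.strip()

    print(strip_wikitext("The '''transistor''' is a [[semiconductor device|device]].<ref>cite</ref>"))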