136 points xnx | 2 comments
toomuchtodo ◴[] No.43708221[source]
This sounds good to take the ML/AI consumption load off Wikimedia infra?
replies(1): >>43708252 #
immibis ◴[] No.43708252[source]
The consumption load isn't the problem. You can download a complete dump of Wikipedia, and even if every AI company grabbed the newest dump every time it came out, the server load would be manageable - probably double-digit terabytes per month, which is nothing special these days. And if even that were a problem, they could charge a reasonable amount to ship it on a stack of BD-R discs, or heck, these companies can easily afford a leased line to Wikimedia HQ.

The problem is the non-consumptive load where they just flat-out DDoS the site for no actual reason. They should be criminally charged for that.

Late edit: Individual page loads to answer specific questions aren't a problem either. DDoS is the problem.
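For scale: pulling the latest full dump is a single streamed download from dumps.wikimedia.org. A rough Python sketch, assuming the standard pages-articles-multistream file (tens of GB compressed, so stream it in chunks):

    # Rough sketch: stream the latest English Wikipedia dump to disk.
    # Assumes the standard pages-articles-multistream bundle on
    # dumps.wikimedia.org; it's tens of GB compressed, so stream in chunks.
    import requests

    DUMP_URL = (
        "https://dumps.wikimedia.org/enwiki/latest/"
        "enwiki-latest-pages-articles-multistream.xml.bz2"
    )

    def download_dump(dest="enwiki-latest.xml.bz2", chunk_size=1 << 20):
        with requests.get(DUMP_URL, stream=True, timeout=60) as resp:
            resp.raise_for_status()
            with open(dest, "wb") as fh:
                for chunk in resp.iter_content(chunk_size=chunk_size):
                    fh.write(chunk)

    if __name__ == "__main__":
        download_dump()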

replies(2): >>43708581 #>>43710312 #
parpfish ◴[] No.43708581[source]
I'd assume that AI companies use the wiki dumps for training, but there are probably tons of bots that query Wikipedia over the web when doing some sort of web search or function call.
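Such a lookup probably amounts to something like this - a rough sketch against the public REST summary endpoint (the bot name and contact address are placeholders; Wikimedia asks for a descriptive User-Agent):

    # Rough sketch of the per-question lookup a tool-calling bot might do.
    # Uses the public REST summary endpoint; the UA string is a placeholder.
    from urllib.parse import quote
    import requests

    def wiki_summary(title: str) -> str:
        url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{quote(title)}"
        resp = requests.get(
            url,
            headers={"User-Agent": "example-bot/0.1 (contact@example.com)"},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json().get("extract", "")

    print(wiki_summary("Wikipedia"))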
replies(4): >>43708767 #>>43708888 #>>43708995 #>>43710626 #
noosphr ◴[] No.43710626[source]
You'd assume wrong.

I was interviewing at a tier-one AI lab, and the PM I was talking to refused to believe that the torrent dumps from Wikipedia were fresh and usable for training.

When you spend all your time fighting bot-detection measures, it's hard to imagine someone willingly putting their data out there for free.

replies(2): >>43710794 #>>43714217 #
1. kmeisthax ◴[] No.43710794[source]
As someone who has actually scraped Wikimedia Commons for AI training[0], I'd say that PM is correct only in the most literal sense. Wikitext is effectively unparseable, so using the data dump directly is a bad idea.

The correct way to do this is to stand up a copy of MediaWiki on your own infra and then scrape that. That will give you shittons of HTML to parse and tokenize. If you can't work with that, then you're not qualified to do this kind of thing, sorry.

[0] If you're wondering, I was scraping Wikimedia Commons directly from their public API, from my residential IP with my e-mail address in the UA. This was primarily out of laziness, but I believe this is the way you're "supposed" to use the API.

Yes, I did try to work with Wikitext directly, and yes that is a terrible idea.
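Concretely, the "scrape your own MediaWiki" route looks roughly like this - a sketch, assuming a local install with the dump already imported (the localhost URL and title list are placeholders):

    # Rough sketch: render pages to HTML via your own MediaWiki's action API,
    # then strip the markup down to text for tokenization.
    # Assumes a local MediaWiki with the XML dump imported; the URL and
    # title list are placeholders.
    import requests
    from bs4 import BeautifulSoup

    API = "http://localhost/w/api.php"  # your own MediaWiki, not Wikimedia's

    def rendered_text(title: str) -> str:
        resp = requests.get(
            API,
            params={"action": "parse", "page": title, "prop": "text",
                    "format": "json", "formatversion": "2"},
            timeout=30,
        )
        resp.raise_for_status()
        html = resp.json()["parse"]["text"]
        return BeautifulSoup(html, "html.parser").get_text(" ", strip=True)

    for title in ["Alan Turing", "Ada Lovelace"]:
        print(rendered_text(title)[:200])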

replies(1): >>43710871 #
2. noosphr ◴[] No.43710871[source]
This is starting to get into the philosophical question of what training data should look like.

In the same set of interviews, I made the point that the only way to meaningfully extract the semantics of a page meant for human consumption is with a vision model that treats typesetting as a guide to structure.

The perfect example was the contract they sent, which looked completely fine but was a Word document with only WYSIWYG formatting: headings were just extra-large bold text rather than being marked up as headings. If you used the programmatically extracted text as training data, you'd be in trouble.
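You can see the failure mode directly with python-docx - a rough sketch (the file name and the 14pt cutoff are illustrative assumptions): every "heading" comes back as an ordinary Normal-style paragraph whose structure exists only in the font size, which is exactly what a vision model would pick up and a text extractor would miss.

    # Rough sketch: why programmatic extraction of a WYSIWYG-formatted .docx
    # loses structure. "Headings" are just big bold runs in Normal style,
    # so style-based extraction sees a flat wall of body text.
    # The file name and the 14pt cutoff are illustrative assumptions.
    from docx import Document
    from docx.shared import Pt

    doc = Document("contract.docx")
    for para in doc.paragraphs:
        if not para.runs:
            continue
        run = para.runs[0]
        is_big = run.font.size is not None and run.font.size >= Pt(14)
        looks_like_heading = bool(run.bold) and is_big
        # Real headings would report para.style.name == "Heading 1", etc.;
        # here everything comes back as "Normal".
        label = "fake heading" if looks_like_heading else "body"
        print(f"{para.style.name:>10} | {label:>12} | {para.text[:60]}")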