The problem is the non-consumptive load where they just flat-out DDoS the site for no actual reason. They should be criminally charged for that.
Late edit: Individual page loads to answer specific questions aren't a problem either. DDoS is the problem.
- English: "Size of uncompressed dataset: 79.57 GB chunked by max 2.15GB."
- French: "Size of the uncompressed dataset: 34.01 GB chunked by max 2.15GB."
In 2025, the standards for ML datasets are quite high.
I think you're referring to the Wikimedia governance change vote (https://news.ycombinator.com/item?id=41049562), not acceptance of content:
"How the Regime Captured Wikipedia" (piratewires.com) https://news.ycombinator.com/item?id=41167891
... or more recently, content written by/assisted by AI.
Or else what?
Compare:
https://en.wikipedia.org/wiki/Transistor
with the raw markup seen in
https://en.wikipedia.org/w/index.php?title=Transistor&action...
That markup format is very hard to parse/render because it evolved organically to mean "whatever Wikipedia software does." I haven't found an independent renderer that handles all of its edge cases correctly. The new Kaggle/Wikimedia collaboration seems to solve that problem for many use cases, since the announcement says
This release is powered by our Snapshot API’s Structured Contents beta, which outputs Wikimedia project data in a developer-friendly, machine-readable format. Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content—making this ideal for training models, building features, and testing NLP pipelines.
The dataset upload, as of 15 April 2025, includes high-utility elements such as abstracts, short descriptions, infobox-style key-value data, image links, and clearly segmented article sections (excluding references and other non-prose elements).
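As a rough illustration of what "work directly with well-structured JSON" can look like, here's a minimal sketch for streaming one of the chunked files. The chunk file name and the field names ("name", "abstract", "infoboxes", "sections") are assumptions based on the announcement's description, not a confirmed schema; check the dataset page before relying on them.

```python
import json

def iter_articles(path):
    """Yield one parsed article record per line of a JSON Lines chunk."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Hypothetical chunk file name from the English dump.
for article in iter_articles("enwiki_chunk_0001.jsonl"):
    title = article.get("name", "")
    abstract = article.get("abstract", "")
    infoboxes = article.get("infoboxes", [])  # infobox-style key-value data, if present
    sections = article.get("sections", [])    # prose sections, references excluded
    print(title, "-", abstract[:80])
    break  # just inspect the first record
```

No wikitext parsing, no HTML scraping: each record is already segmented, which is the whole selling point versus the raw markup above.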
('Issues: Other' https://www.kaggle.com/contact#/other/issue )
they might do something about it.
I don't know what they're expecting others to use this data for, but if it's the same old same old LLM training data scraping, then you've got a perfectly good repository of syntactically correct, semantically coherent strings of characters and words in whatever languages Wikipedia supports. That is entirely reliable data. Whether or not it's also factually accurate doesn't matter. Language modeling doesn't require factual accuracy. That comes from some later training step if you care about that.
If you're trying to use it as a repository not of language examples but of facts, then recognize the limitations. Wikipedia itself performs no verification, no fact checking, and by design does not assure you that its content is factually accurate. Instead, it assures you that all claims of fact come along with citations of sources that meet some extremely lightweight definition of authority. Thus statements of the form "A claims X" should be viewed as statements that Wikipedia is saying are true. However, statements simply of the form "X" found on Wikipedia are not statements that Wikipedia is claiming are true.
It's up to the consumer of Wikipedia data to recognize this and do what they can with it.
If you look up controversial topics, you also have the option of looking at the discussion or edit history to get a grasp of where the disagreements are.
When I made simple, objective, anonymous contributions, they were usually also accepted.
People seem unreasonably upset with Wikipedia for reasons unclear to me. Despite all the political pressure it constantly receives, it's a great resource that has largely stuck to its mission of creating a free, verifiable, neutral encyclopedia. It's probably the best thing the web has created.
Like who? I've never met anyone who has been upset with Wikipedia. I don't suppose the vast majority of people I know ever think about Wikipedia at all. I expect some of them don't even know what Wikipedia is.
I might buy "a person I once encountered was upset with Wikipedia one time". Stranger things have happened. But as a trend across people...?
1. People usually have distinct characteristics (e.g. eyes, nose, mouth arranged in a certain way), so there is rarely confusion as to whether or not something is a person. What has cast your doubt?
2. Nothing, person, computer, or otherwise, is upset with Wikipedia in this thread. I suspect, as emphasized by the non-response, that nothing has ever been upset with Wikipedia at any time or place. What would there be to be upset about, even if only theoretically?
As for the size, it's small compared to the training data of most LLMs, but large relative to their context size. Probably best used for retrieval-augmented generation or similar information retrieval applications.
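To make the retrieval suggestion concrete, here is a minimal sketch over pre-extracted article sections. TF-IDF is used purely as a stand-in for whatever embedding model you'd actually pick, and the toy corpus stands in for sections pulled out of the dump (e.g. by a loader like the one sketched earlier).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for article sections extracted from the dump.
sections = [
    "A transistor is a semiconductor device used to amplify or switch signals.",
    "The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",
    "Wikipedia requires that claims of fact be supported by citations to sources.",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(sections)

def retrieve(query, k=2):
    """Return the k sections most similar to the query, to feed an LLM as context."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix)[0]
    ranked = sorted(range(len(sections)), key=lambda i: scores[i], reverse=True)
    return [(sections[i], float(scores[i])) for i in ranked[:k]]

print(retrieve("what does a transistor do"))
```

The point is that the corpus never has to fit in a context window; only the few retrieved sections do.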
I was at an interview for a tier-one AI lab, and the PM I was talking to refused to believe that the torrent dumps from Wikipedia were fresh and usable for training.
When you spend all your time fighting bot detection measures it's hard to imagine someone willingly putting up their data out there for free.
The correct way to do this is to stand up a copy of MediaWiki on your own infra and then scrape that. That will give you shittons of HTML to parse and tokenize. If you can't work with that, then you're not qualified to do this kind of thing, sorry.
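A minimal sketch of what that scrape can look like, assuming you've imported a dump into a MediaWiki instance reachable on localhost (the base URL is an assumption; point it at wherever your api.php lives). The standard action API's action=parse returns the parser-rendered HTML for a page.

```python
import requests

API_URL = "http://localhost:8080/api.php"  # your own MediaWiki instance, not Wikipedia's

def rendered_html(title):
    """Fetch the parser-rendered HTML for one page title via the action API."""
    resp = requests.get(API_URL, params={
        "action": "parse",
        "page": title,
        "prop": "text",
        "format": "json",
        "formatversion": "2",
    }, timeout=30)
    resp.raise_for_status()
    return resp.json()["parse"]["text"]

html = rendered_html("Transistor")
print(html[:500])  # hand this off to your HTML parser/tokenizer of choice
```

Since it's your own infrastructure, you can hammer it as hard as you like without being part of the DDoS problem upthread.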
[0] If you're wondering, I was scraping Wikimedia Commons directly from their public API, from my residential IP with my e-mail address in the UA. This was primarily out of laziness, but I believe this is the way you're "supposed" to use the API.
Yes, I did try to work with Wikitext directly, and yes that is a terrible idea.
From the same set of interviews I made the point that the only way to meaningfully extract the semantics of a page meant for human consumption is to use a vision model that uses typesetting as a guide for structure.
The perfect example was the contract they sent, which looked completely fine but was a Word document with only WYSIWYG formatting, e.g. headings were just extra-large bold text rather than marked up as headings. If you used the programmatically extracted text as training data you'd be in trouble.
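To see why, here's a hedged sketch using python-docx (my choice of extractor, not necessarily theirs; the file name is hypothetical). Paragraphs that look like headings on screen still report the "Normal" style, so the structure is invisible to plain-text extraction.

```python
from docx import Document  # pip install python-docx

doc = Document("contract.docx")  # hypothetical WYSIWYG-formatted contract

for para in doc.paragraphs:
    text = para.text.strip()
    if not text or not para.runs:
        continue
    # "Heading-like": every run is bold and at least 14pt, but no Heading style is set.
    looks_like_heading = all(
        run.bold and run.font.size is not None and run.font.size.pt >= 14
        for run in para.runs
    )
    if looks_like_heading and not para.style.name.startswith("Heading"):
        # Plain-text extraction would emit this as just another paragraph.
        print("WYSIWYG-only heading:", text)
```

A vision model reading the rendered page gets the typesetting cue for free; the text extractor has to reverse-engineer it like this, if it bothers at all.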
Anyways, just a story on small-time closed data for reference.
(Also, FYI, I've previously posted feedback pieces in Kaggle forums that got a very warm direct response from the executives, although that was before the acquisition.)
So, for the average website, you'd be right, but not for Google Cloud/Colab's showcase property.
Bot traffic is notoriously stupid: reloading the same pages over and over, surging one hour and gone the next, getting stuck in loops, not understanding HTTP response codes... It's only gotten worse with all the AI scrapers. Somehow, they seem even more poorly written than the search engine bots.