The problem is the non-consumptive load where they just flat-out DDoS the site for no actual reason. They should be criminally charged for that.
Late edit: Individual page loads to answer specific questions aren't a problem either. DDoS is the problem.
- English: "Size of uncompressed dataset: 79.57 GB chunked by max 2.15GB."
- French: "Size of the uncompressed dataset: 34.01 GB chunked by max 2.15GB."
In 2025, the standards for ML datasets are quite high.
I think you're referring to the Wikimedia governance change vote (https://news.ycombinator.com/item?id=41049562), not acceptance of content:
"How the Regime Captured Wikipedia" (piratewires.com) https://news.ycombinator.com/item?id=41167891
... or more recently, content written by/assisted by AI.
Or else what?
Compare:
https://en.wikipedia.org/wiki/Transistor
with the raw markup seen in
https://en.wikipedia.org/w/index.php?title=Transistor&action...
That markup format is very hard to parse/render because it evolved organically to mean "whatever Wikipedia software does." I haven't found an independent renderer that handles all of its edge cases correctly. The new Kaggle/Wikimedia collaboration seems to solve that problem for many use cases, since the announcement says
This release is powered by our Snapshot API’s Structured Contents beta, which outputs Wikimedia project data in a developer-friendly, machine-readable format. Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content—making this ideal for training models, building features, and testing NLP pipelines.
The dataset upload, as of 15 April 2025, includes high-utility elements such as abstracts, short descriptions, infobox-style key-value data, image links, and clearly segmented article sections (excluding references and other non-prose elements).
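As a rough illustration of what "work directly with well-structured JSON" can look like, here's a minimal sketch for streaming one of the chunked files. The chunk file name and the field names ("name", "abstract", "infoboxes", "sections") are assumptions based on the announcement's description, not a confirmed schema; check the dataset page before relying on them.

```python
import json

def iter_articles(path):
    """Yield one parsed article record per line of a JSON Lines chunk."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Hypothetical chunk file name from the English dump.
for article in iter_articles("enwiki_chunk_0001.jsonl"):
    title = article.get("name", "")
    abstract = article.get("abstract", "")
    infoboxes = article.get("infoboxes", [])  # infobox-style key-value data, if present
    sections = article.get("sections", [])    # prose sections, references excluded
    print(title, "-", abstract[:80])
    break  # just inspect the first record
```

No wikitext parsing, no HTML scraping: each record is already segmented, which is the whole selling point versus the raw markup above.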
('Issues: Other' https://www.kaggle.com/contact#/other/issue )
they might do something about it.
I don't know what they're expecting others to use this data for, but if it's the same old same old LLM training data scraping, then you've got a perfectly good repository of syntactically correct, semantically coherent strings of characters and words in whatever languages Wikipedia supports. That is entirely reliable data. Whether or not it's also factually accurate doesn't matter. Language modeling doesn't require factual accuracy. That comes from some later training step if you care about that.
If you're trying to use it as a repository not of language examples but of facts, then recognize the limitations. Wikipedia itself performs no verification, no fact checking, and by design does not assure you that its content is factually accurate. Instead, it assures you that all claims of fact come along with citations of sources that meet some extremely lightweight definition of authority. Thus statements of the form "A claims X" should be viewed as statements that Wikipedia is saying are true. However, statements simply of the form "X" found on Wikipedia are not statements that Wikipedia is claiming are true.
It's up to the consumer of Wikipedia data to recognize this and do what they can with it.
If you look up controversial topics, you also have the option of looking at the discussion or edit history to get a grasp of where the disagreements are.
When I made simple, objective, anonymous contributions, they were usually also accepted.
People seem unreasonably upset with Wikipedia for reasons unclear to me. Despite all the political pressure it constantly receives, it's a great resource that has largely stuck to its mission of creating a free, verifiable, neutral encyclopedia. It's probably the best thing the web has created.
Like who? I've never met anyone who has been upset with Wikipedia. I don't suppose the vast majority of people I know ever think about Wikipedia at all. I expect some of them don't even know what Wikipedia is.
I might buy "a person I once encountered was upset with Wikipedia one time". Stranger things have happened. But as a trend across people...?
1. People usually have distinct characteristics (e.g. eyes, nose, mouth arranged in a certain way), so there is rarely confusion as to whether or not something is a person. What has cast your doubt?
2. Nothing, person, computer, or otherwise, is upset with Wikipedia in this thread. I suspect, as emphasized by the non-response, that nothing has ever been upset with Wikipedia at any time or place. What would there be to be upset about, even if only theoretically?
As for the size, it's small compared to the training data of most LLMs, but large relative to their context size. Probably best used for retrieval-augmented generation or similar information retrieval applications.
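To make the retrieval suggestion concrete, here is a minimal sketch over pre-extracted article sections. TF-IDF is used purely as a stand-in for whatever embedding model you'd actually pick, and the toy corpus stands in for sections pulled out of the dump (e.g. by a loader like the one sketched earlier).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for article sections extracted from the dump.
sections = [
    "A transistor is a semiconductor device used to amplify or switch signals.",
    "The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",
    "Wikipedia requires that claims of fact be supported by citations to sources.",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(sections)

def retrieve(query, k=2):
    """Return the k sections most similar to the query, to feed an LLM as context."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix)[0]
    ranked = sorted(range(len(sections)), key=lambda i: scores[i], reverse=True)
    return [(sections[i], float(scores[i])) for i in ranked[:k]]

print(retrieve("what does a transistor do"))
```

The point is that the corpus never has to fit in a context window; only the few retrieved sections do.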
I was at an interview for a tier-one AI lab, and the PM I was talking to refused to believe that the torrent dumps from Wikipedia were fresh and usable for training.
When you spend all your time fighting bot detection measures it's hard to imagine someone willingly putting up their data out there for free.
The correct way to do this is to stand up a copy of MediaWiki on your own infra and then scrape that. That will give you shittons of HTML to parse and tokenize. If you can't work with that, then you're not qualified to do this kind of thing, sorry.
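A minimal sketch of what that scrape can look like, assuming you've imported a dump into a MediaWiki instance reachable on localhost (the base URL is an assumption; point it at wherever your api.php lives). The standard action API's action=parse returns the parser-rendered HTML for a page.

```python
import requests

API_URL = "http://localhost:8080/api.php"  # your own MediaWiki instance, not Wikipedia's

def rendered_html(title):
    """Fetch the parser-rendered HTML for one page title via the action API."""
    resp = requests.get(API_URL, params={
        "action": "parse",
        "page": title,
        "prop": "text",
        "format": "json",
        "formatversion": "2",
    }, timeout=30)
    resp.raise_for_status()
    return resp.json()["parse"]["text"]

html = rendered_html("Transistor")
print(html[:500])  # hand this off to your HTML parser/tokenizer of choice
```

Since it's your own infrastructure, you can hammer it as hard as you like without being part of the DDoS problem upthread.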
[0] If you're wondering, I was scraping Wikimedia Commons directly from their public API, from my residential IP with my e-mail address in the UA. This was primarily out of laziness, but I believe this is the way you're "supposed" to use the API.
Yes, I did try to work with Wikitext directly, and yes that is a terrible idea.
From the same set of interviews I made the point that the only way to meaningfully extract the semantics of a page meant for human consumption is to use a vision model that uses typesetting as a guide for structure.
The perfect example was the contract they sent, which looked completely fine but was a Word document with only WYSIWYG formatting, e.g. headings were just extra-large bold text rather than marked up as headings. If you used the programmatically extracted text as training data you'd be in trouble.
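To see why, here's a hedged sketch using python-docx (my choice of extractor, not necessarily theirs; the file name is hypothetical). Paragraphs that look like headings on screen still report the "Normal" style, so the structure is invisible to plain-text extraction.

```python
from docx import Document  # pip install python-docx

doc = Document("contract.docx")  # hypothetical WYSIWYG-formatted contract

for para in doc.paragraphs:
    text = para.text.strip()
    if not text or not para.runs:
        continue
    # "Heading-like": every run is bold and at least 14pt, but no Heading style is set.
    looks_like_heading = all(
        run.bold and run.font.size is not None and run.font.size.pt >= 14
        for run in para.runs
    )
    if looks_like_heading and not para.style.name.startswith("Heading"):
        # Plain-text extraction would emit this as just another paragraph.
        print("WYSIWYG-only heading:", text)
```

A vision model reading the rendered page gets the typesetting cue for free; the text extractor has to reverse-engineer it like this, if it bothers at all.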
Anyways, just a story on small-time closed data for reference.
(Also, FYI, I've previously posted feedback pieces in Kaggle forums that got a very warm direct response from the executives, although that was before the acquisition.)
So, for the average website, you'd be right, but not for Google Cloud/Colab's showcase property.
Bot traffic is notoriously stupid: reloading the same pages over and over, surging one hour and gone the next, getting stuck in loops, not understanding HTTP response codes... It's only gotten worse with all the AI scrapers. Somehow, they seem even more poorly written than the search engine bots.