The problem is the non-consumptive load where they just flat-out DDoS the site for no actual reason. They should be criminally charged for that.
Late edit: Individual page loads to answer specific questions aren't a problem either. DDoS is the problem.
Compare:
https://en.wikipedia.org/wiki/Transistor
with the raw markup seen in
https://en.wikipedia.org/w/index.php?title=Transistor&action...
That markup format is very hard to parse/render because it evolved organically to mean "whatever the Wikipedia software does." I haven't found an independent renderer that handles all of its edge cases correctly. The new Kaggle/Wikimedia collaboration seems to solve that problem for many use cases, since the announcement says:
This release is powered by our Snapshot API’s Structured Contents beta, which outputs Wikimedia project data in a developer-friendly, machine-readable format. Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content—making this ideal for training models, building features, and testing NLP pipelines.
The dataset upload, as of 15 April 2025, includes high-utility elements such as abstracts, short descriptions, infobox-style key-value data, image links, and clearly segmented article sections (excluding references and other non-prose elements).
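To make the difference concrete, here's a rough sketch of what consuming one of those JSON records might look like, with no wikitext parser involved. Big caveat: the field names (`name`, `abstract`, `infobox`, `sections`) are my guesses based on the elements listed in the announcement, not the dataset's documented schema, and the sample record is made up for illustration. Check the real files before relying on any of this.

```python
import json

# Made-up sample record. The keys below are ASSUMED for illustration;
# they are not taken from the dataset's documented schema.
SAMPLE_LINE = json.dumps({
    "name": "Transistor",
    "abstract": "A transistor is a semiconductor device.",
    "infobox": {"Type": "placeholder value"},
    "sections": [
        {"name": "History", "text": "placeholder text"},
        {"name": "Operation", "text": "placeholder text"},
    ],
})


def summarize(line: str) -> dict:
    """Pull a few fields out of one JSON record (one record per line)."""
    record = json.loads(line)
    sections = record.get("sections", [])
    return {
        "title": record.get("name", ""),
        "abstract": record.get("abstract", ""),
        # Section titles come straight from the structure; no markup parsing.
        "section_titles": [s.get("name", "") for s in sections],
        "infobox": dict(record.get("infobox", {})),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_LINE))
```

The point is just that every field is a plain dictionary lookup, and `.get()` with defaults keeps it from blowing up on records that lack a given element.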