
You can download twice-monthly database dumps, but they consist of the raw wikitext, so you need to do a bunch of extra work to render templates and stuff. Meanwhile, if you write a generic scraper, it can connect to Wikipedia like it connects to any other website and get the correctly-rendered HTML. People who aren't interested in Wikipedia specifically but want to download pretty much the entire internet unsurprisingly choose the latter option.
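To make the "extra work" concrete, here's a minimal sketch (illustrative snippet and regex, not a real converter): a naive script can turn `[[...]]` links into HTML, but a template call like `{{convert|330|m}}` can only be expanded by MediaWiki's template engine, which needs the template's own wikitext/Lua source.

```python
import re

# A line of raw wikitext as it might appear in a dump (illustrative example).
wikitext = "The [[Eiffel Tower]] is {{convert|330|m}} tall."

# Links are the easy part: a simple regex gets you most of the way.
html = re.sub(r"\[\[([^\]|]+)\]\]", r'<a href="/wiki/\1">\1</a>', wikitext)

# Templates are the hard part: {{convert|330|m}} survives untouched,
# because expanding it requires fetching and evaluating the Convert
# template exactly as MediaWiki would.
print(html)
assert "{{convert|330|m}}" in html
```

Scraping the live site sidesteps all of this, because MediaWiki has already done the template expansion server-side.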


As somebody who has wrestled with the Wikipedia dumps a number of times, I don't understand why Wikimedia doesn't release some sort of SDK that gives you the 'official' parse.



I have wrestled with it too. I believe it's because wikitext is an ad-hoc format that evolved organically, so the only 100% correct parser/renderer is the MediaWiki implementation itself. It's like asking for an SDK that correctly parses Perl: only Perl can do that.

There are a bunch of mostly-compatible third-party parsers in various languages. The best one I've found so far is Sweble, but even it mishandles a small percentage of rare cases.
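A sketch of why those rare cases bite, assuming a hypothetical third-party parser that handles inline markup with greedy longest-match-first regexes. This is not how any named parser actually works, just an illustration of the local-vs-global ambiguity:

```python
import re

def naive_inline(wikitext: str) -> str:
    # The obvious approach: match the longest apostrophe runs first.
    # MediaWiki's real quote-handling pass (doQuotes in Parser.php) is
    # context sensitive and re-balances unmatched runs across the whole
    # line, so a local regex sketch like this diverges on rare inputs.
    text = re.sub(r"'''''(.+?)'''''", r"<i><b>\1</b></i>", wikitext)
    text = re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)
    text = re.sub(r"''(.+?)''", r"<i>\1</i>", text)
    return text

print(naive_inline("'''bold''' and ''italic'' and '''''both'''''"))
```

The easy cases come out fine; it's lines with unbalanced apostrophe runs (say, `'''a'' b`) that have no locally correct answer, which is why only MediaWiki's full line-scanning implementation renders everything exactly right.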


This. I tried that a few years ago and fell off my chair when I realized how DIY the whole thing is. It's a bunch of unofficial scripts and half-assed, out-of-date help pages.

At the time I thought, well, it's a bunch of hippies with a small budget, who can blame them? Now I learn that there are 600 of them with a budget in the hundreds of millions?

This is becoming another Mozilla Foundation...


There are also dumps of Wikipedia in HTML format.


Do you mean the discontinued Enterprise HTML dumps https://dumps.wikimedia.org/other/enterprise_html/ or the even older discontinued static HTML dumps https://dumps.wikimedia.org/other/static_html_dumps/current/... or is there another set of dumps I'm not aware of?



