
You can download twice-monthly database dumps, but they consist of the raw wikitext, so you need to do a bunch of extra work to render templates and stuff. Meanwhile, if you write a generic scraper, it can connect to Wikipedia like it connects to any other website and get the correctly-rendered HTML. People who aren't interested in Wikipedia specifically but want to download pretty much the entire internet unsurprisingly choose the latter option.
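To make the "extra work" concrete, here's a minimal sketch (illustrative snippet and regex, not a real converter): a naive script can turn `[[...]]` links into HTML, but a template call like `{{convert|330|m}}` can only be expanded by MediaWiki's template engine, which needs the template's own wikitext/Lua source.

```python
import re

# A line of raw wikitext as it might appear in a dump (illustrative example).
wikitext = "The [[Eiffel Tower]] is {{convert|330|m}} tall."

# Links are the easy part: a simple regex gets you most of the way.
html = re.sub(r"\[\[([^\]|]+)\]\]", r'<a href="/wiki/\1">\1</a>', wikitext)

# Templates are the hard part: {{convert|330|m}} survives untouched,
# because expanding it requires fetching and evaluating the Convert
# template exactly as MediaWiki would.
print(html)
assert "{{convert|330|m}}" in html
```

Scraping the live site sidesteps all of this, because MediaWiki has already done the template expansion server-side.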


As somebody who has wrestled with the Wikipedia dumps a number of times, I don't understand why Wikimedia doesn't release some sort of SDK that gives you the 'official' parse.



I have wrestled with it too. I believe it's because wikitext is an ad-hoc format that evolved organically, so the only 100% correct parser/renderer is the MediaWiki implementation itself. It's like asking for an SDK that correctly parses Perl: only Perl can do that.

There are a bunch of mostly-compatible third-party parsers in various languages. The best one I've found so far is Sweble, but even it mishandles a small percentage of rare cases.
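A sketch of why those rare cases bite, assuming a hypothetical third-party parser that handles inline markup with greedy longest-match-first regexes. This is not how any named parser actually works, just an illustration of the local-vs-global ambiguity:

```python
import re

def naive_inline(wikitext: str) -> str:
    # The obvious approach: match the longest apostrophe runs first.
    # MediaWiki's real quote-handling pass (doQuotes in Parser.php) is
    # context sensitive and re-balances unmatched runs across the whole
    # line, so a local regex sketch like this diverges on rare inputs.
    text = re.sub(r"'''''(.+?)'''''", r"<i><b>\1</b></i>", wikitext)
    text = re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)
    text = re.sub(r"''(.+?)''", r"<i>\1</i>", text)
    return text

print(naive_inline("'''bold''' and ''italic'' and '''''both'''''"))
```

The easy cases come out fine; it's lines with unbalanced apostrophe runs (say, `'''a'' b`) that have no locally correct answer, which is why only MediaWiki's full line-scanning implementation renders everything exactly right.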


This. I tried that a few years ago and fell off my chair when I realized how DIY the whole thing is. It's a bunch of unofficial scripts and half-assed, out-of-date help pages.

At the time I thought, well, it's a bunch of hippies with a small budget, who can blame them? Now I learn that there are 600 of them with a budget in the hundreds of millions?

This is becoming another Mozilla Foundation...


There are also dumps of Wikipedia in HTML format.


Do you mean the discontinued Enterprise HTML dumps https://dumps.wikimedia.org/other/enterprise_html/ or the even older discontinued static HTML dumps https://dumps.wikimedia.org/other/static_html_dumps/current/... or is there another set of dumps I'm not aware of?



