Wikipedia Text Dump

 
 

Wikipedia is an essential source for anyone seeking information on the web. The English version features over 4 million articles. Studies show that it is also the number one source of plagiarism, so when we created our new translational plagiarism checker, we looked for a way to add this vast source of information to our database. We found that the whole database cannot be downloaded in an easy-to-handle format (such as HTML or plain text), and that all the available MediaWiki converters had some flaws. So we wrote a converter that turns MediaWiki XML dumps into plain text. We run it every time a new database dump appears on the site and publish the text version for everybody to use.

You can download it from here: http://kopiwiki.dsd.sztaki.hu/
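
To give a rough idea of what converting a MediaWiki XML dump to plain text involves, here is a minimal sketch in Python. It is not the converter described above. The function names (`iter_pages`, `strip_markup`), the sample page, and the namespace URI in the sample are assumptions made for illustration, and the markup rules cover only a few simple cases (non-nested templates, internal links, bold and italic quotes).

```python
import io
import re
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def strip_markup(wikitext):
    """Very rough wikitext cleanup; real dumps need far more rules."""
    # Remove simple, non-nested templates such as {{stub}}.
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)
    # Turn [[target|label]] into label and [[target]] into target.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # Remove the quote runs used for bold and italic text.
    text = re.sub(r"'{2,}", "", text)
    return text.strip()


def iter_pages(stream):
    """Yield (title, plain_text) pairs while streaming through the dump."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, strip_markup(text)
            elem.clear()  # drop the finished page's children to limit memory use


# Hypothetical one-page dump; the namespace version is an assumption.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision>
      <text>'''Example''' is a [[test page|sample]] article.{{stub}}</text>
    </revision>
  </page>
</mediawiki>"""

if __name__ == "__main__":
    for page_title, body in iter_pages(io.BytesIO(SAMPLE)):
        print(page_title)
        print(body)
```

Streaming with `iterparse` matters here because a full dump is far too large to load as one tree. Nested templates, tables, references, and other markup are where simple rules like these break down, which is presumably the kind of flaw a dedicated converter has to address.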

 
