https://github.com/daveshap/PlainTextWikipedia
Convert Wikipedia database dumps into plaintext files
Last synced: about 1 year ago
- Host: GitHub
- URL: https://github.com/daveshap/PlainTextWikipedia
- Owner: daveshap
- License: mit
- Archived: true
- Created: 2020-11-24T21:28:42.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2021-05-23T14:20:07.000Z (about 5 years ago)
- Last Synced: 2024-08-01T22:01:58.240Z (almost 2 years ago)
- Language: Python
- Size: 1.93 MB
- Stars: 301
- Watchers: 9
- Forks: 41
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-rainmana - daveshap/PlainTextWikipedia - Convert Wikipedia database dumps into plaintext files (Python)
README
# PlainTextWikipedia
Convert Wikipedia database dumps into plain text files (JSON). This can parse all of Wikipedia with pretty high fidelity. There's a copy of the output available on [Kaggle Datasets](https://www.kaggle.com/ltcmdrdata/plain-text-wikipedia-202011).
## QUICK START
1. Download and unzip a Wikipedia dump (see Data Sources below). Make sure you get a monolithic XML file.
2. Open up `wiki_to_text.py` and edit the filename to point at your XML file. Also update the `savedir` location.
3. Run `wiki_to_text.py`. It should take about 2.5 days to run, with some variation based on your CPU and storage speed.
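
To give a feel for what processing a monolithic XML dump involves, here is a minimal sketch that streams `<page>` elements and writes one JSON file per page. It is not the code in `wiki_to_text.py`: the function names `iter_pages` and `save_pages`, the output fields, and the `<id>.json` naming are assumptions made for illustration, and the sketch leaves the wiki markup in the text untouched rather than converting it to plain text.

```python
import json
import os
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_source):
    """Yield one dict per <page> element, parsing the dump incrementally."""
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        page = {"title": None, "id": None, "text": ""}
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                page["title"] = child.text
            elif name == "id" and page["id"] is None:
                # The first <id> inside a page is the page id;
                # later ones belong to the revision or contributor.
                page["id"] = child.text
            elif name == "text":
                page["text"] = child.text or ""
        yield page
        elem.clear()  # drop the finished page's children to limit memory use


def save_pages(xml_source, savedir):
    """Write each page to savedir as '<page id>.json' and return the count."""
    os.makedirs(savedir, exist_ok=True)
    count = 0
    for page in iter_pages(xml_source):
        path = os.path.join(savedir, f"{page['id']}.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(page, f, ensure_ascii=False)
        count += 1
    return count
```

Calling `save_pages("simplewiki.xml", "output/")` (both paths are placeholders) would walk the file once from start to finish, which is why the whole dump never has to fit in memory.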
## Data Sources
There are two primary data sources you'll want to use. See the table below for the root URL of each.
| Name | Description | Link |
|---|---|---|
| Simple English Wikipedia | This is only about 1GB unpacked and is therefore a great test set | [https://dumps.wikimedia.org/simplewiki/](https://dumps.wikimedia.org/simplewiki/) |
| English Wikipedia | This is all of English Wikipedia, so about 80GB unpacked | [https://dumps.wikimedia.org/enwiki/](https://dumps.wikimedia.org/enwiki/) |
Navigate into the latest dump. You're likely looking for the very first file in the download section. The files will look something like this:
- `enwiki-20210401-pages-articles-multistream.xml.bz2 18.1 GB`
- `simplewiki-20210401-pages-articles-multistream.xml.bz2 203.5 MB`
Download and extract these to a storage directory. I usually shorten the folder name and filename.
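
If you would rather extract the archive from Python than with a command-line tool, the standard-library `bz2` module can stream-decompress it. This is a minimal sketch; the helper name `extract_bz2` and the chunk size are assumptions, not part of this repository.

```python
import bz2
import shutil


def extract_bz2(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress src_path to dst_path, chunk_size bytes at a time."""
    # bz2.open reads multi-stream archives as one continuous file,
    # so the output is a single monolithic XML file.
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path
```

For example, `extract_bz2("simplewiki-20210401-pages-articles-multistream.xml.bz2", "simplewiki.xml")` writes the unpacked dump under a shortened filename. Copying in chunks keeps memory use flat even for the full English dump.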
## Legal
https://en.wikipedia.org/wiki/Wikipedia:Reusing_Wikipedia_content
Wikipedia is published under the [Creative Commons Attribution Share-Alike license (CC-BY-SA)](https://en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License).
My script is published under the MIT license, but this does not confer the same privileges to the material you convert with it.