https://github.com/pyk/wikipedia-dumps-extractor
Extract title and content from Wikipedia dumps data
- Host: GitHub
- URL: https://github.com/pyk/wikipedia-dumps-extractor
- Owner: pyk
- Created: 2018-11-30T13:09:10.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2018-11-30T13:19:51.000Z (over 7 years ago)
- Last Synced: 2025-02-07T17:22:38.228Z (over 1 year ago)
- Language: Python
- Homepage: https://id.m.wikipedia.org/wiki/Wikipedia:Unduh_basis_data
- Size: 1.95 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Wikipedia Dumps Extractor
This repository contains a Python script that I use to extract
title and content from
[Wikipedia dumps data](https://id.m.wikipedia.org/wiki/Wikipedia:Unduh_basis_data).
To run it, clone the repository:

    git clone https://github.com/pyk/wikipedia-dumps-extractor.git

Install the dependencies:

    pipenv install

Run the script:

    pipenv run python extract.py wiki-latest-pages-articles.xml

This script uses incremental parsing, so it doesn't consume too much
memory.
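
For readers curious what incremental parsing can look like, here is a minimal sketch using `xml.etree.ElementTree.iterparse` from the standard library. It is not the code in `extract.py`. The names `iter_pages`, `local_name`, and `SAMPLE` are made up for illustration, and the sample only assumes the usual dump layout of `<page>` elements holding a `<title>` and a `<text>`.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a real dump file (hypothetical data, same general layout).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Jakarta</title><revision><text>Ibu kota.</text></revision></page>
  <page><title>Bandung</title><revision><text>Kota di Jawa Barat.</text></revision></page>
</mediawiki>"""


def local_name(tag):
    """Strip the XML namespace: '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) per <page>; source is a file path or file object."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        # Only act once a <page> element has been read completely.
        if event != "end" or local_name(elem.tag) != "page":
            continue
        title, text = None, None
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text
        yield title, text or ""
        # Drop pages that were already handled so memory use stays flat.
        root.clear()


for title, text in iter_pages(io.BytesIO(SAMPLE)):
    print(title, len(text))
```

With a real dump, pass the file path instead of the in-memory sample, for example `iter_pages("wiki-latest-pages-articles.xml")`. Clearing the root element after each page is what keeps memory use low: without it, the parser would still build the full tree as it reads.
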
Enjoy.