https://github.com/jamesdconley/wikidump_parser
Tool for parsing Wikipedia dumps into simpler formats
- Host: GitHub
- URL: https://github.com/jamesdconley/wikidump_parser
- Owner: JamesDConley
- License: unlicense
- Created: 2024-01-19T02:58:30.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-01-21T16:59:46.000Z (over 2 years ago)
- Last Synced: 2025-10-11T16:19:14.625Z (9 months ago)
- Language: Python
- Size: 6.84 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Wikidump Parser
A toolkit for extracting article text from Wikipedia dumps.
Features include:
- [x] Extracting article names, basic metadata, and article wikitext
- [x] Identifying other articles mentioned in each article (Useful for graphs!)
- [ ] Sorting the article data by mentions (needs cleanup)
- [ ] Simplifying the wikitext contents (needs cleanup)
- [ ] Creating a memory mapped object for efficient random access of text (needs cleanup)
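
To make the first two features concrete, here is a minimal sketch of how titles, wikitext, and linked article names could be pulled out of a MediaWiki XML export. It is not the code in this repo: `iter_articles`, `linked_titles`, `WIKILINK_RE`, and the tiny `SAMPLE` document are hypothetical names made up for illustration, and the link pattern only covers plain `[[Target]]`, `[[Target|label]]`, and `[[Target#Section]]` forms.

```python
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: captures the target title of [[Target]],
# [[Target|label]], and [[Target#Section]] style links.
WIKILINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

# Tiny stand-in for a real dump, which would be far larger.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Alpha</title>
    <revision><text>See [[Beta]] and [[Gamma|the third]].</text></revision>
  </page>
  <page>
    <title>Beta</title>
    <revision><text>Back to [[Alpha#History]].</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Strip the XML namespace prefix, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(xml_stream):
    """Yield (title, wikitext) pairs from a binary XML stream, one page at a time."""
    title, text = None, ""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the finished page's children to limit memory use


def linked_titles(wikitext):
    """Return the article titles linked from a piece of wikitext, in order."""
    return [target.strip() for target in WIKILINK_RE.findall(wikitext)]


pages = list(iter_articles(io.BytesIO(SAMPLE.encode("utf-8"))))
for title, text in pages:
    print(title, "->", linked_titles(text))
```

Streaming with `iterparse` avoids loading the whole dump at once. For a compressed dump, a file object from the standard library's `bz2.open(path, "rb")` could be passed in place of the `BytesIO` stream, assuming the dump is bzip2-compressed.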
See `convert.sh` for example usage.
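
For the memory-mapped random access item in the feature list, one possible approach is to write all article text into a single file and keep a list of byte offsets, then slice into an `mmap` on demand. The sketch below is an assumption about how that could look, not this repo's implementation; `write_corpus` and `MappedCorpus` are hypothetical names.

```python
import mmap
import os
import tempfile


def write_corpus(texts, path):
    """Write texts back to back as UTF-8; return a list of (offset, length) byte spans."""
    spans = []
    offset = 0
    with open(path, "wb") as f:
        for text in texts:
            data = text.encode("utf-8")
            f.write(data)
            spans.append((offset, len(data)))
            offset += len(data)
    return spans


class MappedCorpus:
    """Read-only random access to texts stored in one memory-mapped file."""

    def __init__(self, path, spans):
        self._file = open(path, "rb")
        # Length 0 maps the whole file; the file must not be empty.
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._spans = spans

    def __len__(self):
        return len(self._spans)

    def __getitem__(self, index):
        offset, length = self._spans[index]
        return self._mm[offset:offset + length].decode("utf-8")

    def close(self):
        self._mm.close()
        self._file.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.bin")
    spans = write_corpus(["first article", "second article ünïcode", "third"], path)
    corpus = MappedCorpus(path, spans)
    second = corpus[1]   # only the bytes of this one article are decoded
    count = len(corpus)
    corpus.close()       # close the mapping before the directory is removed
```

Spans are stored in bytes rather than characters so that multi-byte UTF-8 text slices correctly. In a real pipeline the span list would itself need to be persisted next to the data file.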
# Note on Functionality
I'm cobbling this repo together from several scripts and notebooks I wrote.
As of this commit I have not tested the individual scripts or `convert.sh`, but I plan to debug them against a new Wikipedia dump once it finishes downloading.
Feel free to open issues for requests or help.
Thanks for reading :)