https://github.com/caellian/wiki-extractor
Wikipedia dump download and cleanup tool
wikipedia wikipedia-scraper wikitext
- Host: GitHub
- URL: https://github.com/caellian/wiki-extractor
- Owner: Caellian
- License: gpl-3.0
- Created: 2024-06-15T15:12:17.000Z (12 months ago)
- Default Branch: trunk
- Last Pushed: 2024-06-22T01:35:47.000Z (12 months ago)
- Last Synced: 2025-02-13T07:42:47.403Z (4 months ago)
- Topics: wikipedia, wikipedia-scraper, wikitext
- Language: Rust
- Homepage:
- Size: 65.4 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# wiki-extractor
[Build](https://github.com/Caellian/wiki-extractor/actions/workflows/build.yml)
[License](https://github.com/Caellian/wiki-extractor#license)

A tool that turns Wikipedia dump files into easily consumable data.
There are already a few tools like this available in the wild, but none of them
properly supports the dump XML format or the MediaWiki markup format; they work
around them instead.

None of the existing tools takes advantage of the fact that dumps are stored in
bzip2 format, which can be streamed. This tool exploits that in order to stream
Wikipedia dumps directly into processed target files, **without storing
intermediate files**, which take up a lot of space (>20 GiB compressed).

Lastly, most existing tools don't actually parse the MediaWiki format; instead,
they use finite-state automata (i.e. regular expressions) to scrub MediaWiki
formatting and XML tags from the files. This is error-prone and produces very
noisy results, and it is also **much slower** than correctly parsing the
articles. Badly scraped data negatively affects the quality and size of trained
models, making them both larger and more error-prone.
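The streaming approach can be sketched in a few lines of Rust. This is an
illustration only, not code from this repository: the `reqwest` crate (with its
`blocking` feature), the `bzip2` crate, and the hard-coded dump URL are
assumptions made for the example.

```rust
// Sketch: decompress a multistream dump on the fly while downloading it,
// so no intermediate .bz2 or .xml file is ever written to disk.
use std::io::{BufRead, BufReader};

use bzip2::read::MultiBzDecoder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let url = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2";

    // The HTTP response body implements `Read`; wrapping it in a bzip2
    // decoder yields decompressed XML as the bytes arrive.
    let response = reqwest::blocking::get(url)?;
    let decoder = MultiBzDecoder::new(response);

    // Print the first few decompressed lines to show the stream works.
    for line in BufReader::new(decoder).lines().take(5) {
        println!("{}", line?);
    }
    Ok(())
}
```

A real extractor would feed the same decompressed reader into an XML and
wikitext parser instead of printing lines.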
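To see why single-pass regex scrubbing produces noisy output, consider nested
templates, which regular expressions cannot match recursively. Below is a
small, self-contained demonstration (not code from this repository; it assumes
the `regex` crate):

```rust
use regex::Regex;

fn main() {
    // A template nested inside another template, as commonly found in articles.
    let wikitext = "{{Infobox|name={{lang|hr|Zagreb}}}} '''Zagreb''' is a city.";

    // A single-pass stripper for {{...}} pairs that contain no nested braces.
    let template = Regex::new(r"\{\{[^{}]*\}\}").unwrap();
    let scrubbed = template.replace_all(wikitext, "");

    // Only the innermost template is removed; the outer one is left as noise:
    // "{{Infobox|name=}} '''Zagreb''' is a city."
    println!("{scrubbed}");
}
```

Repeating the substitution until the text stops changing helps with this
particular case, but it still mishandles unbalanced braces and markup whose
meaning depends on context, which is why properly parsing the articles gives
cleaner output.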
## Features

- Streams dump information directly from mirrors, without requiring the user to
download dumps up-front.
- Decompresses the stream automatically without requiring external tools for
extraction.
- Produces multiple useful outputs at once:
  - Sentence/text dump
  - Dictionary
  - Article metadata (WIP)
  - List of page redirections
- Can produce Markdown output if you want to train a model on that instead.
- Partial output is still usable, as articles are processed one-by-one in
  sequence (see the sketch below).
- Mirrors prefer that you don't download from them in parallel.

See the [Samples](./Samples.md) file for a more in-depth overview of the
produced files/output.
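The sketch below illustrates the sequential, multi-output processing idea from
the feature list: each article is fully handled and flushed before the next
one, so an interrupted run still leaves valid partial outputs. The `Article`
type and the output file names are placeholders, not this tool's actual API.

```rust
use std::fs::File;
use std::io::{BufWriter, Result, Write};

// Placeholder article type; the real tool works on parsed dump pages.
struct Article {
    title: String,
    text: String,
}

fn process(articles: impl Iterator<Item = Article>) -> Result<()> {
    // Several outputs are produced in one pass over the stream.
    let mut text_out = BufWriter::new(File::create("text.txt")?);
    let mut title_out = BufWriter::new(File::create("titles.txt")?);

    for article in articles {
        // One article at a time, fanned out to every output before moving on.
        writeln!(text_out, "{}", article.text)?;
        writeln!(title_out, "{}", article.title)?;
        text_out.flush()?;
        title_out.flush()?;
    }
    Ok(())
}

fn main() -> Result<()> {
    let sample = vec![Article {
        title: "Example".into(),
        text: "Example body text.".into(),
    }];
    process(sample.into_iter())
}
```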
## Usage

Download the latest stable binary from [Releases](https://github.com/Caellian/wiki-extractor/releases).
### Configuration
All configuration is done through CLI arguments.
Print arguments with:
```sh
wiki-extractor --help
wiki-extractor local --help
wiki-extractor remote --help
```

### Running
- Find the closest [mirror](https://dumps.wikimedia.org/mirrors.html) to make the download faster.
  - The URL to use is the part before the `/enwiki/latest` segment once you locate the dump files.
- Replace the URL in the command below with your mirror of choice:

```sh
wiki-extractor remote https://dumps.wikimedia.org/ -L en -w latest -o
```

- Kick back and relax.
## Contributing
Contributions are very welcome. There are several `TODO` and `FIXME` comments in
the code that might offer ideas for wanted/high-priority contributions, and you
can also just contribute what you personally need.

## License
This tool is licensed under the GPLv3 license. A copy of the license is
available in the [`LICENSE`](./LICENSE) file.

This license applies only to the provided source code and binaries built from
it; it does not apply to generated dump files. Produced dump files are licensed
under the same license as Wikipedia content:
[CC-BY-SA](https://en.wikipedia.org/wiki/Wikipedia:CCBYSA).