https://github.com/mlabs-haskell/wikipedia_parser
- Host: GitHub
- URL: https://github.com/mlabs-haskell/wikipedia_parser
- Owner: mlabs-haskell
- Created: 2023-11-30T05:23:51.000Z (over 2 years ago)
- Default Branch: staging
- Last Pushed: 2025-02-03T01:51:10.000Z (over 1 year ago)
- Last Synced: 2025-02-03T02:30:07.077Z (over 1 year ago)
- Language: Rust
- Size: 610 KB
- Stars: 0
- Watchers: 4
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Wikipedia Parser
This is a tool for processing Wikipedia articles and extracting important information from them as plaintext.
## Instructions for use
1. Run `./download.sh` to download all of Wikipedia as a single XML file. This will likely take a long time.
2. Install Rust on your system via [these instructions](https://rustup.rs/).
3. Install `just` with `cargo install just`.
4. Run `just extract-links` to create the graph of Wikipedia.
5. Run `just extract-contents` to parse the contents of the Wikipedia articles. This will also take a long time.
6. Run `just extract-subgraph`, passing a root article and a number of degrees of separation, to produce the list of all articles within that many degrees of separation of the root. For instance, to find all articles within 5 degrees of separation of the article for RNA, run `just extract-subgraph RNA 5`.
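
The "degrees of separation" idea in step 6 can be illustrated with a depth-limited breadth-first search over the link graph. The sketch below is not taken from this repository. It assumes the graph is an in-memory adjacency list keyed by article title, that links are followed in one direction only, and that the root counts as part of the result. The function name `articles_within` and the sample titles other than RNA are hypothetical.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical helper: returns every article reachable from `root` by
/// following at most `max_degrees` links, including `root` itself.
/// `graph` maps an article title to the titles it links to (an assumption
/// about the graph's shape, not the tool's actual on-disk format).
fn articles_within(
    graph: &HashMap<String, Vec<String>>,
    root: &str,
    max_degrees: usize,
) -> HashSet<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<(String, usize)> = VecDeque::new();

    seen.insert(root.to_string());
    queue.push_back((root.to_string(), 0));

    // Breadth-first search: each queue entry carries its distance from the root.
    while let Some((title, depth)) = queue.pop_front() {
        // Articles at the depth limit are kept, but their links are not followed.
        if depth == max_degrees {
            continue;
        }
        if let Some(links) = graph.get(&title) {
            for next in links {
                // `insert` returns true only the first time a title is seen,
                // so each article is queued at its shortest distance.
                if seen.insert(next.clone()) {
                    queue.push_back((next.clone(), depth + 1));
                }
            }
        }
    }
    seen
}

fn main() {
    // Tiny made-up graph for illustration only.
    let mut graph: HashMap<String, Vec<String>> = HashMap::new();
    graph.insert("RNA".into(), vec!["DNA".into(), "Ribosome".into()]);
    graph.insert("DNA".into(), vec!["Genome".into()]);
    graph.insert("Genome".into(), vec!["Chromosome".into()]);

    let mut found: Vec<String> = articles_within(&graph, "RNA", 2).into_iter().collect();
    found.sort();
    println!("{:?}", found);
}
```

In this toy graph, "Chromosome" is three links away from "RNA", so a limit of 2 leaves it out. Breadth-first order matters here: it guarantees an article is first reached along a shortest path, so the depth check corresponds to the minimum number of links from the root.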