Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/jamesponddotco/wikiextract
[READ-ONLY] A word extractor for Wikipedia articles.
- Host: GitHub
- URL: https://github.com/jamesponddotco/wikiextract
- Owner: jamesponddotco
- License: gpl-2.0
- Created: 2023-04-15T20:46:24.000Z (over 1 year ago)
- Default Branch: trunk
- Last Pushed: 2024-04-08T23:13:30.000Z (9 months ago)
- Last Synced: 2024-04-09T00:41:30.220Z (9 months ago)
- Topics: crawler, crawling, diceware, go, wikipedia, wikipedia-crawler, word-extraction
- Language: Go
- Homepage: https://sr.ht/~jamesponddotco/wikiextract/
- Size: 38.1 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
# `wikiextract`
[![builds.sr.ht status](https://builds.sr.ht/~jamesponddotco/wikiextract.svg)](https://builds.sr.ht/~jamesponddotco/wikiextract?)
`wikiextract` is a word extractor for Wikipedia articles. It extracts
words longer than four characters from a given Wikipedia page or list of
pages and saves them to a file you can later use as the source for
generating [diceware passwords](https://en.wikipedia.org/wiki/Diceware).

## Installation
### From source
First install the dependencies:
- Go 1.22 or above.
- make.
- [scdoc](https://git.sr.ht/~sircmpwn/scdoc).

Switch to the latest stable tag, `v1.0.0`, then compile and install:
```bash
git checkout v1.0.0
make
sudo make install
```

## Usage
```bash
$ wikiextract --help
NAME:
   wikiextract - a simple word extractor for Wikipedia articles

USAGE:
   wikiextract [global options]

VERSION:
   1.0.0

GLOBAL OPTIONS:
   --input-url value, -u value [ --input-url value, -u value ]  the URL of the Wikipedia page
   --input-file value, -f value                                 a file containing a list of URLs
   --output value, -o value                                     the path to the output file
   --help, -h                                                   show help
   --version, -v                                                print the version

$ wikiextract -u 'https://en.wikipedia.org/wiki/Wikipedia' -o 'output.txt'
```

See _wikiextract(1)_ after installing for more information.
## Contributing
Anyone can help make `wikiextract` better. Send patches on the [mailing
list](https://lists.sr.ht/~jamesponddotco/wikiextract-devel) and report
bugs on the [issue
tracker](https://todo.sr.ht/~jamesponddotco/wikiextract).

You must sign off your work using `git commit --signoff`. See the
[Linux kernel developer's certificate of
origin](https://www.kernel.org/doc/html/latest/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin)
for more details.

All contributions are made under [the GPL-2.0 license](LICENSE.md).
## Resources
The following resources are available:
- [Support and general discussions](https://lists.sr.ht/~jamesponddotco/wikiextract-discuss).
- [Patches and development related questions](https://lists.sr.ht/~jamesponddotco/wikiextract-devel).
- [Instructions on how to prepare patches](https://git-send-email.io/).
- [Feature requests and bug reports](https://todo.sr.ht/~jamesponddotco/wikiextract).

---

Released under the [GPL-2.0 license](LICENSE.md).