[READ-ONLY] A word extractor for Wikipedia articles.
https://github.com/jamesponddotco/wikiextract


# `wikiextract`

[![builds.sr.ht status](https://builds.sr.ht/~jamesponddotco/wikiextract.svg)](https://builds.sr.ht/~jamesponddotco/wikiextract?)

`wikiextract` is a word extractor for Wikipedia articles. It extracts
words longer than four characters from a given Wikipedia page or list of
pages and saves them to a file that you can later use as the source for
generating [diceware passwords](https://en.wikipedia.org/wiki/Diceware).
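The filtering step described above can be illustrated with a minimal sketch. This is not the project's actual implementation: the function name `extractWords`, the lowercasing, the deduplication, and the alphabetical sort are all assumptions made for illustration. Only the "longer than four characters" rule comes from the description.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
	"unicode/utf8"
)

// extractWords is a hypothetical helper, not part of wikiextract. It returns
// the unique, lowercased words in text that are longer than minLen
// characters, sorted alphabetically.
func extractWords(text string, minLen int) []string {
	// Treat every non-letter rune as a separator.
	fields := strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r)
	})

	seen := make(map[string]struct{})
	words := make([]string, 0, len(fields))

	for _, field := range fields {
		word := strings.ToLower(field)

		// Count runes rather than bytes so non-ASCII words are measured correctly.
		if utf8.RuneCountInString(word) <= minLen {
			continue
		}

		// Skip words that were already collected (assumed behavior).
		if _, ok := seen[word]; ok {
			continue
		}

		seen[word] = struct{}{}
		words = append(words, word)
	}

	sort.Strings(words)

	return words
}

func main() {
	text := "Wikipedia is a free online encyclopedia, written by volunteers."

	for _, word := range extractWords(text, 4) {
		fmt.Println(word)
	}
}
```

Short words such as "free" are dropped because they do not exceed the four-character threshold.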

## Installation

### From source

First install the dependencies:

- Go 1.22 or above.
- make.
- [scdoc](https://git.sr.ht/~sircmpwn/scdoc).

Switch to the latest stable tag, `v1.0.0`, then compile and install:

```bash
git checkout v1.0.0
make
sudo make install
```

## Usage

```bash
$ wikiextract --help
NAME:
   wikiextract - a simple word extractor for Wikipedia articles

USAGE:
   wikiextract [global options]

VERSION:
   1.0.0

GLOBAL OPTIONS:
   --input-url value, -u value [ --input-url value, -u value ]  the URL of the Wikipedia page
   --input-file value, -f value                                 a file containing a list of URLs
   --output value, -o value                                     the path to the output file
   --help, -h                                                   show help
   --version, -v                                                print the version

$ wikiextract -u 'https://en.wikipedia.org/wiki/Wikipedia' -o 'output.txt'
```

See _wikiextract(1)_ after installing for more information.

## Contributing

Anyone can help make `wikiextract` better. Send patches on the [mailing
list](https://lists.sr.ht/~jamesponddotco/wikiextract-devel) and report
bugs on the [issue
tracker](https://todo.sr.ht/~jamesponddotco/wikiextract).

You must sign off your work using `git commit --signoff`. See the
[Linux kernel developer's certificate of
origin](https://www.kernel.org/doc/html/latest/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin)
for more details.

All contributions are made under [the GPL-2.0 license](LICENSE.md).

## Resources

The following resources are available:

- [Support and general discussions](https://lists.sr.ht/~jamesponddotco/wikiextract-discuss).
- [Patches and development-related questions](https://lists.sr.ht/~jamesponddotco/wikiextract-devel).
- [Instructions on how to prepare patches](https://git-send-email.io/).
- [Feature requests and bug reports](https://todo.sr.ht/~jamesponddotco/wikiextract).

---

Released under the [GPL-2.0 license](LICENSE.md).