Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/tomeraberbach/wikipedia-ngrams
📚 A Kotlin project which extracts ngram counts from Wikipedia data dumps.
- Host: GitHub
- URL: https://github.com/tomeraberbach/wikipedia-ngrams
- Owner: TomerAberbach
- License: mit
- Created: 2019-04-25T00:38:02.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2023-07-03T04:14:32.000Z (over 1 year ago)
- Last Synced: 2024-10-12T01:23:30.911Z (about 1 month ago)
- Topics: cli, extracts-ngram-counts, kotlin, ngram, ngrams, nlp, wikiextractor, wikipedia, wikipedia-corpus, wikipedia-data-dump, wikipedia-dump, wikipedia-ngrams
- Language: Kotlin
- Homepage:
- Size: 13.7 KB
- Stars: 3
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
- License: license
README
# Wikipedia Ngrams
> A Kotlin project which extracts ngram counts from Wikipedia data dumps.
## Download
Download the latest jar from [releases](https://github.com/TomerAberbach/wikipedia-ngrams/releases).
You can also clone the repository and build with [maven](https://maven.apache.org/download.cgi):
```sh
$ git clone https://github.com/TomerAberbach/wikipedia-ngrams.git
$ cd wikipedia-ngrams
$ mvn package
```
A fat jar called `wikipedia-ngrams-VERSION-jar-with-dependencies.jar` will be in a newly created `target` directory.
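The Usage section below refers to the jar simply as `wikipedia-ngrams.jar`. One way to line the names up (purely a convenience step; `VERSION` stands for whatever version the Maven build produced):
```sh
# Copy the fat jar to the name used in the Usage examples below
$ cp target/wikipedia-ngrams-VERSION-jar-with-dependencies.jar wikipedia-ngrams.jar
```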
## Usage
DISCLAIMER: Many of these commands will take a very long time to run.
Download the latest [Wikipedia data dump](https://meta.wikimedia.org/wiki/Data_dumps/Download_tools) using `wget`:
```sh
$ wget -np -nd -c https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
```
Or using `axel`:
```sh
$ axel --num-connections=3 https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
```
To speed up the download, replace `https://dumps.wikimedia.org` with the [mirror](https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps) closest to you.
Once downloaded, decompress the archive using a tool like `lbzip2` (a sketch follows).
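A minimal decompression sketch, assuming the dump was saved under the filename from the `wget` command above (adjust the name to whichever dump file you downloaded):
```sh
# -d decompresses in parallel and removes the .bz2 archive afterwards (add -k to keep it)
$ lbzip2 -d enwiki-latest-pages-articles.xml.bz2
```
Then feed the resulting `enwiki-latest-pages-articles.xml` file into [WikiExtractor](https://github.com/attardi/wikiextractor):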
```sh
$ python3 WikiExtractor.py --no_templates --json enwiki-latest-pages-articles.xml
```
This will output a large directory structure with root directory `text`.
Finally, run `wikipedia-ngrams.jar` with the desired ngram "n" (2 in this example) and the path to the directory output by [WikiExtractor](https://github.com/attardi/wikiextractor):
```sh
$ java -jar wikipedia-ngrams.jar 2 text
```
Note that you may need to increase the maximum heap size and/or disable the GC overhead limit.
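For example, a sketch of the relevant JVM flags (the 8 GB heap is an arbitrary assumption; size it to your machine):
```sh
# -Xmx raises the maximum heap size; -XX:-UseGCOverheadLimit turns off the GC overhead limit check
$ java -Xmx8g -XX:-UseGCOverheadLimit -jar wikipedia-ngrams.jar 2 text
```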
`contexts.txt` and `2-grams.txt` files will be in an `out` directory. `contexts.txt` caches the "sentences" in the Wikipedia data dump. To use this cache in your next run (with n = 3 for example), run the following command:
```sh
$ java -jar wikipedia-ngrams.jar 3 out/contexts.txt
```
The output files will not be sorted. Use a command-line tool like `sort` to do so.
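For instance, a plain lexicographic sort (the exact column layout of `2-grams.txt` is not specified here, so adjust `sort` flags such as `-t`, `-k`, or `-n` to match it):
```sh
# LC_ALL=C gives a fast byte-order sort; redirect to a new file rather than sorting in place
$ LC_ALL=C sort out/2-grams.txt > out/2-grams.sorted.txt
```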
Note that an `OutOfMemoryError` is not a bug in this tool; it is up to the user to allocate enough heap space and to have enough RAM (consider allocating a larger [swap file](https://linuxize.com/post/create-a-linux-swap-file)).
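A minimal sketch of creating and enabling a swap file on Linux, along the lines of the linked guide (the 8 GB size and the `/swapfile` path are assumptions):
```sh
$ sudo fallocate -l 8G /swapfile   # reserve space for the swap file
$ sudo chmod 600 /swapfile         # restrict access to root
$ sudo mkswap /swapfile            # format it as swap space
$ sudo swapon /swapfile            # enable it for the current session
```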
## Dependencies
- [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/index.html)
- [fastutil](http://fastutil.di.unimi.it)
## License
[MIT](https://github.com/TomerAberbach/wikipedia-ngrams/blob/main/license) © [Tomer Aberbach](https://github.com/TomerAberbach)