https://github.com/high-moctane/nextword-data

Dataset for nextword.
english nextword

Host: GitHub
URL: https://github.com/high-moctane/nextword-data
Owner: high-moctane
License: other
Archived: true
Created: 2019-12-11T05:04:41.000Z (about 5 years ago)
Default Branch: master
Last Pushed: 2022-02-19T16:39:16.000Z (about 3 years ago)
Last Synced: 2024-08-01T17:38:40.319Z (7 months ago)
Topics: english, nextword
Size: 531 MB
Stars: 36
Watchers: 3
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt

README

# Nextword-data

## **🎉 NEW PREDICTION ENGINE [MOCWORD](https://github.com/high-moctane/mocword) IS AVAILABLE 🎉**

[Mocword](https://github.com/high-moctane/mocword) is a more advanced engine than Nextword.

- Smaller data files
  - 1.63GB (Nextword) -> 655MB (Mocword)
- Uses the latest Google Ngram dataset
  - 2012 data (Nextword) -> 2020 data (Mocword)
- More appropriate predictions
- Less noisy vocabulary

---

A dataset for nextword.

## Install

0. (Recommended) Star this repository (｀･ω･´)★

1. Visit the [releases](https://github.com/high-moctane/nextword-data/releases) page.

2. Download the `zip` or `tar.gz` archive.

   You can choose the larger or the smaller dataset.

   |       | Zip size | Total size |
   | ----- | -------: | ---------: |
   | Small | 152.2 MB |   493.1 MB |
   | Large | 483.3 MB |    1.63 GB |

3. Decompress the downloaded data.

4. Set the `$NEXTWORD_DATA_PATH` environment variable to the decompressed directory.

   Example:

   ```bash
   export NEXTWORD_DATA_PATH=/path/to/nextword-data
   ```
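A program that consumes the dataset could resolve the directory roughly as follows. This is a minimal sketch, not nextword's actual code: the helper name `resolve_data_dir` is hypothetical, and only the `NEXTWORD_DATA_PATH` variable name comes from the step above.

```python
import os
from pathlib import Path


def resolve_data_dir(environ=os.environ) -> Path:
    """Return the dataset directory named by NEXTWORD_DATA_PATH.

    Hypothetical helper: raises RuntimeError if the variable is unset
    or does not point at an existing directory.
    """
    raw = environ.get("NEXTWORD_DATA_PATH")
    if not raw:
        raise RuntimeError("NEXTWORD_DATA_PATH is not set")
    path = Path(raw).expanduser()
    if not path.is_dir():
        raise RuntimeError(f"{path} is not a directory")
    return path
```

Passing the environment as a parameter keeps the function easy to exercise without touching the real process environment.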

## Uninstall

1. Remove the `$NEXTWORD_DATA_PATH` environment variable.

2. Remove the nextword-data directory.

## Format

```
(n-1)gram tab candidates newline
```

Candidates are sorted by how often they appear, most likely first.

### Example

You can find the line

```
empty milk	bottles carton bottle cartons cans
```

at line 59349 in file `3gram-e.txt`.

This line says that "bottles" is the most likely word after "empty milk", and that "carton" is the next most likely.
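The format can be read with a few lines of Python. The sketch below assumes, based on the example line, that the words of the (n-1)-gram and the candidates are both separated by single spaces; `parse_line` and `lookup` are hypothetical names, not part of nextword.

```python
from typing import Iterable, List, Optional, Tuple


def parse_line(line: str) -> Tuple[Tuple[str, ...], List[str]]:
    """Split one dataset line into its (n-1)-gram key and candidate list."""
    key, _, candidates = line.rstrip("\n").partition("\t")
    return tuple(key.split()), candidates.split()


def lookup(lines: Iterable[str], prefix: str) -> Optional[List[str]]:
    """Scan lines for the given prefix; return its candidates or None."""
    target = tuple(prefix.split())
    for line in lines:
        key, candidates = parse_line(line)
        if key == target:
            return candidates
    return None


sample = ["empty milk\tbottles carton bottle cartons cans\n"]
print(lookup(sample, "empty milk")[:2])  # prints ['bottles', 'carton']
```

The linear scan keeps the sketch short; it is not meant to reflect how nextword itself searches the files.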

## Recipe

1. Fetch data.

   ```
   $ mkdir fetch
   $ nwgen-fetch fetch
   ```

2. Run xonsh script.

   ```xonsh
   dstdir = "dstdir"

   mkdir -p @(dstdir)/format
   mkdir -p @(dstdir)/concat

   ls fetch | grep 1gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 10000 @(dstdir)/format/fname fetch/fname
   ls fetch | grep 2gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 2000 @(dstdir)/format/fname fetch/fname
   ls fetch | grep 3gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 400 @(dstdir)/format/fname fetch/fname
   ls fetch | grep 4gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 300 @(dstdir)/format/fname fetch/fname
   ls fetch | grep 5gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 200 @(dstdir)/format/fname fetch/fname

   nwgen-concat @(dstdir)/concat/1gram.txt.gz @(dstdir)/format/1gram*
   for n in [2,3,4,5]:
       for c in [chr(i) for i in range(97, 97+26)]:
           nwgen-concat @(dstdir)/concat/@(n)gram-@(c).txt.gz @(dstdir)/format/@(n)gram-@(c)*

   cp -R @(dstdir)/concat @(dstdir)/data
   gunzip @(dstdir)/data/*
   ```
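To make the resulting layout concrete, the following Python sketch lists the file names the script should leave in `dstdir/data` after `gunzip`, and guesses which file holds a given prefix. Both function names are hypothetical. `shard_for` rests on an assumption: that a file's letter is the first letter of the prefix's first word, as the `3gram-e.txt` example for "empty milk" suggests. It does not handle prefixes that start with a non-letter.

```python
import string
from typing import List


def output_names() -> List[str]:
    """File names expected in dstdir/data after the final gunzip step."""
    names = ["1gram.txt"]
    for n in (2, 3, 4, 5):
        # Same letters as chr(97)..chr(122) in the script.
        for c in string.ascii_lowercase:
            names.append(f"{n}gram-{c}.txt")
    return names


def shard_for(prefix: str) -> str:
    """Guess the file holding the line for an (n-1)-gram prefix.

    Assumption: files are split by the first letter of the first word.
    """
    words = prefix.split()
    return f"{len(words) + 1}gram-{words[0][0].lower()}.txt"


print(len(output_names()))      # prints 105
print(shard_for("empty milk"))  # prints 3gram-e.txt
```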

## Notice

Nextword-data is based on [Google Books Ngram Viewer English Version 20120701](http://storage.googleapis.com/books/ngrams/books/datasetsv2.html), which is distributed under a [Creative Commons Attribution 3.0 Unported](http://creativecommons.org/licenses/by/3.0/) license.

See [NOTICE.txt](NOTICE.txt).

## License

Nextword-data is distributed under a [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/) license.

See [LICENSE.txt](LICENSE.txt).
