https://github.com/ccoreilly/generate-n-gram-lm
- Host: GitHub
- URL: https://github.com/ccoreilly/generate-n-gram-lm
- Owner: ccoreilly
- License: apache-2.0
- Created: 2021-06-16T21:05:24.000Z
- Default Branch: master
- Last Pushed: 2021-08-16T11:44:08.000Z
- Last Synced: 2025-06-19T04:06:46.932Z
- Language: Python
- Size: 8.79 KB
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: COPYING
README
# Generate n-gram LM
Simple tooling for generating n-gram language models with KenLM.
A 4-gram LM for the Catalan language built with this tooling is available on [Zenodo](https://zenodo.org/record/4977061).
## Building the Docker Image
```sh
docker build . -f Dockerfile -t kenlm
```
## Building a language model
```sh
docker run -it --rm -v `pwd`:/io -w /io kenlm python generate_lm.py --input_txt catalan_textual_corpus.txt \
  --output_dir . --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 \
  --binary_type trie
```
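
The source of `generate_lm.py` is not shown here, so the following is only a sketch of how the options above could map onto KenLM's two command-line tools, `lmplz` (ARPA estimation) and `build_binary` (binary conversion). It assumes the script is a thin wrapper around those tools. The helper names `lmplz_args` and `build_binary_args` and the file names `lm.arpa` and `lm.binary` are hypothetical and chosen for illustration.

```python
from typing import List


def lmplz_args(corpus: str, arpa_path: str, order: int,
               memory: str, prune: str) -> List[str]:
    """Build an argument list for KenLM's lmplz (hypothetical helper).

    The '|'-separated prune string is split into one threshold per n-gram order.
    """
    return [
        "lmplz",
        "--order", str(order),      # assumed to correspond to --arpa_order
        "--memory", memory,         # assumed to correspond to --max_arpa_memory
        "--text", corpus,           # assumed to correspond to --input_txt
        "--arpa", arpa_path,        # hypothetical output file name
        "--prune", *prune.split("|"),
    ]


def build_binary_args(arpa_path: str, binary_path: str, a_bits: int,
                      q_bits: int, binary_type: str) -> List[str]:
    """Build an argument list for KenLM's build_binary (hypothetical helper)."""
    return [
        "build_binary",
        "-a", str(a_bits),          # assumed to correspond to --binary_a_bits
        "-q", str(q_bits),          # assumed to correspond to --binary_q_bits
        binary_type,                # data structure type, e.g. "trie"
        arpa_path,
        binary_path,
    ]


if __name__ == "__main__":
    # Values taken from the docker run example above; file names are assumptions.
    print(" ".join(lmplz_args("catalan_textual_corpus.txt", "lm.arpa",
                              5, "85%", "0|0|1")))
    print(" ".join(build_binary_args("lm.arpa", "lm.binary", 255, 8, "trie")))
```

Under this reading, `--arpa_prune "0|0|1"` becomes `--prune 0 0 1`: all unigrams and bigrams are kept, n-grams of order 3 and above that occur only once are dropped, and the last threshold applies to the remaining higher orders.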