{"id":18400410,"url":"https://github.com/kmario23/kenlm-training","last_synced_at":"2025-04-07T06:33:24.599Z","repository":{"id":44595426,"uuid":"152365685","full_name":"kmario23/KenLM-training","owner":"kmario23","description":"Training an n-gram based Language Model using KenLM toolkit for Deep Speech 2","archived":false,"fork":false,"pushed_at":"2019-05-20T11:18:45.000Z","size":6,"stargazers_count":114,"open_issues_count":6,"forks_count":21,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-03-22T14:07:50.822Z","etag":null,"topics":["automatic-speech-recognition","deep-neural-networks","deep-speech","kenlm","kenlm-toolkit","language-model","language-modeling","natural-language-processing","probabilistic-models","python","speech-recognition"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kmario23.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-10-10T05:01:58.000Z","updated_at":"2025-02-08T04:58:24.000Z","dependencies_parsed_at":"2022-08-27T02:39:46.976Z","dependency_job_id":null,"html_url":"https://github.com/kmario23/KenLM-training","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kmario23%2FKenLM-training","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kmario23%2FKenLM-training/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kmario23%2FKenLM-training/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kmario23%2FKenLM-training/mani
fests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kmario23","download_url":"https://codeload.github.com/kmario23/KenLM-training/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247607729,"owners_count":20965945,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automatic-speech-recognition","deep-neural-networks","deep-speech","kenlm","kenlm-toolkit","language-model","language-modeling","natural-language-processing","probabilistic-models","python","speech-recognition"],"created_at":"2024-11-06T02:32:46.658Z","updated_at":"2025-04-07T06:33:24.593Z","avatar_url":"https://github.com/kmario23.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"## KenLM\nKenLM performs interpolated modified Kneser Ney Smoothing for estimating the n-gram probabilities.\n\n--------\n\n### Step-by-step guide for training an n-gram based Language Model using [KenLM toolkit](https://kheafield.com/code/kenlm/estimation/)\n\n## 1) Installing KenLM dependencies\nBefore installing KenLM toolkit, you should install all the dependencies which can be found in [kenlm-dependencies](https://kheafield.com/code/kenlm/dependencies/).\n\n**For Debian/Ubuntu distro**:\n\nTo get a working compiler, install the `build-essential` package. [Boost](https://www.boost.org/) is known as `libboost-all-dev`. 
The three supported compression options each have a separate dev package.\n\n    $ sudo apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev\n\n## 2) Installing KenLM toolkit\nIt's recommended to install into a *conda or virtualenv* virtual environment. With conda, you can create one using:\n\n    $ conda create -n kenlm_deepspeech python=3.6 nltk\n\nThen activate the environment using:\n\n    $ source activate kenlm_deepspeech\n\nNow we're ready to install kenlm. Let's first clone the kenlm repo:\n\n    $ git clone --recursive https://github.com/vchahun/kenlm.git\n\nAnd then compile the LM estimation code using:\n\n    $ cd kenlm\n    $ ./bjam\n\nAs a final step, optionally, install the Python module using:\n\n    $ python setup.py install\n\nThe commands in the rest of this guide refer to `./kenlm/bin/`, so return to the parent directory afterwards:\n\n    $ cd ..\n\n\n## 3) Training a Language Model\n\nFirst let's get some training data. Here, I'll use the Bible:\n\n    $ wget -c https://github.com/vchahun/notes/raw/data/bible/bible.en.txt.bz2\n\nNext we need a simple preprocessing script, because:\n\n- the training text should be a single text or compressed file (e.g. `.bz2`) with one sentence per line.\n- it needs to be tokenized and lowercased before being fed into kenlm.\n\nSo, create a simple script `preprocess.py` with the following lines:\n\n```python\n# Requires NLTK's Punkt tokenizer data (e.g. python -m nltk.downloader punkt).\nimport sys\nimport nltk\n\n# Read raw text from stdin and print one tokenized, lowercased sentence per line.\nfor line in sys.stdin:\n    for sentence in nltk.sent_tokenize(line):\n        print(' '.join(nltk.word_tokenize(sentence)).lower())\n```\n\nAs a sanity check, run:\n\n    $ bzcat bible.en.txt.bz2 | python preprocess.py | wc\n\nand confirm that it prints line, word, and byte counts without errors.\n\n
Now we can train the model. To train a *trigram model* with Kneser-Ney smoothing, use:\n\n    # -o sets the order of the model, i.e. the `n` in n-gram\n    $ bzcat bible.en.txt.bz2 |\\\n      python preprocess.py |\\\n      ./kenlm/bin/lmplz -o 3 \u003e bible.arpa\n\nThe above command first pipes the data through the preprocessing script, which performs tokenization and lowercasing. The tokenized and lowercased text is then piped to the `lmplz` program, which performs the estimation.\n\nIt should finish in a couple of seconds and generate an ARPA file `bible.arpa`. You can inspect the ARPA file using something like `less` or `more` (e.g. `$ less bible.arpa`). At the very beginning, it should have a *data section* with unigram, bigram, and trigram counts, followed by the estimated values.\n\n\n#### Binarizing the model\n\nARPA files can be read directly, but the binary format loads much faster and exposes more configuration options. For these reasons, we will binarize the model using:\n\n    $ ./kenlm/bin/build_binary bible.arpa bible.binary\n\nNote that, unlike IRSTLM, the file extension does not matter; the binary format is recognized using magic bytes.\n\nBy default, `build_binary` uses the `probing` data structure. To use the more memory-efficient `trie` data structure instead, run:\n\n    $ ./kenlm/bin/build_binary trie bible.arpa bible.binary\n\n----------------------\n\n
### Using the model (i.e. scoring sentences)\n\nNow that we have a Language Model, we can *score* sentences. This is easy to do using the Python interface. Below is an example:\n\n```python\nimport kenlm\n\n# Load the binarized model and print the log10 probability of a sentence.\nmodel = kenlm.LanguageModel('bible.binary')\nprint(model.score('in the beginning was the word'))\n```\n\nThen, you might get a score such as:\n\n    -15.03003978729248\n\n\n---------------\n\n#### References:\n1) http://www.statmt.org/moses/?n=FactoredTraining.BuildingLanguageModel\n2) http://victor.chahuneau.fr/notes/2012/07/03/kenlm.html\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkmario23%2Fkenlm-training","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkmario23%2Fkenlm-training","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkmario23%2Fkenlm-training/lists"}