{"id":13699525,"url":"https://github.com/anyks/alm","last_synced_at":"2025-04-28T22:30:57.622Z","repository":{"id":57410700,"uuid":"240539294","full_name":"anyks/alm","owner":"anyks","description":"Smart Language Model","archived":false,"fork":false,"pushed_at":"2022-12-21T16:40:58.000Z","size":2061,"stargazers_count":46,"open_issues_count":0,"forks_count":5,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-04-05T11:23:32.907Z","etag":null,"topics":["alm","arpa","cpp","language-models","tokenization","tokenizer","vocab-pruning"],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/anyks.png","metadata":{"files":{"readme":"README.md","changelog":"ChangeLog.md","contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE.MIT","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null},"funding":{"github":null,"custom":"https://www.paypal.me/anyks"}},"created_at":"2020-02-14T15:29:19.000Z","updated_at":"2025-01-13T05:52:05.000Z","dependencies_parsed_at":"2023-01-30T04:15:15.881Z","dependency_job_id":null,"html_url":"https://github.com/anyks/alm","commit_stats":null,"previous_names":[],"tags_count":68,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anyks%2Falm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anyks%2Falm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anyks%2Falm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anyks%2Falm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/anyks","download_url":"https://codeload.github.com/anyks/alm/tar.gz/refs/heads/master","host":{"name":"GitHub","u
rl":"https://github.com","kind":"github","repositories_count":251397576,"owners_count":21583034,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alm","arpa","cpp","language-models","tokenization","tokenizer","vocab-pruning"],"created_at":"2024-08-02T20:00:35.528Z","updated_at":"2025-04-28T22:30:56.162Z","avatar_url":"https://github.com/anyks.png","language":"C++","readme":"[![ANYKS Smart language model](https://raw.githubusercontent.com/anyks/alm/master/site/img/banner.jpg)](https://anyks.com)\n\n# ANYKS LM (ALM) C++11\n\n- [Project goals and features](https://github.com/anyks/alm/#project-goals-and-features)\n- [Requirements](https://github.com/anyks/alm/#requirements)\n- [To build and launch the project](https://github.com/anyks/alm/#to-build-and-launch-the-project)\n    - [Python version ALM](https://github.com/anyks/alm/#python-version-alm)\n    - [To clone the project](https://github.com/anyks/alm/#to-clone-the-project)\n    - [Build on Linux and FreeBSD](https://github.com/anyks/alm/#build-on-linux-and-freebsd)\n    - [Build on MacOS X](https://github.com/anyks/alm/#build-on-macos-x)\n- [Files formats](https://github.com/anyks/alm/#file-formats)\n    - [ARPA](https://github.com/anyks/alm/#arpa)\n    - [Ngrams](https://github.com/anyks/alm/#ngrams)\n    - [Vocab](https://github.com/anyks/alm/#vocab)\n    - [Map](https://github.com/anyks/alm/#map)\n    - [File of adding n-gram into existing ARPA file](https://github.com/anyks/alm/#file-of-adding-n-gram-into-existing-arpa-file)\n    - [File of changing n-gram frequency in existing ARPA 
file](https://github.com/anyks/alm/#file-of-changing-n-gram-frequency-in-existing-arpa-file)\n    - [File of replacing n-gram in existing ARPA file](https://github.com/anyks/alm/#file-of-replacing-n-gram-in-existing-arpa-file)\n    - [File of similar letters in different dictionaries](https://github.com/anyks/alm/#file-of-similar-letters-in-different-dictionaries)\n    - [File of removing n-gram from existing ARPA file](https://github.com/anyks/alm/#file-of-removing-n-gram-from-existing-arpa-file)\n    - [File of abbreviations list words](https://github.com/anyks/alm/#file-of-abbreviations-list-words)\n    - [File of domain zones list](https://github.com/anyks/alm/#file-of-domain-zones-list)\n    - [Binary container metadata](https://github.com/anyks/alm/#binary-container-metadata)\n    - [The python script format to preprocess the received words](https://github.com/anyks/alm/#the-python-script-format-to-preprocess-the-received-words)\n    - [The python script format to define the word features](https://github.com/anyks/alm/#the-python-script-format-to-define-the-word-features)\n- [Environment variables](https://github.com/anyks/alm/#environment-variables)\n- [Examples](https://github.com/anyks/alm/#examples)\n    - [Language Model training example](https://github.com/anyks/alm/#language-model-training-example)\n    - [ARPA patch example](https://github.com/anyks/alm/#arpa-patch-example)\n    - [Example of removing n-grams with a frequency lower than backoff](https://github.com/anyks/alm/#example-of-removing-n-grams-with-a-frequency-lower-than-backoff)\n    - [Example of merge raw data](https://github.com/anyks/alm/#example-of-merge-raw-data)\n    - [ARPA pruning example](https://github.com/anyks/alm/#arpa-pruning-example)\n    - [Vocab pruning example](https://github.com/anyks/alm/#vocab-pruning-example)\n    - [An example of detecting and correcting words consisting of mixed 
dictionaries](https://github.com/anyks/alm/#an-example-of-detecting-and-correcting-words-consisting-of-mixed-dictionaries)\n    - [Binary container information](https://github.com/anyks/alm/#binary-container-information)\n    - [ARPA modification example](https://github.com/anyks/alm/#arpa-modification-example)\n    - [Training with preprocessing of received words](https://github.com/anyks/alm/#training-with-preprocessing-of-received-words)\n    - [Training using your own features](https://github.com/anyks/alm/#training-using-your-own-features)\n    - [Example of disabling token identification](https://github.com/anyks/alm/#example-of-disabling-token-identification)\n    - [An example of identifying tokens as 〈unk〉](https://github.com/anyks/alm/#an-example-of-identifying-tokens-as-unk)\n    - [Training using whitelist](https://github.com/anyks/alm/#training-using-whitelist)\n    - [Training using blacklist](https://github.com/anyks/alm/#training-using-blacklist)\n    - [Training with an unknown word](https://github.com/anyks/alm/#training-with-an-unknown-word)\n    - [Text tokenization](https://github.com/anyks/alm/#text-tokenization)\n    - [Perplexity calculation](https://github.com/anyks/alm/#perplexity-calculation)\n    - [Checking context in text](https://github.com/anyks/alm/#checking-context-in-text)\n    - [Fix words case](https://github.com/anyks/alm/#fix-words-case)\n    - [Check counts ngrams](https://github.com/anyks/alm/#check-counts-ngrams)\n    - [Search ngrams by text](https://github.com/anyks/alm/#search-ngrams-by-text)\n    - [Sentences generation](https://github.com/anyks/alm/#sentences-generation)\n    - [Mixing language models](https://github.com/anyks/alm/#mixing-language-models)\n- [License](https://github.com/anyks/alm/#license)\n- [Contact](https://github.com/anyks/alm/#contact-info)\n\n## Project goals and features\n\nThere are many toolkits capable of creating language models: ([KenLM](https://github.com/kpu/kenlm), 
[SriLM](https://github.com/BitMindLab/SRILM), [IRSTLM](https://github.com/irstlm-team/irstlm)), and each of those toolkits may have a reason to exist. But our language model creation toolkit has the following goals and features:\n\n- **UTF-8 support**: Full UTF-8 support without third-party dependencies.\n- **Support of many data formats**: ARPA, Vocab, Map Sequence, N-grams, Binary alm dictionary.\n- **Smoothing algorithms**: Kneser-Ney, Modified Kneser-Ney, Witten-Bell, Additive, Good-Turing, Absolute discounting.\n- **Normalisation and preprocessing for corpora**: Transferring the corpus to lowercase, smart tokenization, ability to create black- and white-lists for n-grams.\n- **ARPA modification**: Replacing frequencies and n-grams, adding new n-grams with frequencies, removing n-grams.\n- **Pruning**: N-gram removal based on specified criteria.\n- **Removal of low-probability n-grams**: Removal of n-grams whose backoff probability is higher than their standard probability.\n- **ARPA recovery**: Recovery of damaged n-grams in ARPA with subsequent recalculation of their backoff probabilities.\n- **Support of additional word features**: Feature extraction (numbers, Roman numerals, ranges of numbers, numeric abbreviations, any other custom attributes) using scripts written in Python3.\n- **Text preprocessing**: Unlike all other language model toolkits, ALM can extract correct context from files with unnormalized texts.\n- **Unknown word token accounting**: Accounting of the 〈unk〉 token as a full n-gram.\n- **Redefinition of the 〈unk〉 token**: Ability to redefine an attribute of an unknown token.\n- **N-grams preprocessing**: Ability to pre-process n-grams before adding them to ARPA using custom Python3 scripts.\n- **Binary container for Language Models**: The binary container supports compression, encryption and embedding of copyright information.\n- **Convenient visualization of the Language model assembly process**: ALM implements several types of visualizations: textual, graphic, 
process indicator, and logging to files or console.\n- **Collection of all n-grams**: Unlike other language model toolkits, ALM is guaranteed to extract all possible n-grams from the corpus, regardless of their length (except for Modified Kneser-Ney); you can also force all n-grams to be taken into account even if they occurred only once.\n\n## Requirements\n\n- [Zlib](http://www.zlib.net)\n- [OpenSSL](https://www.openssl.org)\n- [GperfTools](https://github.com/gperftools/gperftools)\n- [Python3](https://www.python.org/download/releases/3.0)\n- [NLohmann::json](https://github.com/nlohmann/json)\n- [BigInteger](http://mattmccutchen.net/bigint)\n\n## To build and launch the project\n\n### Python version ALM\n```bash\n$ python3 -m pip install pybind11\n$ python3 -m pip install anyks-lm\n```\n\n[pip documentation](https://pypi.org/project/anyks-lm)\n\n### To clone the project\n\n```bash\n$ git clone --recursive https://github.com/anyks/alm.git\n```\n\n### Build third party\n```bash\n$ ./build_third_party.sh\n```\n\n### Build on Linux/MacOS X and FreeBSD\n\n```bash\n$ mkdir ./build\n$ cd ./build\n\n$ cmake ..\n$ make\n```\n\n## File formats\n\n### ARPA\n```\n\\data\\\nngram 1=52\nngram 2=68\nngram 3=15\n\n\\1-grams:\n-1.807052\t1-й\t-0.30103\n-1.807052\t2\t-0.30103\n-1.807052\t3~4\t-0.30103\n-2.332414\tкак\t-0.394770\n-3.185530\tпосле\t-0.311249\n-3.055896\tтого\t-0.441649\n-1.150508\t\u003c/s\u003e\n-99\t\u003cs\u003e\t-0.3309932\n-2.112406\t\u003cunk\u003e\n-1.807052\tT358\t-0.30103\n-1.807052\tVII\t-0.30103\n-1.503878\tГрека\t-0.39794\n-1.807052\tГреку\t-0.30103\n-1.62953\tЕхал\t-0.30103\n...\n\n\\2-grams:\n-0.29431\t1-й передал\n-0.29431\t2 ложки\n-0.29431\t3~4 дня\n-0.8407791\t\u003cs\u003e Ехал\n-1.328447\tпосле того\t-0.477121\n...\n\n\\3-grams:\n-0.09521468\tрак на руке\n-0.166590\tпосле того как\n...\n\n\\end\\\n```\n\n| Frequency             | N-gram                       | Reverse frequency          
|\n|-----------------------|------------------------------|----------------------------|\n| -1.328447             | после того                   | -0.477121                  |\n\n#### Description:\n - **〈s〉** - Sentence beginning token\n - **〈/s〉** - Sentence end token\n - **〈url〉** - URL-address token\n - **〈num〉** - Number (arabic or roman) token\n - **〈unk〉** - Unknown word token\n - **〈time〉** - Time token **(15:44:56)**\n - **〈score〉** - Score count token **(4:3 | 01:04)**\n - **〈fract〉** - Fraction token **(5/20 | 192/864)**\n - **〈date〉** - Date token **(18.07.2004 | 07/18/2004)**\n - **〈abbr〉** - Abbreviation token **(1-й | 2-е | 20-я)**\n - **〈dimen〉** - Dimensions token **(200x300 | 1920x1080)**\n - **〈range〉** - Range of numbers token **(1-2 | 100-200 | 300-400)**\n - **〈aprox〉** - Approximate number token (**~93** | **~95.86** | **10~20**)\n - **〈anum〉** - Pseudo-number token (combination of numbers and other symbols) **(T34 | 895-M-86 | 39km)**\n - **〈pcards〉** - Symbols of the play cards **(♠ | ♣ | ♥ | ♦ )**\n - **〈punct〉** - Punctuation token **(. | , | ? | ! 
| : | ; | … | ¡ | ¿)**\n - **〈route〉** - Direction symbols (arrows) **(← | ↑ | ↓ | ↔ | ↵ | ⇐ | ⇑ | ⇒ | ⇓ | ⇔ | ◄ | ▲ | ► | ▼)**\n - **〈greek〉** - Symbols of the Greek alphabet **(Α | Β | Γ | Δ | Ε | Ζ | Η | Θ | Ι | Κ | Λ | Μ | Ν | Ξ | Ο | Π | Ρ | Σ | Τ | Υ | Φ | Χ | Ψ | Ω)**\n - **〈isolat〉** - Isolation/quotation token **(( | ) | [ | ] | { | } | \" | « | » | „ | “ | ` | ⌈ | ⌉ | ⌊ | ⌋ | ‹ | › | ‚ | ’ | ′ | ‛ | ″ | ‘ | ” | ‟ | ' |〈 | 〉)**\n - **〈specl〉** - Special character token **(_ | @ | # | № | © | ® | \u0026 | ¦ | § | æ | ø | Þ | – | ‾ | ‑ | — | ¯ | ¶ | ˆ | ˜ | † | ‡ | • | ‰ | ⁄ | ℑ | ℘ | ℜ | ℵ | ◊ | \\ )**\n - **〈currency〉** - Symbols of world currencies **($ | € | ₽ | ¢ | £ | ₤ | ¤ | ¥ | ℳ | ₣ | ₴ | ₸ | ₹ | ₩ | ₦ | ₭ | ₪ | ৳ | ƒ | ₨ | ฿ | ₫ | ៛ | ₮ | ₱ | ﷼ | ₡ | ₲ | ؋ | ₵ | ₺ | ₼ | ₾ | ₠ | ₧ | ₯ | ₢ | ₳ | ₥ | ₰ | ₿ | ұ)**\n - **〈math〉** - Mathematical operation token **(+ | - | = | / | * | ^ | × | ÷ | − | ∕ | ∖ | ∗ | √ | ∝ | ∞ | ∠ | ± | ¹ | ² | ³ | ½ | ⅓ | ¼ | ¾ | % | ~ | · | ⋅ | ° | º | ¬ | ƒ | ∀ | ∂ | ∃ | ∅ | ∇ | ∈ | ∉ | ∋ | ∏ | ∑ | ∧ | ∨ | ∩ | ∪ | ∫ | ∴ | ∼ | ≅ | ≈ | ≠ | ≡ | ≤ | ≥ | ª | ⊂ | ⊃ | ⊄ | ⊆ | ⊇ | ⊕ | ⊗ | ⊥ | ¨)**\n\n---\n\n### Ngrams\n```\n\\data\\\nad=1\ncw=23832\nunq=9390\n\nngram 1=9905\nngram 2=21907\nngram 3=306\n\n\\1-grams:\n\u003cs\u003e\t2022 | 1\n\u003cnum\u003e\t117 | 1\n\u003cunk\u003e\t19 | 1\n\u003cabbr\u003e\t16 | 1\n\u003crange\u003e\t7 | 1\n\u003c/s\u003e\t2022 | 1\nА\t244 | 1\nа\t244 | 1\nб\t11 | 1\nв\t762 | 1\nвыборах\t112 | 1\nобзорах\t224 | 1\nполовозрелые\t1 | 1\nнебесах\t86 | 1\nизобретали\t978 | 1\nяблочную\t396 | 1\nджинсах\t108 | 1\nклассах\t77 | 1\nтрассах\t32 | 1\n...\n\n\\2-grams:\n\u003cs\u003e \u003cnum\u003e\t7 | 1\n\u003cs\u003e \u003cunk\u003e\t1 | 1\n\u003cs\u003e а\t84 | 1\n\u003cs\u003e в\t83 | 1\n\u003cs\u003e и\t57 | 1\nи классные\t82 | 1\nи валютные\t11 | 1\nи несправедливости\t24 | 1\nснилось являлось\t18 | 1\nнашлось никого\t31 | 1\nсоответственно вы\t45 | 1\nсоответственно дома\t97 | 1\nсоответственно 
наша\t71 | 1\n...\n\n\\3-grams:\n\u003cs\u003e \u003cnum\u003e \u003c/s\u003e\t3 | 1\n\u003cs\u003e а в\t6 | 1\n\u003cs\u003e а я\t4 | 1\n\u003cs\u003e а на\t2 | 1\n\u003cs\u003e а то\t3 | 1\nможно и нужно\t2 | 1\nбудет хорошо \u003c/s\u003e\t2 | 1\nпейзажи за окном\t2 | 1\nстатусы для одноклассников\t2 | 1\nтолько в одном\t2 | 1\nработа связана с\t2 | 1\nговоря про то\t2 | 1\nотбеливания зубов \u003c/s\u003e\t2 | 1\nпродолжение следует \u003c/s\u003e\t3 | 1\nпрепараты от варикоза\t2 | 1\n...\n\n\\end\\\n```\n\n| N-gram                | Occurrence in corpus         | Occurrence in documents    |\n|-----------------------|------------------------------|----------------------------|\n| только в одном        | 2                            | 1                          |\n\n#### Description:\n\n- **ad** - The number of documents in corpus\n- **cw** - The number of words in all documents\n- **unq** - The number of unique words collected in corpus\n\n---\n\n### Vocab\n```\n\\data\\\nad=1\ncw=23832\nunq=9390\n\n\\words:\n33\tа\t244 | 1 | 0.010238 | 0.000000 | -3.581616\n34\tб\t11 | 1 | 0.000462 | 0.000000 | -6.680889\n35\tв\t762 | 1 | 0.031974 | 0.000000 | -2.442838\n40\tж\t12 | 1 | 0.000504 | 0.000000 | -6.593878\n330344\tбыл\t47 | 1 | 0.001972 | 0.000000 | -5.228637\n335190\tвам\t17 | 1 | 0.000713 | 0.000000 | -6.245571\n335192\tдам\t1 | 1 | 0.000042 | 0.000000 | -9.078785\n335202\tнам\t22 | 1 | 0.000923 | 0.000000 | -5.987742\n335206\tсам\t7 | 1 | 0.000294 | 0.000000 | -7.132874\n335207\tтам\t29 | 1 | 0.001217 | 0.000000 | -5.711489\n2282019644\tпохожесть\t1 | 1 | 0.000042 | 0.000000 | -9.078785\n2282345502\tновый\t10 | 1 | 0.000420 | 0.000000 | -6.776199\n2282416889\tбелый\t2 | 1 | 0.000084 | 0.000000 | -8.385637\n3009239976\tгражданский\t1 | 1 | 0.000042 | 0.000000 | -9.078785\n3009763109\tбанкиры\t1 | 1 | 0.000042 | 0.000000 | -9.078785\n3013240091\tгеныч\t1 | 1 | 0.000042 | 0.000000 | -9.078785\n3014009989\tпреступлениях\t1 | 1 | 0.000042 | 0.000000 | 
-9.078785\n3015727462\tтысяч\t2 | 1 | 0.000084 | 0.000000 | -8.385637\n3025113549\tпозаботьтесь\t1 | 1 | 0.000042 | 0.000000 | -9.078785\n3049820849\tкомментарием\t1 | 1 | 0.000042 | 0.000000 | -9.078785\n3061388599\tкомпьютерная\t1 | 1 | 0.000042 | 0.000000 | -9.078785\n3063804798\tшаблонов\t1 | 1 | 0.000042 | 0.000000 | -9.078785\n3071212736\tзавидной\t1 | 1 | 0.000042 | 0.000000 | -9.078785\n3074971025\tхолодной\t1 | 1 | 0.000042 | 0.000000 | -9.078785\n3075044360\tвыходной\t1 | 1 | 0.000042 | 0.000000 | -9.078785\n3123271427\tделаешь\t1 | 1 | 0.000042 | 0.000000 | -9.078785\n3123322362\tчитаешь\t1 | 1 | 0.000042 | 0.000000 | -9.078785\n3126399411\tготовится\t1 | 1 | 0.000042 | 0.000000 | -9.078785\n...\n```\n\n| Word Id               | Word      | Occurrence in corpus       | Occurrence in documents    | tf       | tf-idf   | wltf      |\n|-----------------------|-----------|----------------------------|----------------------------|----------|----------|-----------|\n| 2282345502            | новый     | 10                         | 1                          | 0.000420 | 0.000000 | -6.776199 |\n\n#### Description:\n\n- **oc** - Occurrence in corpus\n- **dc** - Occurrence in documents\n- **tf** - Term frequency — the ratio of a word occurrence to the total number of words in a document. 
Thus, the importance of a word is evaluated within a single document, calculation formula is: [tf = oc / cw]\n- **idf** - Inverse document frequency for word, calculation formula: [idf = log(ad / dc)]\n- **tf-idf** - It's calculated by the formula: [tf-idf = tf * idf]\n- **wltf** - Word rating, calculation formula: [wltf = 1 + log(tf * dc)]\n\n---\n\n### Map\n```\n1:{2022,1,0}|42:{57,1,0}|279603:{2,1,0}\n1:{2022,1,0}|42:{57,1,0}|320749:{2,1,0}\n1:{2022,1,0}|42:{57,1,0}|351283:{2,1,0}\n1:{2022,1,0}|42:{57,1,0}|379815:{3,1,0}\n1:{2022,1,0}|42:{57,1,0}|26122748:{3,1,0}\n1:{2022,1,0}|44:{6,1,0}\n1:{2022,1,0}|48:{1,1,0}\n1:{2022,1,0}|51:{11,1,0}|335967:{3,1,0}\n1:{2022,1,0}|53:{14,1,0}|371327:{3,1,0}\n1:{2022,1,0}|53:{14,1,0}|40260976:{7,1,0}\n1:{2022,1,0}|65:{68,1,0}|34:{2,1,0}\n1:{2022,1,0}|65:{68,1,0}|3277:{3,1,0}\n1:{2022,1,0}|65:{68,1,0}|278003:{2,1,0}\n1:{2022,1,0}|65:{68,1,0}|320749:{2,1,0}\n1:{2022,1,0}|65:{68,1,0}|11353430797:{2,1,0}\n1:{2022,1,0}|65:{68,1,0}|34270133320:{2,1,0}\n1:{2022,1,0}|65:{68,1,0}|51652356484:{2,1,0}\n1:{2022,1,0}|65:{68,1,0}|66967237546:{2,1,0}\n1:{2022,1,0}|2842:{11,1,0}|42:{7,1,0}\n...\n```\n\n\u003e This file is for technical use only. 
In combination with the **[vocab](https://github.com/anyks/alm#vocab)** file, you can combine several language models, modify, store, distribute and extract any formats ([ARPA](https://github.com/anyks/alm#arpa), [ngrams](https://github.com/anyks/alm#ngrams), [vocab](https://github.com/anyks/alm#vocab), [alm](https://github.com/anyks/alm#binary-container-metadata)).\n\n---\n\n### File of adding n-gram into existing ARPA file\n```\n-3.002006\tСША\n-1.365296\tграниц США\n-0.988534\tу границ США\n-1.759398\tзамуж за\n-0.092796\tсобираюсь замуж за\n-0.474876\tи тоже\n-19.18453\tможно и тоже\n...\n```\n\n| N-gram frequency      | Separator   | N-gram       |\n|-----------------------|-------------|--------------|\n| -0.988534             | \\t          | у границ США |\n\n---\n\n### File of changing n-gram frequency in existing ARPA file\n```\n-0.6588787\tполучайте удовольствие \u003c/s\u003e\n-0.6588787\tтолько в одном\n-0.6588787\tработа связана с\n-0.6588787\tмужчины и женщины\n-0.6588787\tговоря про то\n-0.6588787\tпотому что я\n-0.6588787\tпотому что это\n-0.6588787\tработу потому что\n-0.6588787\tпейзажи за окном\n-0.6588787\tстатусы для одноклассников\n-0.6588787\tвообще не хочу\n...\n```\n\n| N-gram frequency      | Separator   | N-gram            |\n|-----------------------|-------------|-------------------|\n| -0.6588787            | \\t          | мужчины и женщины |\n\n---\n\n### File of replacing n-gram in existing ARPA file\n```\nкоем случае нельзя\tтам да тут\nно тем не\tда ты что\nнеожиданный у\tожидаемый к\nв СМИ\tв ФСБ\nШах\tМат\n...\n```\n\n| Existing N-gram       | Separator   | New N-gram        |\n|-----------------------|-------------|-------------------|\n| но тем не             | \\t          | да ты что         |\n\n---\n\n### File of removing n-gram from existing ARPA file\n```\nну то есть\nну очень большой\nбы было если\nмы с ней\nты смеешься над\nдва года назад\nнад тем что\nили еще что-то\nкак я понял\nкак ни удивительно\nкак вы знаете\nтак 
и не\nвсе-таки права\nвсе-таки болят\nвсе-таки сдохло\nвсе-таки встала\nвсе-таки решился\nуже\nмне\nмое\nвсе\n...\n```\n\n---\n\n### File of similar letters in different dictionaries\n```\np  р\nc  с\no  о\nt  т\nk  к\ne  е\na  а\nh  н\nx  х\nb  в\nm  м\n...\n```\n\n| Letter for search | Separator | Letter for replace |\n|-------------------|-----------|--------------------|\n| t                 | \\t        | т                  |\n\n---\n\n### File of abbreviations list words\n```\nг\nр\nСША\nул\nруб\nрус\nчел\n...\n```\n\n\u003e All words from this list will be identified as the abbreviation token **〈abbr〉**.\n\n---\n\n### File of domain zones list\n```\nru\nsu\ncc\nnet\ncom\norg\ninfo\n...\n```\n\n\u003e For more accurate identification of the **〈url〉** token, you should add your own domain zones (all domain zones in the example are already pre-installed).\n\n---\n\n### The python script format to preprocess the received words\n```python\n# -*- coding: utf-8 -*-\n\ndef init():\n    \"\"\"\n    Initialization Method: Runs only once at application startup\n    \"\"\"\n\ndef run(word, context):\n    \"\"\"\n    Processing start method: starts when a word is extracted from text\n    @word    word for processing\n    @context sequence of previous words as an array\n    \"\"\"\n    return word\n```\n\n---\n\n### The python script format to define the word features\n```python\n# -*- coding: utf-8 -*-\n\ndef init():\n    \"\"\"\n    Initialization Method: Runs only once at application startup\n    \"\"\"\n\ndef run(token, word):\n    \"\"\"\n    Processing start method: starts when a word is extracted from text\n    @token word token name\n    @word  word for processing\n    \"\"\"\n    if token and (token == \"\u003cusa\u003e\"):\n        if word and (word.lower() == \"usa\"): return \"ok\"\n    elif token and (token == \"\u003crussia\u003e\"):\n        if word and (word.lower() == \"russia\"): return \"ok\"\n    return \"no\"\n```\n\n---\n\n### Environment variables\n\n- 
All parameters can be passed through environment variables. Variables should begin with the prefix **ALM_** and must be written in upper case; their names should correspond to the application parameters.\n- If both application parameters and environment variables are specified at the same time, application parameters will take precedence.\n\n```bash\n$ export ALM_SMOOTHING=wittenbell\n\n$ export ALM_W-ARPA=./lm.arpa\n```\n\n- Example of a JSON configuration file\n\n```json\n{\n  \"size\": 3,\n  \"debug\": 1,\n  \"allow-unk\": true,\n  \"interpolate\": true,\n  \"method\": \"train\",\n  \"w-map\": \"./lm.map\",\n  \"w-arpa\": \"./lm.arpa\",\n  \"corpus\": \"./text.txt\",\n  \"w-vocab\": \"./lm.vocab\",\n  \"w-ngram\": \"./lm.ngrams\",\n  \"smoothing\": \"wittenbell\",\n  \"alphabet\": \"abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя\"\n}\n```\n\n---\n\n## Examples\n\n![Program operation example](https://raw.githubusercontent.com/anyks/alm/master/site/img/screen1.png \"Program operation example\")\n\n### Language Model training example\n\n**Smoothing Algorithm: Witten-Bell, single-file build from a JSON config**\n```bash\n$ ./alm -r-json ./config.json\n```\n\n**Smoothing Algorithm: Witten-Bell, single-file build**\n```bash\n$ ./alm -alphabet \"abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя\" -size 3 -smoothing wittenbell -method train -debug 1 -w-arpa ./lm.arpa -w-map ./lm.map -w-vocab ./lm.vocab -w-ngram ./lm.ngrams -allow-unk -interpolate -corpus ./text.txt\n```\n\n**Smoothing Algorithm: Absolute discounting, build from a group of files**\n```bash\n$ ./alm -alphabet \"abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя\" -size 3 -smoothing cdiscount -discount 0.3 -method train -debug 1 -w-arpa ./lm.arpa -w-map ./lm.map -w-vocab ./lm.vocab -w-ngram ./lm.ngrams -allow-unk -interpolate -corpus ./corpus -ext txt\n```\n\n**Smoothing Algorithm: Additive, build from a group of files**\n```bash\n$ ./alm -alphabet 
\"abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя\" -size 3 -smoothing addsmooth -delta 0.3 -method train -debug 1 -w-arpa ./lm.arpa -w-map ./lm.map -w-vocab ./lm.vocab -w-ngram ./lm.ngrams -allow-unk -interpolate -corpus ./corpus -ext txt\n```\n\n**Smoothing Algorithm: Kneser-Ney, build from a group of files**\n```bash\n$ ./alm -alphabet \"abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя\" -size 3 -smoothing kneserney -kneserney-modified -kneserney-prepares -method train -debug 1 -w-arpa ./lm.arpa -w-map ./lm.map -w-vocab ./lm.vocab -w-ngram ./lm.ngrams -allow-unk -interpolate -corpus ./corpus -ext txt\n```\n\n**Smoothing Algorithm: Good-Turing, build from a group of files with export to a binary container**\n```bash\n$ ./alm -alphabet \"abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя\" -size 3 -smoothing goodturing -method train -debug 1 -w-arpa ./lm.arpa -w-map ./lm.map -w-vocab ./lm.vocab -w-ngram ./lm.ngrams -allow-unk -interpolate -corpus ./corpus -ext txt -w-bin ./lm.alm -bin-aes 128 -bin-password 911 -bin-name test -bin-lictype MIT -w-bin-arpa -w-bin-utokens -w-bin-options -w-bin-preword -w-bin-badwords -w-bin-goodwords\n```\n\n**Smoothing Algorithm: Witten-Bell, build from a binary container**\n```bash\n$ ./alm -r-bin ./lm.alm -bin-aes 128 -bin-password 911 -method train -debug 1 -size 3 -smoothing wittenbell -w-arpa ./lm.arpa\n```\n\n### ARPA patch example\n\n```bash\n$ ./alm -alphabet \"abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя\" -size 3 -smoothing wittenbell -method repair -debug 1 -w-arpa ./lm2.arpa -allow-unk -interpolate -r-arpa ./lm1.arpa\n```\n\n### Example of removing n-grams with a frequency lower than backoff\n```bash\n$ ./alm -alphabet \"abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя\" -size 3 -smoothing wittenbell -method sweep -debug 1 -w-arpa ./lm2.arpa -allow-unk -interpolate -r-arpa ./lm1.arpa\n```\n\n### Example of merge raw data\n```bash\n$ ./alm -alphabet 
\"abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя\" -size 3 -smoothing wittenbell -method merge -debug 1 -r-map ./path -r-vocab ./path -w-map ./lm.map -w-vocab ./lm.vocab\n```\n\n### ARPA pruning example\n```bash\n$ ./alm -alphabet \"abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя\" -size 3 -smoothing wittenbell -method aprune -debug 1 -w-arpa ./lm2.arpa -allow-unk -r-map ./lm.map -r-vocab ./lm.vocab -aprune-threshold 0.003 -aprune-max-gram 2\n```\n\n### Vocab pruning example\n\n```bash\n$ ./alm -alphabet \"abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя\" -size 3 -smoothing wittenbell -method vprune -debug 1 -w-arpa ./lm2.arpa -allow-unk -w-vocab ./lm2.vocab -r-map ./lm.map -r-vocab ./lm.vocab -vprune-wltf -9.11\n```\n\n```bash\n$ ./alm -alphabet \"abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя\" -size 3 -smoothing wittenbell -method vprune -debug 1 -w-arpa ./lm2.arpa -allow-unk -w-vocab ./lm2.vocab -r-map ./lm.map -r-vocab ./lm.vocab -vprune-oc 5892\n```\n\n```bash\n$ ./alm -alphabet \"abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя\" -size 3 -smoothing wittenbell -method vprune -debug 1 -w-arpa ./lm2.arpa -allow-unk -w-vocab ./lm2.vocab -r-map ./lm.map -r-vocab ./lm.vocab -vprune-dc 624\n```\n\n```bash\n$ ./alm -alphabet \"abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя\" -size 3 -smoothing wittenbell -method vprune -debug 1 -w-arpa ./lm2.arpa -allow-unk -w-vocab ./lm2.vocab -r-map ./lm.map -r-vocab ./lm.vocab -vprune-oc 5892 -vprune-dc 624\n```\n\n\u003e **Vocabulary pruning** - removes low-frequency words that are supposed to contain **errors/typos**. 
Pruning is done according to the threshold of the **wltf**, **oc** or **dc** parameters.\n\n### An example of detecting and correcting words consisting of mixed dictionaries\n\n```bash\n$ ./alm -alphabet \"abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя\" -size 3 -smoothing wittenbell -method train -debug 1 -w-arpa ./lm.arpa -w-map ./lm.map -w-vocab ./lm.vocab -w-ngram ./lm.ngrams -allow-unk -reset-unk -interpolate -mixed-dicts -corpus ./text.txt -mix-restwords ./restwords.txt\n```\n\n\u003e Words in the text that contain typos in the form of similar letters of the alphabet of another language will be corrected if there are letters to replace in [restwords.txt](https://github.com/anyks/alm/#file-of-similar-letters-in-different-dictionaries).\n\n### Binary container information\n```bash\n$ ./alm -r-bin ./lm.alm -bin-aes 128 -bin-password 911 -method info\n```\n\n### ARPA modification example\n\n**Adding n-gram to ARPA**\n```bash\n$ ./alm -alphabet \"abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя\" -size 3 -smoothing wittenbell -method modify -modify emplace -modify-file ./app.txt -debug 1 -w-arpa ./lm.arpa -allow-unk -interpolate -r-map ./lm.map -r-vocab ./lm.vocab\n```\n\n**Changing n-gram frequencies in ARPA**\n```bash\n$ ./alm -alphabet \"abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя\" -size 3 -smoothing wittenbell -method modify -modify change -modify-file ./chg.txt -debug 1 -w-arpa ./lm.arpa -allow-unk -interpolate -r-map ./lm.map -r-vocab ./lm.vocab\n```\n\n**Removing n-gram from ARPA**\n```bash\n$ ./alm -alphabet \"abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя\" -size 3 -smoothing wittenbell -method modify -modify remove -modify-file ./rm.txt -debug 1 -w-arpa ./lm.arpa -allow-unk -interpolate -r-map ./lm.map -r-vocab ./lm.vocab\n```\n\n**Changing n-gram in ARPA**\n```bash\n$ ./alm -alphabet \"abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя\" -size 3 -smoothing wittenbell -method modify -modify 
replace -modify-file ./rep.txt -debug 1 -w-arpa ./lm.arpa -allow-unk -interpolate -r-map ./lm.map -r-vocab ./lm.vocab\n```\n\n### Training with preprocessing of received words\n\n```bash\n$ ./alm -alphabet \"abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя\" -size 3 -smoothing wittenbell -method train -debug 1 -w-arpa ./lm.arpa -w-map ./lm.map -w-vocab ./lm.vocab -w-ngram ./lm.ngrams -allow-unk -interpolate -corpus ./corpus -ext txt -word-script ./wordTest.py\n```\n\n\u003e Sometimes it is necessary to change a word before it is added to ARPA; this can be done using the script [**wordTest.py**](https://github.com/anyks/alm#the-python-script-format-to-preprocess-the-received-words): the word and its context will be passed into the script.\n\n### Training using your own features\n\n```bash\n$ ./alm -alphabet \"abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя\" -size 3 -smoothing wittenbell -method train -debug 1 -w-arpa ./lm.arpa -w-map ./lm.map -w-vocab ./lm.vocab -w-ngram ./lm.ngrams -allow-unk -interpolate -corpus ./corpus -ext txt -utokens \"usa|russia\" -utoken-script ./utokenTest.py\n```\n\n\u003e The example adds its own features **usa** and **russia**; when processing the text, all words that the script [**utokenTest.py**](https://github.com/anyks/alm#the-python-script-format-to-define-the-word-features) marks as a feature will be added to ARPA under the feature name.\n\n### Example of disabling token identification\n\n**Smoothing algorithm: Witten-Bell, assembly with disabled tokens**\n```bash\n$ ./alm -alphabet \"abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя\" -size 3 -smoothing wittenbell -method train -debug 1 -w-arpa ./lm.arpa -w-map ./lm.map -w-vocab ./lm.vocab -w-ngram ./lm.ngrams -allow-unk -reset-unk -interpolate -tokens-disable \"num|url|abbr|date|time|anum|math|rnum|specl|range|aprox|score|dimen|fract|punct|isolat\" -corpus ./text.txt\n```\n\n\u003e Here is the **rnum** token, which is a Roman number, but is not used as an 
**Smoothing algorithm: Witten-Bell, build with all tokens disabled**
```bash
$ ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -size 3 -smoothing wittenbell -method train -debug 1 -w-arpa ./lm.arpa -w-map ./lm.map -w-vocab ./lm.vocab -w-ngram ./lm.ngrams -allow-unk -reset-unk -interpolate -tokens-all-disable -corpus ./text.txt
```

> In this example, identification of all tokens is disabled; the disabled tokens will be added to ARPA as ordinary words.

### Example of identifying tokens as 〈unk〉

**Smoothing algorithm: Witten-Bell, build with tokens identified as 〈unk〉**
```bash
$ ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -size 3 -smoothing wittenbell -method train -debug 1 -w-arpa ./lm.arpa -w-map ./lm.map -w-vocab ./lm.vocab -w-ngram ./lm.ngrams -allow-unk -reset-unk -interpolate -tokens-unknown "num|url|abbr|date|time|anum|math|rnum|specl|range|aprox|score|dimen|fract|punct|isolat" -corpus ./text.txt
```

> Note the **rnum** token, which denotes a Roman numeral but is not used as an independent token.

**Smoothing algorithm: Witten-Bell, build with all tokens identified as 〈unk〉**
```bash
$ ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -size 3 -smoothing wittenbell -method train -debug 1 -w-arpa ./lm.arpa -w-map ./lm.map -w-vocab ./lm.vocab -w-ngram ./lm.ngrams -allow-unk -reset-unk -interpolate -tokens-all-unknown -corpus ./text.txt
```

> This example identifies all tokens as 〈unk〉.

### Training using a whitelist

```bash
$ ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -size 3 -smoothing wittenbell -method train -debug 1 -w-arpa ./lm.arpa -w-map ./lm.map -w-vocab ./lm.vocab -w-ngram ./lm.ngrams -allow-unk -interpolate -corpus ./corpus -ext txt -goodwords ./goodwords.txt
```

> If you specify a whitelist during training, all words on the whitelist will be forcibly added to ARPA.

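Conceptually, the whitelist above and the blacklist described next adjust which words survive into ARPA: whitelisted words are always kept, while blacklisted words are treated as the unknown token. A hedged Python sketch of these semantics; the function name and data are illustrative and not part of the alm API:

```python
# Hedged sketch of whitelist/blacklist semantics: whitelisted words are
# always kept, blacklisted words become the unknown token. Illustrative
# only; this is not the alm implementation.
def apply_word_lists(tokens, goodwords, badwords, vocab):
    result = []
    for word in tokens:
        if word in badwords:
            result.append("<unk>")       # blacklisted -> unknown token
        elif word in goodwords or word in vocab:
            result.append(word)          # whitelisted or in-vocabulary
        else:
            result.append("<unk>")       # out-of-vocabulary
    return result

print(apply_word_lists(["hello", "spam", "rare"], {"rare"}, {"spam"}, {"hello"}))
# -> ['hello', '<unk>', 'rare']
```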
### Training using a blacklist

```bash
$ ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -size 3 -smoothing wittenbell -method train -debug 1 -w-arpa ./lm.arpa -w-map ./lm.map -w-vocab ./lm.vocab -w-ngram ./lm.ngrams -allow-unk -interpolate -corpus ./corpus -ext txt -badwords ./badwords.txt
```

> If you specify a blacklist during training, all words on the blacklist will be treated as the token **〈unk〉**.

### Training with an unknown word

```bash
$ ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -size 3 -smoothing wittenbell -method train -debug 1 -w-arpa ./lm.arpa -w-map ./lm.map -w-vocab ./lm.vocab -w-ngram ./lm.ngrams -allow-unk -interpolate -corpus ./corpus -ext txt -unknown-word goga
```

> In this example, the token **〈unk〉** in ARPA will be replaced by the word specified in the [-unknown-word | --unknown-word=〈value〉] parameter; in our case, the word **goga**.

### Text tokenization

**Generating a json file from text**
```bash
$ ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -method tokens -debug 1 -r-tokens-text ./text.txt -w-tokens-json ./tokens.json
```

**Correction of text files**
```bash
$ ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -method tokens -debug 1 -r-tokens-text ./text.txt -w-tokens-text ./text.txt
```

**Generating text from a json file**
```bash
$ ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -method tokens -debug 1 -r-tokens-json ./tokens.json -w-tokens-text ./text.txt
```

**Generating json files from a group of texts**
```bash
$ ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -method tokens -debug 1 -r-tokens-path ./path_text -w-tokens-path ./path_json -ext txt
```

**Generating texts from a group of json files**
```bash
$ ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -method tokens -debug 1 -r-tokens-path ./path_json -w-tokens-path ./path_text -ext json
```

**Generating json from a text string**
```bash
$ echo 'Hello World?' | ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -method tokens
```

**Generating a text string from json**
```bash
$ echo '[["Hello","World","?"]]' | ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -method tokens
```

### Perplexity calculation
```bash
$ echo "неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика...." | ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -method ppl -debug 1 -r-arpa ./lm.arpa -confidence
```

```bash
$ ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -method ppl -debug 1 -r-arpa ./lm.arpa -confidence -r-text ./text.txt -threads 0
```

### Checking context in text
**Smart checking**
```bash
$ echo "<s> Сегодня сыграл и в Олега ударил яркий прожектор патрульный трактор с корпоративным сектором </s>" | ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -method checktext -debug 1 -r-arpa ./lm.arpa -confidence
```

**Smart checking with an n-gram step size of 3**
```bash
$ echo "<s> Сегодня сыграл и в Олега ударил яркий прожектор патрульный трактор с корпоративным сектором </s>" | ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -method checktext -debug 1 -step 3 -r-arpa ./lm.arpa -confidence
```

**Accurate checking**
```bash
$ echo "<s> в Олега ударил яркий </s>" | ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -method checktext -debug 1 -r-arpa ./lm.arpa -confidence -accurate
```

**Checking from a file**
```bash
$ ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -method checktext -debug 1 -r-arpa ./lm.arpa -step 3 -confidence -r-text ./text.txt -w-text ./checks.txt -threads 0
```

```bash
$ ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -method checktext -debug 1 -r-arpa ./lm.arpa -accurate -confidence -r-text ./text.txt -w-text ./checks.txt -threads 0
```

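For reference, perplexity in ARPA-style models is derived from the per-word log10 probabilities: PPL = 10^(−Σ log₁₀ p / N). A small illustration with invented probabilities (not output of alm):

```python
# Illustration of how perplexity relates to per-word log10 probabilities in
# ARPA-style models: PPL = 10 ** (-(sum of log10 p) / N). The probabilities
# here are invented for the example.
def perplexity(log10_probs):
    n = len(log10_probs)
    return 10 ** (-sum(log10_probs) / n)

print(perplexity([-1.0, -2.0, -3.0]))  # mean log10 prob is -2.0 -> 100.0
```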
### Fixing word case
```bash
$ echo "неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика...." | ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -method fixcase -debug 1 -r-arpa ./lm.arpa -confidence
```

```bash
$ ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -method fixcase -debug 1 -r-arpa ./lm.arpa -confidence -r-text ./text.txt -w-text ./fix.txt -threads 0
```

### Checking n-gram counts
```bash
$ echo "неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика...." | ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -method counts -debug 1 -r-arpa ./lm.arpa -confidence
```

```bash
$ echo "неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика...." | ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -method counts -ngrams bigram -debug 1 -r-arpa ./lm.arpa -confidence
```

```bash
$ echo "неожиданно из подворотни в Олега ударил яркий прожектор патрульный трактор???с лязгом выкатился и остановился возле мальчика...." | ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -method counts -ngrams trigram -debug 1 -r-arpa ./lm.arpa -confidence
```

```bash
$ ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -method counts -ngrams bigram -debug 1 -r-arpa ./lm.arpa -confidence -r-text ./text.txt -w-text ./counts.txt -threads 0
```

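The counts reported above are occurrence counts of n-grams over the tokenized text. A conceptual sketch of the underlying idea (not the alm implementation):

```python
from collections import Counter

# Conceptual sketch of n-gram counting over a tokenized sentence; this is
# the idea behind counting bigrams/trigrams, not the alm implementation.
def count_ngrams(tokens, order):
    return Counter(tuple(tokens[i:i + order])
                   for i in range(len(tokens) - order + 1))

bigrams = count_ngrams(["a", "b", "a", "b"], 2)
print(bigrams[("a", "b")], bigrams[("b", "a")])  # 2 1
```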
### Searching for n-grams in text

```bash
$ echo "Особое место занимает чудотворная икона Лобзание Христа Иудою" | ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -method find -debug 1 -r-arpa ./lm.arpa -confidence
```

```bash
$ ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -method find -debug 1 -r-arpa ./lm.arpa -confidence -r-text ./text.txt -w-text ./found.txt -threads 0
```

### Sentence generation

```bash
$ ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -method sentences -gen 5 -debug 1 -r-arpa ./lm.arpa -confidence -w-text ./sentences.txt
```

### Mixing language models

**Static mixing**
```bash
$ ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -method mix -mix static -debug 1 -r-arpa ./lm1.arpa -mix-arpa1 ./lm2.arpa -mix-lambda1 0.5 -w-arpa ./lm.arpa -confidence -mix-backward
```

**Bayes mixing**
```bash
$ ./alm -alphabet "abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя" -method mix -mix bayes -debug 1 -r-arpa ./lm1.arpa -mix-arpa1 ./lm2.arpa -mix-lambda1 0.5 -w-arpa ./lm.arpa -confidence -mix-bayes-scale 0.5 -mix-bayes-length 3
```

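Static mixing, as in the `-mix static` example above, is conventionally a linear interpolation of the two models' probabilities with the weight from `-mix-lambda1`: p(w|h) = λ₁·p₁(w|h) + (1 − λ₁)·p₂(w|h). A one-line illustration of that assumed semantics (the exact formula used by alm may differ):

```python
# Assumed semantics of static mixing with `-mix-lambda1 0.5`: a linear
# interpolation p = lambda1*p1 + (1 - lambda1)*p2 of two models'
# probabilities. Illustration only; not the alm implementation.
def mix_static(p1, p2, lambda1):
    return lambda1 * p1 + (1.0 - lambda1) * p2

print(mix_static(0.2, 0.4, 0.5))  # ≈ 0.3
```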
* * *

## License

![MIT License](http://opensource.org/trademarks/opensource/OSI-Approved-License-100x137.png "MIT License")

This software is licensed under the [MIT License](http://opensource.org/licenses/MIT):

Copyright © 2020 [Yuriy Lobarev](https://anyks.com)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

* * *

## Contact Info

If you have questions regarding the library, I would like to invite you to [open an issue on GitHub](https://github.com/anyks/alm/issues/new/choose). Please describe your request, problem, or question in as much detail as possible, and also mention the version of the library you are using, as well as the versions of your compiler and operating system. Opening an issue on GitHub allows other users and contributors of this library to collaborate.

---

[Yuriy Lobarev](https://anyks.com) <forman@anyks.com>