{"id":13591791,"url":"https://github.com/kpu/kenlm","last_synced_at":"2025-04-10T04:53:55.865Z","repository":{"id":1856900,"uuid":"2781754","full_name":"kpu/kenlm","owner":"kpu","description":"KenLM: Faster and Smaller Language Model Queries","archived":false,"fork":false,"pushed_at":"2025-03-30T17:50:54.000Z","size":6102,"stargazers_count":2584,"open_issues_count":136,"forks_count":519,"subscribers_count":69,"default_branch":"master","last_synced_at":"2025-04-03T02:57:37.492Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://kheafield.com/code/kenlm/","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"jazzband/django-robots","license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kpu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"COPYING","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2011-11-15T17:07:17.000Z","updated_at":"2025-03-31T19:25:37.000Z","dependencies_parsed_at":"2023-07-05T18:18:05.169Z","dependency_job_id":"5cc8e251-688b-476d-a2b9-9e99a75413a5","html_url":"https://github.com/kpu/kenlm","commit_stats":{"total_commits":2093,"total_committers":55,"mean_commits":"38.054545454545455","dds":"0.12756808408982323","last_synced_commit":"f6c947dc943859e265fabce886232205d0fb2b37"},"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kpu%2Fkenlm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kpu%2Fkenlm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kpu%2Fkenlm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kpu
%2Fkenlm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kpu","download_url":"https://codeload.github.com/kpu/kenlm/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248161255,"owners_count":21057553,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T16:01:02.100Z","updated_at":"2025-04-10T04:53:55.830Z","avatar_url":"https://github.com/kpu.png","language":"C++","funding_links":[],"categories":["C++","其他_NLP自然语言处理"],"sub_categories":["其他_文本生成、文本对话"],"readme":"# kenlm\n\nLanguage model inference code by Kenneth Heafield (kenlm at kheafield.com)\n\nThe website https://kheafield.com/code/kenlm/ has more documentation.  If you're a decoder developer, please download the latest version from there instead of copying from another decoder.\n\n## Compiling\nUse cmake, see [BUILDING](BUILDING) for build dependencies and more detail.\n```bash\nmkdir -p build\ncd build\ncmake ..\nmake -j 4\n```\n\n## Compiling with your own build system\nIf you want to compile with your own build system (Makefile etc) or to use as a library, there are a number of macros you can set on the g++ command line or in util/have.hh .  \n\n* `KENLM_MAX_ORDER` is the maximum order that can be loaded.  This is done to make state an efficient POD rather than a vector.  \n* `HAVE_ICU` If your code links against ICU, define this to disable the internal StringPiece and replace it with ICU's copy of StringPiece, avoiding naming conflicts.  
\n\nARPA files can be read in compressed format with these options:\n* `HAVE_ZLIB` Supports gzip.  Link with -lz.\n* `HAVE_BZLIB` Supports bzip2.  Link with -lbz2.\n* `HAVE_XZLIB` Supports xz.  Link with -llzma.\n\nNote that these macros impact only `read_compressed.cc` and `read_compressed_test.cc`.  The bjam build system will auto-detect bzip2 and xz support.  \n\n## Estimation\nlmplz estimates unpruned language models with modified Kneser-Ney smoothing.  After compiling with bjam, run\n```bash\nbin/lmplz -o 5 \u003ctext \u003etext.arpa\n```\nThe algorithm is on-disk, using an amount of memory that you specify.  See https://kheafield.com/code/kenlm/estimation/ for more.\n\nMT Marathon 2012 team members Ivan Pouzyrevsky and Mohammed Mediani contributed to the computation design and early implementation. Jon Clark contributed to the design, clarified points about smoothing, and added logging. \n\n## Filtering\n\nfilter takes an ARPA or count file and removes entries that will never be queried.  The filter criterion can be corpus-level vocabulary, sentence-level vocabulary, or sentence-level phrases.  Run\n```bash\nbin/filter\n```\nand see https://kheafield.com/code/kenlm/filter/ for more documentation.\n\n## Querying\n\nTwo data structures are supported: probing and trie.  Probing is a probing hash table with keys that are 64-bit hashes of n-grams and floats as values.  Trie is a fairly standard trie but with bit-level packing so it uses the minimum number of bits to store word indices and pointers.  The trie node entries are sorted by word index.  Probing is the fastest and uses the most memory.  Trie uses the least memory and is a bit slower.\n\nAs is the custom in language modeling, all probabilities are log base 10.\n\nWith trie, resident memory is 58% of IRST's smallest version and 21% of SRI's compact version.  Simultaneously, trie's CPU use is 81% of IRST's fastest version and 84% of SRI's fast version.  
KenLM's probing hash table implementation goes even faster at the expense of using more memory.  See https://kheafield.com/code/kenlm/benchmark/.\n\nBinary format via mmap is supported.  Run `./build_binary` to make one then pass the binary file name to the appropriate Model constructor.   \n\n## Platforms\n`murmur_hash.cc` and `bit_packing.hh` perform unaligned reads and writes that make the code architecture-dependent.  \nIt has been successfully tested on x86\\_64, x86, and PPC64.  \nARM support is reportedly working, at least on the iPhone.   \n\nRuns on Linux, OS X, Cygwin, and MinGW.  \n\nHideo Okuma and Tomoyuki Yoshimura from NICT contributed ports to ARM and MinGW.  \n\n## Decoder developers\n- I recommend copying the code and distributing it with your decoder.  However, please send improvements upstream.  \n\n- It's possible to compile the query-only code without Boost, but useful things like estimating models require Boost.\n\n- Select the macros you want, listed above under Compiling with your own build system.  \n\n- There are two build systems: compile.sh and cmake.  They're pretty simple and are intended to be reimplemented in your build system.  \n\n- Use either the interface in `lm/model.hh` or `lm/virtual_interface.hh`.  Interface documentation is in comments of `lm/virtual_interface.hh` and `lm/model.hh`.  \n\n- There are several possible data structures in `model.hh`.  Use `RecognizeBinary` in `binary_format.hh` to determine which one a user has provided.  You probably already implement feature functions as an abstract virtual base class with several children.  I suggest you co-opt this existing virtual dispatch by templatizing the language model feature implementation on the KenLM model identified by `RecognizeBinary`.  This is the strategy used in Moses and cdec.\n\n- See `lm/config.hh` for run-time tuning options.\n\n## Contributors\nContributions to KenLM are welcome.  
Please base your contributions on https://github.com/kpu/kenlm and send pull requests (or I might give you commit access).  Downstream copies in Moses and cdec are maintained by overwriting them so do not make changes there.  \n\n## Python module\nContributed by Victor Chahuneau.\n\n### Installation\n\n```bash\npip install https://github.com/kpu/kenlm/archive/master.zip\n```\n\nWhen installing with pip, the `MAX_ORDER` environment variable controls the max order with which KenLM was built.\n\n### Basic Usage\n```python\nimport kenlm\nmodel = kenlm.Model('lm/test.arpa')\nprint(model.score('this is a sentence .', bos = True, eos = True))\n```\nSee [python/example.py](python/example.py) and [python/kenlm.pyx](python/kenlm.pyx) for more, including stateful APIs.  \n\n### Building kenlm - Using vcpkg\n\nYou can download and install kenlm using the [vcpkg](https://github.com/Microsoft/vcpkg) dependency manager:\n\n    git clone https://github.com/Microsoft/vcpkg.git\n    cd vcpkg\n    ./bootstrap-vcpkg.sh\n    ./vcpkg integrate install\n    ./vcpkg install kenlm\n\nThe kenlm port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please [create an issue or pull request](https://github.com/Microsoft/vcpkg) on the vcpkg repository.\n\n---\n\nThe name was Hieu Hoang's idea, not mine.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkpu%2Fkenlm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkpu%2Fkenlm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkpu%2Fkenlm/lists"}