{"id":13679911,"url":"https://github.com/kampersanda/tongrams-rs","last_synced_at":"2025-04-23T22:27:56.749Z","repository":{"id":45034141,"uuid":"441872502","full_name":"kampersanda/tongrams-rs","owner":"kampersanda","description":"Rust library providing fast language model queries in compressed space","archived":false,"fork":false,"pushed_at":"2022-10-01T22:01:41.000Z","size":6824,"stargazers_count":24,"open_issues_count":1,"forks_count":3,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-30T04:11:19.264Z","etag":null,"topics":["compression","elias-fano","language-model","ngrams","nlp","trie"],"latest_commit_sha":null,"homepage":"https://docs.rs/tongrams","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kampersanda.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-12-26T11:19:00.000Z","updated_at":"2025-02-23T14:34:03.000Z","dependencies_parsed_at":"2022-09-26T20:40:51.778Z","dependency_job_id":null,"html_url":"https://github.com/kampersanda/tongrams-rs","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kampersanda%2Ftongrams-rs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kampersanda%2Ftongrams-rs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kampersanda%2Ftongrams-rs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kampersanda%2Ftongrams-rs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kampersanda","download_url":"https://codelo
ad.github.com/kampersanda/tongrams-rs/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250525650,"owners_count":21445067,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["compression","elias-fano","language-model","ngrams","nlp","trie"],"created_at":"2024-08-02T13:01:10.954Z","updated_at":"2025-04-23T22:27:56.702Z","avatar_url":"https://github.com/kampersanda.png","language":"Rust","funding_links":[],"categories":["Rust"],"sub_categories":[],"readme":"# `tongrams-rs`: Tons of *N*-grams in Rust\n\n![](https://github.com/kampersanda/tongrams-rs/actions/workflows/rust.yml/badge.svg)\n[![Documentation](https://docs.rs/tongrams/badge.svg)](https://docs.rs/tongrams)\n[![Crates.io](https://img.shields.io/crates/v/tongrams.svg)](https://crates.io/crates/tongrams)\n[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/kampersanda/tongrams-rs/blob/master/LICENSE)\n\nThis is a Rust port of [`tongrams`](https://github.com/jermp/tongrams) to index and query large language models in compressed space, in which the data structures are presented in the following papers:\n\n - Giulio Ermanno Pibiri and Rossano Venturini, [Efficient Data Structures for Massive N-Gram Datasets](https://doi.org/10.1145/3077136.3080798). In *Proceedings of the 40th ACM Conference on Research and Development in Information Retrieval (SIGIR 2017)*, pp. 615-624.\n \n - Giulio Ermanno Pibiri and Rossano Venturini, [Handling Massive N-Gram Datasets Efficiently](https://doi.org/10.1145/3302913). 
*ACM Transactions on Information Systems (TOIS)*, 37.2 (2019): 1-41.\n\n## What it can do\n\n - Store *N*-gram language models with frequency counts.\n\n - Look up *N*-grams to get the frequency counts.\n\n## Features\n\n - **Compressed language model.** `tongrams-rs` can store large *N*-gram language models in very compressed space. For example, the word *N*-gram datasets (*N*=1..5) in `test_data` are stored in only 2.6 bytes per gram.\n  \n - **Time and memory efficiency.** `tongrams-rs` employs *Elias-Fano Trie*, which cleverly encodes a trie data structure consisting of *N*-grams through *Elias-Fano codes*, enabling fast lookups in compressed space.\n  \n - **Pure Rust.** `tongrams-rs` is written only in Rust and can be easily plugged into your Rust code.\n\n## Input data format\n\nThe file format of *N*-gram counts files is the same as that used in [`tongrams`](https://github.com/jermp/tongrams), a modified [Google format](http://storage.googleapis.com/books/ngrams/books/datasetsv2.html), where\n\n - one separate file for each distinct value of *N* (order) lists one gram per row,\n - the header row `\u003cnumber_of_grams\u003e` indicates the number of *N*-grams in the file,\n - tokens in a gram `\u003cgram\u003e` are separated by a space (e.g., `the same time`), and\n - a gram `\u003cgram\u003e` and the count `\u003ccount\u003e` are separated by a horizontal tab.\n\n```text\n\u003cnumber_of_grams\u003e\n\u003cgram1\u003e\u003cTAB\u003e\u003ccount1\u003e\n\u003cgram2\u003e\u003cTAB\u003e\u003ccount2\u003e\n\u003cgram3\u003e\u003cTAB\u003e\u003ccount3\u003e\n...\n```\n\nFor example,\n\n```text\n61516\nthe // parent\t1\nthe function is\t22\nthe function a\t4\nthe function to\t1\nthe function and\t1\n...\n```\n\n## Command line tools\n\n`tools` provides some command line tools for trying out this library. In the following, example usage is shown with the *N*-gram counts files in `test_data`, copied from [`tongrams`](https://github.com/jermp/tongrams).\n\n### 1. 
Sorting\n\nTo build the trie index, you need to sort your *N*-gram counts files.\nFirst, prepare a unigram counts file sorted by count, which makes the resulting index smaller:\n\n```\n$ cat test_data/1-grams.sorted\n8761\nthe\t3681\nis\t1869\na\t1778\nof\t1672\nto\t1638\nand\t1202\n...\n```\n\nUsing the unigram file as a vocabulary, the executable `sort_grams` sorts an *N*-gram counts file.\n\nHere, we sort an unsorted bigram counts file:\n\n```\n$ cat test_data/2-grams\n38900\nways than\t1\nmay come\t1\nfrequent causes\t1\nway has\t1\nin which\t14\n...\n```\n\nYou can sort the bigram file (in gzip format) and write `test_data/2-grams.sorted` with the following command:\n\n```\n$ cargo run --release -p tools --bin sort_grams -- -i test_data/2-grams.gz -v test_data/1-grams.sorted.gz -o test_data/2-grams.sorted\nLoading the vocabulary: \"test_data/1-grams.sorted.gz\"\nLoading the records: \"test_data/2-grams.gz\"\nSorting the records\nWriting the index into \"test_data/2-grams.sorted.gz\"\n```\n\nThe output file format can be specified with `-f`; the default is `.gz`. The resulting file looks like this:\n\n```\n$ cat test_data/2-grams.sorted\n38900\nthe //\t1\nthe function\t94\nthe if\t3\nthe code\t126\nthe compiler\t117\n...\n```\n\n\n### 2. Indexing\n\nThe executable `index` builds a language model from (sorted) *N*-gram counts files, named `\u003corder\u003e-grams.sorted.gz`, and writes it into a binary file. 
The input file format can be specified with `-f`; the default is `.gz`.\n\nFor example, the following command builds a language model from the *N*-gram counts files (*N*=1..5) in the `test_data` directory and writes it into `index.bin`.\n\n```\n$ cargo run --release -p tools --bin index -- -n 5 -i test_data -o index.bin\nInput files: [\"test_data/1-grams.sorted.gz\", \"test_data/2-grams.sorted.gz\", \"test_data/3-grams.sorted.gz\", \"test_data/4-grams.sorted.gz\", \"test_data/5-grams.sorted.gz\"]\nCounstructing the index...\nElapsed time: 0.190 [sec]\n252550 grams are stored.\nWriting the index into \"index.bin\"...\nIndex size: 659366 bytes (0.629 MiB)\nBytes per gram: 2.611 bytes\n```\n\nAs the standard output shows, the model file takes only 2.6 bytes per gram.\n\n### 3. Lookup\n\nThe executable `lookup` provides a demo for looking up *N*-grams, as follows.\n\n```\n$ cargo run --release -p tools --bin lookup -- -i index.bin \nLoading the index from \"index.bin\"...\nPerforming the lookup...\n\u003e take advantage\ncount = 8\n\u003e only 64-bit execution\ncount = 1\n\u003e Elias Fano\nNot found\n\u003e \nGood bye!\n```\n\n### 4. 
Memory statistics\n\nThe executable `stats` shows a breakdown of memory usage by component.\n\n```\n$ cargo run --release -p tools --bin stats -- -i index.bin\nLoading the index from \"index.bin\"...\n{\"arrays\":[{\"pointers\":5927,\"token_ids\":55186},{\"pointers\":19745,\"token_ids\":92416},{\"pointers\":25853,\"token_ids\":107094},{\"pointers\":28135,\"token_ids\":111994}],\"count_ranks\":[{\"count_ranks\":5350},{\"count_ranks\":12106},{\"count_ranks\":13976},{\"count_ranks\":14582},{\"count_ranks\":14802}],\"counts\":[{\"count\":296},{\"count\":136},{\"count\":72},{\"count\":56},{\"count\":56}],\"vocab\":{\"data\":151560}}\n```\n\n## Benchmark\n\nIn the `bench` directory, you can measure lookup times using the *N*-gram data in `test_data` with the following command:\n\n```\n$ RUSTFLAGS=\"-C target-cpu=native\" cargo bench\ncount_lookup/tongrams/EliasFanoTrieCountLm\n                        time:   [3.1818 ms 3.1867 ms 3.1936 ms]\n```\n\nThe reported time is the total elapsed time for looking up 5K random grams.\nThe above result was obtained on my laptop PC (Intel i7, 16GB RAM);\ni.e., `EliasFanoTrieCountLm` can look up a gram in 0.64 microseconds on average.\n\n## Todo\n\n- Add fast Elias-Fano and partitioned Elias-Fano\n- Add minimal perfect hashing\n- Add remapping\n- Support probability scores\n- Make `sucds::EliasFano` faster\n\n## Licensing\n\nThis library is free software provided under the MIT License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkampersanda%2Ftongrams-rs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkampersanda%2Ftongrams-rs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkampersanda%2Ftongrams-rs/lists"}