{"id":19428889,"url":"https://github.com/eleutherai/tokengrams","last_synced_at":"2025-07-17T01:32:59.598Z","repository":{"id":215339647,"uuid":"734397698","full_name":"EleutherAI/tokengrams","owner":"EleutherAI","description":"Efficiently computing \u0026 storing token n-grams from large corpora","archived":false,"fork":false,"pushed_at":"2024-08-16T04:45:40.000Z","size":1509,"stargazers_count":13,"open_issues_count":1,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-08-16T05:45:50.768Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/EleutherAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-12-21T15:22:19.000Z","updated_at":"2024-08-16T04:45:39.000Z","dependencies_parsed_at":"2024-01-12T01:38:49.722Z","dependency_job_id":"709d28ff-af01-416c-ba0e-0e685239edbc","html_url":"https://github.com/EleutherAI/tokengrams","commit_stats":null,"previous_names":["eleutherai/tokengrams"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EleutherAI%2Ftokengrams","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EleutherAI%2Ftokengrams/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EleutherAI%2Ftokengrams/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EleutherAI%2Ftokengrams/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/owners/EleutherAI","download_url":"https://codeload.github.com/EleutherAI/tokengrams/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223961111,"owners_count":17232251,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-10T14:17:05.348Z","updated_at":"2024-11-10T14:17:06.237Z","avatar_url":"https://github.com/EleutherAI.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Tokengrams\nTokengrams allows you to efficiently compute $n$-gram statistics for pre-tokenized text corpora used to train large language models. 
It does this not by explicitly pre-computing the $n$-gram counts for fixed $n$, but by creating a [suffix array](https://en.wikipedia.org/wiki/Suffix_array) index which allows you to efficiently compute the count of an $n$-gram on the fly for any $n$.\n\nOur code also allows you to turn your suffix array index into an efficient $n$-gram language model, which can be used to generate text or compute the perplexity of a given text.\n\nThe backend is written in Rust, and the Python bindings are generated using [PyO3](https://github.com/PyO3/pyo3).\n\n# Installation\n\n```bash\npip install tokengrams\n```\n\n# Usage\n\n[Full text generation demo](https://colab.research.google.com/drive/1CEHoIjLboGl8YPbIqnWJlPYMm1wVOrrj?usp=sharing)\n\n## Preparing data\n\nUse a dataset of u16 or u32 tokens, or prepare one from a HuggingFace dataset.\n\n```python\n# Get pre-tokenized dataset\nfrom huggingface_hub import HfApi, hf_hub_download\n\nhf_hub_download(\n  repo_id=\"EleutherAI/pile-standard-pythia-preshuffled\", \n  repo_type=\"dataset\", \n  filename=\"document-00000-of-00020.bin\", \n  local_dir=\".\"\n)\n```\n```python\n# Tokenize HF dataset\nfrom tokengrams import tokenize_hf_dataset\nfrom datasets import load_dataset\nfrom transformers import AutoTokenizer\n\ntokenize_hf_dataset(\n    dataset=load_dataset(\"EleutherAI/lambada_openai\", \"en\"),\n    tokenizer=AutoTokenizer.from_pretrained(\"EleutherAI/pythia-160m\"),\n    output_path=\"lambada.bin\",\n    text_key=\"text\",\n    append_eod=True,\n    workers=1,\n)\n```\n\n## Building an index\n```python\nfrom tokengrams import MemmapIndex\n\n# Create a new index from an on-disk corpus of u16 tokens and save it to a .idx file. 
\n# Set verbose to true to include a progress bar for the index sort.\nindex = MemmapIndex.build(\n    \"document-00000-of-00020.bin\",\n    \"document-00000-of-00020.idx\",\n    vocab=2**16,\n    verbose=True\n)\n\n# True for any valid index.\nprint(index.is_sorted())\n  \n# Get the count of \"hello world\" in the corpus.\nfrom transformers import AutoTokenizer\n\ntokenizer = AutoTokenizer.from_pretrained(\"EleutherAI/pythia-160m\")\nprint(index.count(tokenizer.encode(\"hello world\")))\n\n# You can now load the index from disk later using __init__\nindex = MemmapIndex(\n    \"document-00000-of-00020.bin\",\n    \"document-00000-of-00020.idx\",\n    vocab=2**16\n)\n```\n\n## Using an index\n\n```python\n# Count how often each token in the corpus succeeds \"hello world\".\nprint(index.count_next(tokenizer.encode(\"hello world\")))\nprint(index.batch_count_next(\n    [tokenizer.encode(\"hello world\"), tokenizer.encode(\"hello universe\")]\n))\n\n# Get smoothed probabilities for query continuations\nprint(index.smoothed_probs(tokenizer.encode(\"hello world\")))\nprint(index.batch_smoothed_probs(\n    [tokenizer.encode(\"hello world\"), tokenizer.encode(\"hello universe\")]\n))\n\n# Autoregressively sample 10 tokens using 5-gram language statistics. 
Initial\n# gram statistics are derived from the query, with lower order gram statistics used \n# until the sequence contains at least 5 tokens.\nprint(index.sample_unsmoothed(tokenizer.encode(\"hello world\"), n=5, k=10, num_samples=20))\nprint(index.sample_smoothed(tokenizer.encode(\"hello world\"), n=5, k=10, num_samples=20))\n\n# Query whether the corpus contains \"hello world\"\nprint(index.contains(tokenizer.encode(\"hello world\")))\n\n# Get all n-grams beginning with \"hello world\" in the corpus\nprint(index.positions(tokenizer.encode(\"hello world\")))\n```\n\n## Scaling\n\nCorpora small enough to fit in memory can use an InMemoryIndex:\n\n```python\nfrom tokengrams import InMemoryIndex\n\ntokens = [0, 1, 2, 3, 4]\nindex = InMemoryIndex(tokens, vocab=5)\n```\n\nLarger corpora must use a MemmapIndex.\n\nSome systems struggle with memory mapping extremely large tables (e.g. 40 billion tokens), causing unexpected bus errors. To prevent this, split the corpus into shards, then use a ShardedMemmapIndex to sort and query the table shard by shard:\n\n```python\nfrom tokengrams import ShardedMemmapIndex\nfrom huggingface_hub import HfApi, hf_hub_download\n\nfiles = [\n    file for file in HfApi().list_repo_files(\"EleutherAI/pile-standard-pythia-preshuffled\", repo_type=\"dataset\")\n    if file.endswith('.bin')\n]\n\nindex_paths = []\nfor file in files:\n    hf_hub_download(\"EleutherAI/pile-standard-pythia-preshuffled\", repo_type=\"dataset\", filename=file, local_dir=\".\")\n    # removesuffix (Python 3.9+) drops only the literal .bin suffix; rstrip would strip a character set\n    index_paths.append((file, f'{file.removesuffix(\".bin\")}.idx'))\n\nindex = ShardedMemmapIndex.build(index_paths, vocab=2**16, verbose=True)\n```\n\n### Tokens\n\nTokengrams builds indices from on-disk corpora of either u16 or u32 tokens, supporting a maximum vocabulary size of 2\u003csup\u003e32\u003c/sup\u003e. In practice, however, vocabulary size is limited by the length of the largest word size vector the machine can allocate in memory. 
\n\nCorpora with vocabulary sizes smaller than 2\u003csup\u003e16\u003c/sup\u003e must use u16 tokens.\n\n## Performance\n\nIndex build times for in-memory corpora scale inversely with the number of available CPU threads, whereas an index that reads from or writes to a file is likely to be IO bound.\n\nThe time complexities of count_next(query) and sample_unsmoothed(query) are O(n log n), where n is approximately the number of completions for the query. The time complexity of sample_smoothed(query) is O(m n log n), where m is the n-gram order.\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003cimg src=\"./tokengrams/benchmark/MemmapIndex_build_times.png\" alt=\"Sample build times for an IO bound index\"\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"./tokengrams/benchmark/MemmapIndex_count_next_times.png\" alt=\"Sample count_next times for an IO bound index\"\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n# Development\n\n```bash\ncargo build\ncargo test\n```\n\nDevelop Python bindings:\n\n```bash\npip install maturin\nmaturin develop\npytest\n```\n\n# Support\n\nThe best way to get support is to open an issue on this repo or post in #interp-across-time in the [EleutherAI Discord server](https://discord.gg/eleutherai). If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feleutherai%2Ftokengrams","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feleutherai%2Ftokengrams","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feleutherai%2Ftokengrams/lists"}