{"id":13861695,"url":"https://github.com/simonw/ttok","last_synced_at":"2025-04-04T21:06:02.567Z","repository":{"id":166966741,"uuid":"642509669","full_name":"simonw/ttok","owner":"simonw","description":"Count and truncate text based on tokens","archived":false,"fork":false,"pushed_at":"2024-05-02T23:37:54.000Z","size":41,"stargazers_count":320,"open_issues_count":13,"forks_count":10,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-03-28T20:05:58.715Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/simonw.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-18T18:22:59.000Z","updated_at":"2025-03-28T04:09:06.000Z","dependencies_parsed_at":"2024-01-18T04:51:52.123Z","dependency_job_id":"2534c07c-4e9b-4e6c-975a-e64b128d2607","html_url":"https://github.com/simonw/ttok","commit_stats":null,"previous_names":["simonw/ttok"],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonw%2Fttok","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonw%2Fttok/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonw%2Fttok/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonw%2Fttok/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/simonw","download_url":"https://codeload.github.com/simonw/ttok/tar.gz/refs/heads/main","host":{"name":"Gi
tHub","url":"https://github.com","kind":"github","repositories_count":247249524,"owners_count":20908212,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-05T06:01:28.210Z","updated_at":"2025-04-04T21:06:02.530Z","avatar_url":"https://github.com/simonw.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# ttok\n\n[![PyPI](https://img.shields.io/pypi/v/ttok.svg)](https://pypi.org/project/ttok/)\n[![Changelog](https://img.shields.io/github/v/release/simonw/ttok?include_prereleases\u0026label=changelog)](https://github.com/simonw/ttok/releases)\n[![Tests](https://github.com/simonw/ttok/workflows/Test/badge.svg)](https://github.com/simonw/ttok/actions?query=workflow%3ATest)\n[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/simonw/ttok/blob/master/LICENSE)\n\nCount and truncate text based on tokens\n\n## Background\n\nLarge language models such as GPT-3.5 and GPT-4 work in terms of tokens.\n\nThis tool can count tokens, using OpenAI's [tiktoken](https://github.com/openai/tiktoken) library.\n\nIt can also truncate text to a specified number of tokens.\n\nSee [llm, ttok and strip-tags—CLI tools for working with ChatGPT and other LLMs](https://simonwillison.net/2023/May/18/cli-tools-for-llms/) for more on this project.\n\n## Installation\n\nInstall this tool using `pip`:\n```bash\npip install ttok\n```\nOr using Homebrew:\n```bash\nbrew install simonw/llm/ttok\n```\n\n## Counting tokens\n\nProvide text as arguments to this tool to count tokens:\n\n```bash\nttok Hello 
world\n```\n```\n2\n```\nYou can also pipe text into the tool:\n```bash\necho -n \"Hello world\" | ttok\n```\n```\n2\n```\nHere the `-n` option prevents `echo` from adding a newline; without it you would get a token count of 3.\n\nTo pipe in text and then append extra tokens from arguments, use the `-i -` option:\n\n```bash\necho -n \"Hello world\" | ttok more text -i -\n```\n```\n6\n```\n## Different models\n\nBy default, the tokenizer model for GPT-3.5 and GPT-4 is used.\n\nTo use the model for GPT-2 and GPT-3, add `--model gpt2`:\n\n```bash\nttok boo Hello there this is -m gpt2\n```\n```\n6\n```\nCompared to GPT-3.5:\n```bash\nttok boo Hello there this is\n```\n```\n5\n```\nFurther model options are [documented here](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb).\n\n## Truncating text\n\nUse the `-t 10` or `--truncate 10` option to truncate text to a specified number of tokens:\n\n```bash\nttok This is too many tokens -t 3\n```\n```\nThis is too\n```\n\n## Viewing tokens\n\nThe `--encode` option can be used to view the integer token IDs for the incoming text:\n\n```bash\nttok Hello world --encode\n```\n```\n9906 1917\n```\nThe `--decode` option reverses this process:\n\n```bash\nttok 9906 1917 --decode\n```\n```\nHello world\n```\nAdd `--tokens` to either of these options to see a detailed breakdown of the tokens:\n\n```bash\nttok Hello world --encode --tokens\n```\n```\n[b'Hello', b' world']\n```\n\n## Available models\n\nThis is the full list of available models and their corresponding encodings. 
Model names and encoding names are valid for the `-m/--model` option.\n\n\u003c!-- [[[cog\nimport cog\nimport tiktoken\noutput = []\nfor key, value in tiktoken.model.MODEL_TO_ENCODING.items():\n    output.append(\"- `{}` (`{}`)\".format(key, value))\ncog.out(\"\\n\".join(output))\n]]] --\u003e\n- `gpt-4` (`cl100k_base`)\n- `gpt-3.5-turbo` (`cl100k_base`)\n- `gpt-3.5` (`cl100k_base`)\n- `gpt-35-turbo` (`cl100k_base`)\n- `davinci-002` (`cl100k_base`)\n- `babbage-002` (`cl100k_base`)\n- `text-embedding-ada-002` (`cl100k_base`)\n- `text-embedding-3-small` (`cl100k_base`)\n- `text-embedding-3-large` (`cl100k_base`)\n- `text-davinci-003` (`p50k_base`)\n- `text-davinci-002` (`p50k_base`)\n- `text-davinci-001` (`r50k_base`)\n- `text-curie-001` (`r50k_base`)\n- `text-babbage-001` (`r50k_base`)\n- `text-ada-001` (`r50k_base`)\n- `davinci` (`r50k_base`)\n- `curie` (`r50k_base`)\n- `babbage` (`r50k_base`)\n- `ada` (`r50k_base`)\n- `code-davinci-002` (`p50k_base`)\n- `code-davinci-001` (`p50k_base`)\n- `code-cushman-002` (`p50k_base`)\n- `code-cushman-001` (`p50k_base`)\n- `davinci-codex` (`p50k_base`)\n- `cushman-codex` (`p50k_base`)\n- `text-davinci-edit-001` (`p50k_edit`)\n- `code-davinci-edit-001` (`p50k_edit`)\n- `text-similarity-davinci-001` (`r50k_base`)\n- `text-similarity-curie-001` (`r50k_base`)\n- `text-similarity-babbage-001` (`r50k_base`)\n- `text-similarity-ada-001` (`r50k_base`)\n- `text-search-davinci-doc-001` (`r50k_base`)\n- `text-search-curie-doc-001` (`r50k_base`)\n- `text-search-babbage-doc-001` (`r50k_base`)\n- `text-search-ada-doc-001` (`r50k_base`)\n- `code-search-babbage-code-001` (`r50k_base`)\n- `code-search-ada-code-001` (`r50k_base`)\n- `gpt2` (`gpt2`)\n- `gpt-2` (`gpt2`)\n\u003c!-- [[[end]]] --\u003e\n\n## ttok --help\n\n\u003c!-- [[[cog\nfrom ttok import cli\nfrom click.testing import CliRunner\nrunner = CliRunner()\nresult = runner.invoke(cli.cli, [\"--help\"])\nhelp = result.output.replace(\"Usage: cli\", \"Usage: ttok\")\ncog.out(\n    
\"```\\n{}\\n```\".format(help)\n)\n]]] --\u003e\n```\nUsage: ttok [OPTIONS] [PROMPT]...\n\n  Count and truncate text based on tokens\n\n  To count tokens for text passed as arguments:\n\n      ttok one two three\n\n  To count tokens from stdin:\n\n      cat input.txt | ttok\n\n  To truncate to 100 tokens:\n\n      cat input.txt | ttok -t 100\n\n  To truncate to 100 tokens using the gpt2 model:\n\n      cat input.txt | ttok -t 100 -m gpt2\n\n  To view token integers:\n\n      cat input.txt | ttok --encode\n\n  To convert tokens back to text:\n\n      ttok 9906 1917 --decode\n\n  To see the details of the tokens:\n\n      ttok \"hello world\" --tokens\n\n  Outputs:\n\n      [b'hello', b' world']\n\nOptions:\n  --version               Show the version and exit.\n  -i, --input FILENAME\n  -t, --truncate INTEGER  Truncate to this many tokens\n  -m, --model TEXT        Which model to use\n  --encode, --tokens      Output token integers\n  --decode                Convert token integers to text\n  --tokens                Output full tokens\n  --allow-special         Do not error on special tokens\n  --help                  Show this message and exit.\n\n```\n\u003c!-- [[[end]]] --\u003e\n\nYou can also run this command using:\n\n```bash\npython -m ttok --help\n```\n\n## Development\n\nTo contribute to this tool, first checkout the code. Then create a new virtual environment:\n\n```bash\ncd ttok\npython -m venv venv\nsource venv/bin/activate\n```\n\nNow install the dependencies and test dependencies:\n\n```bash\npip install -e '.[test]'\n```\n\nTo run the tests:\n\n```bash\npytest\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimonw%2Fttok","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsimonw%2Fttok","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimonw%2Fttok/lists"}