{"id":31541111,"url":"https://github.com/maledorak/single-token-words","last_synced_at":"2026-05-14T23:07:01.203Z","repository":{"id":262529815,"uuid":"887565626","full_name":"maledorak/single-token-words","owner":"maledorak","description":"List of single token words for LLM usage","archived":false,"fork":false,"pushed_at":"2024-12-12T18:35:05.000Z","size":2460,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-04T10:55:49.726Z","etag":null,"topics":["llm","openai","tiktoken","tokenizer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/maledorak.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-12T22:32:56.000Z","updated_at":"2024-12-12T18:35:10.000Z","dependencies_parsed_at":"2024-11-12T23:37:57.383Z","dependency_job_id":null,"html_url":"https://github.com/maledorak/single-token-words","commit_stats":null,"previous_names":["maledorak/single-token-words"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/maledorak/single-token-words","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maledorak%2Fsingle-token-words","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maledorak%2Fsingle-token-words/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maledorak%2Fsingle-token-words/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maledorak%2Fsingle-token-wor
ds/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/maledorak","download_url":"https://codeload.github.com/maledorak/single-token-words/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maledorak%2Fsingle-token-words/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33046787,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-13T13:14:54.681Z","status":"online","status_checked_at":"2026-05-14T02:00:06.663Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm","openai","tiktoken","tokenizer"],"created_at":"2025-10-04T10:55:36.552Z","updated_at":"2026-05-14T23:07:01.194Z","avatar_url":"https://github.com/maledorak.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Single token words and names\n\nThis project finds all the words and English first names that can be encoded by a single token in different LLM tokenizers.\n\nUseful if you want to map large text chunks before sending them to an LLM.\n\n## Why do we need this?\n\nSometimes you need to send a large structured text to an LLM, like JSON, HTML, etc.\n\nIn this case, you often don't need to encode all of the data, like IDs, HTML classes, HTML URLs, etc. 
- This can even be harmful to your wallet and LLM performance!\n\nYou can instead map this data to single-token words.\n\n### Examples\n\nThe examples were produced with the [OpenAI Tokenizer](https://platform.openai.com/tokenizer).\n\n#### JSON\n\n**Note:** This is the token count of the full data:\n\n![Json with full data](./assets/example-json-full.png)\n\n**Note:** And this is the token count of the same data, but with mapped words:\n\n![Json with lite data](./assets/example-json-lite.png)\n\n**Note:** You can see that the token count is much lower. At a scale of thousands of requests, this can save you a lot of money!\n\n## How to use\n\nJust copy the output files from the [single_token_words](single_token_words) folder to your project and use them.\n\nThere are JSON and CSV versions of the files.\n\nIn the [single_token_words_info](single_token_words_info) folder you can find some info about the words.\n\n### Python\n\nYou can write a simple class that hands out unique single-token words from the file and use it to map your data.\n\n```python\nfrom typing import List\nimport json\n\nclass SingleTokenWords:\n    def __init__(self):\n        # A set guarantees that each word is handed out at most once.\n        self._words = set(self._load_words())\n\n    def _load_words(self) -\u003e List[str]:\n        with open('single_token_words.json', 'r') as file:\n            return json.load(file)\n\n    def get_word(self) -\u003e str:\n        # Removes and returns an arbitrary unused word; raises KeyError when none are left.\n        return self._words.pop()\n```\n\nor with names:\n\n```python\nfrom typing import List\nimport json\n\nclass SingleTokenNames:\n    def __init__(self):\n        self._names = set(self._load_names())\n\n    def _load_names(self) -\u003e List[str]:\n        with open('single_token_names.json', 'r') as file:\n            return json.load(file)\n\n    def get_name(self) -\u003e str:\n        return self._names.pop()\n```\n\n## Supported languages\n\n### Words\n\n- English - based on the [English-Valid-Words](https://github.com/Maximax67/English-Valid-Words) repository\n\n### Names\n\n- English - based on 
[names-dataset](https://pypi.org/project/names-dataset/) library\n\n## Supported tokenizers\n\n- openai_tiktoken\n    - cl100k_base (gpt-4, gpt-3.5-turbo)\n    - o200k_base (gpt-4o, gpt-4o-mini, o1)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaledorak%2Fsingle-token-words","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaledorak%2Fsingle-token-words","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaledorak%2Fsingle-token-words/lists"}