https://github.com/maledorak/single-token-words

List of single token words for LLM usage

llm openai tiktoken tokenizer

Host: GitHub
URL: https://github.com/maledorak/single-token-words
Owner: maledorak
License: mit
Created: 2024-11-12T22:32:56.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-12-12T18:35:05.000Z (over 1 year ago)
Last Synced: 2025-10-04T10:55:49.726Z (9 months ago)
Topics: llm, openai, tiktoken, tokenizer
Language: Python
Homepage:
Size: 2.35 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Single token words and names

This project finds all the words and English first names that are encoded as a single token by different LLM tokenizers.

This is useful when you want to map large chunks of text to short placeholders before sending them to an LLM.

## Why do we need this?

Sometimes you need to send a large structured text to an LLM, such as JSON or HTML.

In such cases you often don't need to send every value verbatim: IDs, HTML classes, URLs, and similar values usually cost many tokens each. Sending them as-is can even hurt both your wallet and LLM performance!

Instead, you can map these values to single-token words.
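A minimal sketch of this mapping idea, assuming a small hard-coded word list in place of the project's output files. The `ValueMapper` class and its method names are hypothetical and not part of this repository. It replaces each distinct long value with one word and keeps a reverse lookup so that words in the LLM's answer can be translated back.

```python
from typing import Dict, Iterable, Iterator


class ValueMapper:
    """Maps long values (IDs, URLs, ...) to short words and back."""

    def __init__(self, words: Iterable[str]):
        self._free: Iterator[str] = iter(words)
        self._to_word: Dict[str, str] = {}
        self._to_value: Dict[str, str] = {}

    def shorten(self, value: str) -> str:
        # Reuse the same word when a value repeats.
        if value not in self._to_word:
            word = next(self._free)  # raises StopIteration if the words run out
            self._to_word[value] = word
            self._to_value[word] = value
        return self._to_word[value]

    def restore(self, word: str) -> str:
        return self._to_value[word]


# Placeholder words (an assumption); in practice they would come from the generated files.
mapper = ValueMapper(["apple", "river", "stone"])
records = [
    {"id": "3f2b9c1e-7a44-4d0b-9b1e-2c5d8e6f0a11", "title": "First"},
    {"id": "a81d4fae-7dec-11d0-a765-00a0c91e6bf6", "title": "Second"},
]
# Swap each long ID for a short word before building the prompt.
lite = [{**r, "id": mapper.shorten(r["id"])} for r in records]
print(lite)
print(mapper.restore(lite[1]["id"]))
```

Only the fields that carry no meaning for the model (here `id`) are replaced; the rest of the record is sent unchanged.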

### Examples

The examples below were measured with the [OpenAI Tokenizer](https://platform.openai.com/tokenizer).

#### JSON

**Note:** This is the token count of the full data:

![Json with full data](./assets/example-json-full.png)

**Note:** And this is the token count of the same data, but with mapped words:

![Json with lite data](./assets/example-json-lite.png)

**Note:** You can see that the token count is much lower. At the scale of thousands of requests, this can save you a lot of money!

## How to use

Just copy the output files from the [single_token_words](single_token_words) folder into your project and use them.

There are JSON and CSV versions of the files.

In the [single_token_words_info](single_token_words_info) folder you can find some info about the words.

### Python

You can write a simple class that hands out unique single-token words from the file, and use it to map your data.

```python
import json
from typing import List


class SingleTokenWords:
    def __init__(self):
        # A set guarantees that each word is handed out only once.
        self._words = set(self._load_words())

    def _load_words(self) -> List[str]:
        with open('single_token_words.json', 'r') as file:
            return json.load(file)

    def get_word(self) -> str:
        # Removes and returns an arbitrary unused word.
        return self._words.pop()
```

The same approach works for names:

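```python
import json
from typing import List


class SingleTokenNames:
    def __init__(self):
        # A set guarantees that each name is handed out only once.
        self._names = set(self._load_names())

    def _load_names(self) -> List[str]:
        with open('single_token_names.json', 'r') as file:
            return json.load(file)

    def get_name(self) -> str:
        # Removes and returns an arbitrary unused name.
        return self._names.pop()
```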
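One thing to keep in mind with the classes above: `set.pop()` returns an arbitrary element, so the order can differ between runs, and it raises `KeyError` once the set is empty. If reproducible mappings matter, a variation like the following could be used. This is only a sketch: `WordPool` and `PoolExhausted` are hypothetical names, and the words are passed in directly rather than loaded from the JSON file so that the example stays self-contained.

```python
from typing import Iterable, List


class PoolExhausted(Exception):
    """Raised when every word in the pool has been handed out."""


class WordPool:
    def __init__(self, words: Iterable[str]):
        # Deduplicate, then sort in reverse so that popping from the end
        # hands out words in ascending order on every run.
        self._words: List[str] = sorted(set(words), reverse=True)

    def get_word(self) -> str:
        if not self._words:
            raise PoolExhausted("no unused single-token words left")
        return self._words.pop()

    def __len__(self) -> int:
        # Number of words that have not been handed out yet.
        return len(self._words)


pool = WordPool(["pear", "apple", "fig", "apple"])
first, second = pool.get_word(), pool.get_word()
print(first, second, len(pool))
```

Checking `len(pool)` before mapping a large document makes it possible to fail early instead of halfway through.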

## Supported languages

### Words

- English - based on [English-Valid-Words](https://github.com/Maximax67/English-Valid-Words) repository

### Names

- English - based on [names-dataset](https://pypi.org/project/names-dataset/) library

## Supported tokenizers

- openai_tiktoken
    - cl100k_base (gpt-4, gpt-3.5-turbo)
    - o200k_base (gpt-4o, gpt-4o-mini, o1)
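To check candidate words against one of these encodings yourself, a filter along the following lines could be used. This is a sketch, not the project's generation script: `single_token_only` is a hypothetical helper that accepts any `encode` callable returning a sequence of token IDs, and the toy encoder exists only to make the example runnable without extra dependencies.

```python
from typing import Callable, Iterable, List, Sequence


def single_token_only(
    words: Iterable[str],
    encode: Callable[[str], Sequence[int]],
) -> List[str]:
    """Keep the words that `encode` turns into exactly one token."""
    return [word for word in words if len(encode(word)) == 1]


def toy_encode(text: str) -> List[int]:
    # Stand-in tokenizer (NOT a real one): one "token" per 4 characters.
    return [i // 4 for i in range(0, len(text), 4)]


candidates = ["tree", "house", "sun", "tokenizer"]
kept = single_token_only(candidates, toy_encode)
print(kept)
```

With the real tokenizer, `tiktoken.get_encoding("cl100k_base").encode` can be passed as `encode`, assuming the `tiktoken` package is installed. Tokenizers of this kind generally encode a word with a leading space differently from the bare word, so it is worth checking the exact form you plan to send.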
