# Tokenizer

A CLI tool that computes token counts for various file types (txt, md, pdf, html) across different LLM models.

## Features

- Calculate token counts for various file types (Text, Markdown, PDF, HTML)
- Support for multiple LLM models (configurable via config.json)
- Display token usage as a percentage of the model's context window
- Powered by HuggingFace tokenizers library

## Installation

Clone the repository and build the project:

```bash
git clone https://github.com/deltartificial/tokenizer.git
cd tokenizer
cargo build --release
```

## Usage

```bash
# Count tokens in a file using the default config.json
./target/release/tokenizer count path/to/your/file.txt

# Count tokens using a custom config file
./target/release/tokenizer count path/to/your/file.txt -c custom-config.json

# Count tokens using a specific tokenizer model
./target/release/tokenizer count path/to/your/file.html -t roberta-base
```

## Configuration

The tool uses a `config.json` file to define models and their context lengths. The default file includes configurations for various models:

```json
{
  "models": [
    {
      "name": "gpt-3.5-turbo",
      "context_length": 16385,
      "encoding": "tiktoken"
    },
    {
      "name": "gpt-4",
      "context_length": 8192,
      "encoding": "tiktoken"
    },
    {
      "name": "bert-base",
      "context_length": 512,
      "encoding": "bert"
    },
    ...
  ]
}
```

You can customize this file to add or modify models as needed.
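
For reference, this is roughly how such a file could be deserialized in Rust with `serde`/`serde_json` (a sketch only, assuming `serde` with the `derive` feature and `serde_json` as dependencies; the struct names are illustrative and may not match the project's own types):

```rust
use serde::Deserialize;

// Mirrors the shape of config.json shown above.
#[derive(Debug, Deserialize)]
struct Config {
    models: Vec<ModelConfig>,
}

#[derive(Debug, Deserialize)]
struct ModelConfig {
    name: String,
    context_length: usize,
    encoding: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read and parse the config file from the working directory.
    let raw = std::fs::read_to_string("config.json")?;
    let config: Config = serde_json::from_str(&raw)?;
    for model in &config.models {
        println!("{} ({} tokens, {})", model.name, model.context_length, model.encoding);
    }
    Ok(())
}
```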

## Tokenization

This tool uses HuggingFace's tokenizers library, which provides high-performance implementations of various tokenization algorithms. The default tokenizer is BERT, but the architecture is designed to be easily extended with other tokenizers.
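
As an illustration of what token counting with the `tokenizers` crate looks like (a sketch, not the project's actual code; it assumes the crate's `http` feature is enabled so pretrained vocabularies can be fetched from the Hub):

```rust
use tokenizers::Tokenizer;

// Count tokens in a piece of text with a pretrained HuggingFace tokenizer.
fn count_tokens(text: &str) -> Result<usize, Box<dyn std::error::Error + Send + Sync>> {
    // "bert-base-uncased" matches the BERT default mentioned above; any
    // other Hub identifier could be substituted here.
    let tokenizer = Tokenizer::from_pretrained("bert-base-uncased", None)?;
    let encoding = tokenizer.encode(text, false)?;
    Ok(encoding.get_ids().len())
}

fn main() {
    match count_tokens("Compute token counts for long files, built in Rust.") {
        Ok(n) => println!("{n} tokens"),
        Err(e) => eprintln!("tokenization failed: {e}"),
    }
}
```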

## Supported File Types

- `.txt` - Plain text files
- `.md` - Markdown files
- `.pdf` - PDF documents (basic implementation)
- `.html`/`.htm` - HTML files (tags are stripped for token counting; see the sketch below)
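
The tag stripping mentioned above could, for example, be done with a small state machine like the following (illustrative only; the project's actual implementation may differ or rely on an HTML parser):

```rust
// Drop everything between '<' and '>' and keep the remaining text.
fn strip_tags(html: &str) -> String {
    let mut out = String::with_capacity(html.len());
    let mut in_tag = false;
    for ch in html.chars() {
        match ch {
            '<' => in_tag = true,
            '>' => in_tag = false,
            c if !in_tag => out.push(c),
            _ => {}
        }
    }
    out
}

fn main() {
    assert_eq!(strip_tags("<p>Hello <b>world</b></p>"), "Hello world");
    println!("{}", strip_tags("<h1>Tokenizer</h1>"));
}
```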

## Project Structure

The project follows a clean architecture approach (see the sketch after the list):

- `domain`: Core business logic and entities
- `application`: Use cases that orchestrate the domain logic
- `infrastructure`: External services implementation (file reading, tokenization)
- `presentation`: User interface (CLI)
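
A hypothetical sketch of how such a layer boundary might look in Rust (the trait and struct names are illustrative, not taken from the project's source):

```rust
// domain: a port that the rest of the code depends on.
trait TokenCounter {
    fn count(&self, text: &str) -> usize;
}

// infrastructure: one concrete adapter behind the port.
struct WhitespaceCounter;

impl TokenCounter for WhitespaceCounter {
    fn count(&self, text: &str) -> usize {
        text.split_whitespace().count()
    }
}

// application: a use case that only knows about the domain trait.
fn report_usage(counter: &dyn TokenCounter, text: &str, context_length: usize) -> f64 {
    counter.count(text) as f64 / context_length as f64 * 100.0
}

fn main() {
    let pct = report_usage(&WhitespaceCounter, "hello world", 8192);
    println!("{pct:.4}% of the context window");
}
```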

## License

MIT