https://github.com/lancekrogers/tcount
Count tokens of files and directories
- Host: GitHub
- URL: https://github.com/lancekrogers/tcount
- Owner: lancekrogers
- License: mit
- Created: 2026-01-28T22:32:12.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2026-04-19T06:48:28.000Z (about 2 months ago)
- Last Synced: 2026-04-19T08:33:39.113Z (about 2 months ago)
- Topics: ai-tools, counter, developer-tools, llms, token-optimization, tokens
- Language: Go
- Homepage:
- Size: 2.86 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
# tcount
A fast, zero-network token counter for LLM workflows. Count tokens in files and directories using exact OpenAI tokenizers, Claude approximations, SentencePiece vocabularies, and generic estimation — all from a single CLI.
## Features
- **Exact BPE tokenization** — offline, no network calls. Supports GPT-5, GPT-4.1, GPT-4o, o-series, and legacy GPT-4/3.5.
- **Claude approximation** calibrated for Anthropic models
- **SentencePiece** exact tokenization for Llama and other open-source models (bring your own `.model` file)
- **Context window usage** — see what percentage of a model's context you're consuming
- **Cost estimates** with per-1M-token pricing via `--cost`
- **Provider filtering** — compare models from a specific provider
- **Directory scanning** with `.gitignore` support and binary file detection
- **JSON output** for scripting and pipelines
## Install
### npm / pnpm / bun (macOS & Linux)
```bash
npm install -g @obedience-corp/tcount
# or
pnpm add -g @obedience-corp/tcount
# or
bun add -g @obedience-corp/tcount
```
On installation, the npm package downloads the official release binary for your platform and verifies its checksum.
### Homebrew (macOS & Linux)
```bash
brew install lancekrogers/tap/tcount
```
### Go
```bash
go install github.com/lancekrogers/tcount/cmd/tcount@latest
```
### From source
```bash
git clone https://github.com/lancekrogers/tcount.git
cd tcount
go build -o bin/tcount ./cmd/tcount
```
### Binary releases
Pre-built binaries for macOS, Linux, and Windows are available on the [releases page](https://github.com/lancekrogers/tcount/releases).
## Quick Start
```bash
# Count tokens in a file
tcount myfile.txt
# Specific model
tcount --model gpt-5 prompt.md
# All methods with cost estimates
tcount --all --cost prompt.md
# Filter by provider
tcount --provider openai prompt.md
# Recursive directory count
tcount -r ./src
# JSON output for scripting
tcount --json document.md
```
## Supported Models
### OpenAI
| Model | Encoding | Context |
|-------|----------|---------|
| `gpt-5`, `gpt-5-mini`, `gpt-5-nano` | o200k_base | 400K |
| `gpt-5.1`, `gpt-5.2` | o200k_base | 400K |
| `gpt-4.1`, `gpt-4.1-mini`, `gpt-4.1-nano` | o200k_base | 1M |
| `gpt-4o`, `gpt-4o-mini` | o200k_base | 128K |
| `o3`, `o3-mini`, `o4-mini` | o200k_base | 200K |
| `gpt-4`, `gpt-4-turbo` | cl100k_base | 8K–128K |
| `gpt-3.5-turbo` | cl100k_base | 16K |
### Anthropic
| Model | Method | Context |
|-------|--------|---------|
| `claude-opus-4.6`, `claude-opus-4.5` | Approximation | 200K |
| `claude-opus-4.1`, `claude-opus-4` | Approximation | 200K |
| `claude-sonnet-4.6`, `claude-sonnet-4.5`, `claude-sonnet-4` | Approximation | 200K |
| `claude-haiku-4.5`, `claude-haiku-3.5`, `claude-haiku-3` | Approximation | 200K |
| `claude-opus-3` (deprecated) | Approximation | 200K |
### Meta (Llama)
| Model | Method | Context |
|-------|--------|---------|
| `llama-4-scout`, `llama-4-maverick` | tiktoken approx / SentencePiece | 128K |
| `llama-3.1-8b`, `llama-3.1-70b`, `llama-3.1-405b` | tiktoken approx / SentencePiece | 128K |
### DeepSeek
| Model | Method | Context |
|-------|--------|---------|
| `deepseek-v2`, `deepseek-v3`, `deepseek-coder-v2` | tiktoken approx | 128K |
### Alibaba (Qwen)
| Model | Method | Context |
|-------|--------|---------|
| `qwen-2.5-7b`, `qwen-2.5-14b`, `qwen-2.5-72b` | tiktoken approx | 32K |
| `qwen-3-72b` | tiktoken approx | 32K |
### Microsoft (Phi)
| Model | Method | Context |
|-------|--------|---------|
| `phi-3-mini`, `phi-3-small`, `phi-3-medium` | tiktoken approx | 128K |
## Tokenization Methods
| Method | Accuracy | When Used |
|--------|----------|-----------|
| tiktoken (o200k_base) | Exact | GPT-5.x, GPT-4.1, GPT-4o, o3, o4-mini |
| tiktoken (cl100k_base) | Exact | GPT-4, GPT-3.5 |
| Claude approximation | Estimated | All Claude models (characters ÷ 3.8) |
| SentencePiece | Exact | Llama with `--vocab-file` |
| tiktoken approximation | Approximate | Llama, DeepSeek, Qwen, Phi (no vocab file) |
| Character-based | Approximate | Any (chars ÷ configurable ratio, default 4.0) |
| Word-based | Approximate | Any (words ÷ configurable words-per-token ratio, default 0.75, i.e. words × 1.33) |
| Whitespace split | Approximate | Any (raw word count as lower bound) |
## Usage
```
tcount [file|directory] [flags]
```
### Flags
| Flag | Short | Description |
|------|-------|-------------|
| `--model` | | Specific model tokenizer |
| `--models` | `-m` | Show encoding-to-model lookup table |
| `--provider` | | Filter by provider: `openai`, `anthropic`, `meta`, `deepseek`, `alibaba`, `microsoft`, `all` |
| `--vocab-file` | | Path to SentencePiece `.model` file for exact Llama tokenization |
| `--all` | | Show all counting methods |
| `--json` | | JSON output |
| `--cost` | | Include cost estimates (per 1M tokens) |
| `--recursive` | `-r` | Recursively count files in a directory |
| `--directory` | `-d` | Alias for `--recursive` |
| `--chars-per-token` | | Characters per token for the character-based estimate (default: 4.0) |
| `--words-per-token` | | Words per token for the word-based estimate (default: 0.75, equivalent to words × 1.33) |
| `--verbose` | | Show additional details |
| `--no-color` | | Disable color output |
## Examples
### Single model
```
$ tcount --model gpt-5 document.md
Token Count Report for: document.md
═══════════════════════════════════════════════════════
Basic Statistics:
Characters: 5451
Words: 662
Lines: 222
Token Counts by Method:
┌─────────────────────────┬──────────┬────────────┬──────────────────┐
│ Method                  │ Tokens   │ Accuracy   │ Context Usage    │
├─────────────────────────┼──────────┼────────────┼──────────────────┤
│ GPT (gpt-5)             │ 1445     │ Exact      │ 0.7% of 200K     │
└─────────────────────────┴──────────┴────────────┴──────────────────┘
```
### All methods with costs
```
$ tcount --all --cost document.md
Token Count Report for: document.md
═══════════════════════════════════════════════════════
Basic Statistics:
Characters: 5451
Words: 662
Lines: 222
Token Counts by Method:
┌─────────────────────────┬──────────┬────────────┬──────────────────┐
│ Method                  │ Tokens   │ Accuracy   │ Context Usage    │
├─────────────────────────┼──────────┼────────────┼──────────────────┤
│ GPT (gpt-5)             │ 1445     │ Exact      │ 0.7% of 200K     │
│ GPT (gpt-4o)            │ 1445     │ Exact      │ 1.1% of 128K     │
│ Claude (approx)         │ 1434     │ Estimated  │ 0.7% of 200K     │
│ Llama (llama-3.1-8b)    │ 1445     │ Exact      │ 1.1% of 128K     │
│ Character-based (÷4.0)  │ 1362     │ Approx     │                  │
│ Word-based (×1.33)      │ 882      │ Approx     │                  │
│ Whitespace split        │ 662      │ Approx     │                  │
└─────────────────────────┴──────────┴────────────┴──────────────────┘
Cost Estimates (Input):
gpt-5: $0.0018 ($1.25/1M tokens)
gpt-4o: $0.0036 ($2.50/1M tokens)
claude-sonnet-4.6: $0.0043 ($3.00/1M tokens)
claude-sonnet-4.5: $0.0043 ($3.00/1M tokens)
```
### SentencePiece for exact Llama tokenization
```bash
# Download tokenizer.model from HuggingFace (requires auth):
# https://huggingface.co/meta-llama/Llama-3.1-8B/blob/main/original/tokenizer.model
tcount --model llama-3.1-8b --vocab-file /path/to/tokenizer.model document.md
```
Without `--vocab-file`, Llama models use a tiktoken-based approximation.
### Directory scanning
```
$ tcount -r --verbose tokenizer/
Found 4 text files (skipped 0 binary, 0 ignored)
Token Count Report for: tokenizer/ (directory)
═══════════════════════════════════════════════════════
Basic Statistics:
Files: 4
Characters: 14929
Words: 1906
Lines: 612
Token Counts by Method:
┌─────────────────────────┬──────────┬────────────┬──────────────────┐
│ Method                  │ Tokens   │ Accuracy   │ Context Usage    │
├─────────────────────────┼──────────┼────────────┼──────────────────┤
│ GPT (gpt-5)             │ 4206     │ Exact      │ 2.1% of 200K     │
│ Claude (approx)         │ 3928     │ Estimated  │ 2.0% of 200K     │
│ Character-based (÷4.0)  │ 3732     │ Approx     │                  │
│ Word-based (×1.33)      │ 2541     │ Approx     │                  │
│ Whitespace split        │ 1906     │ Approx     │                  │
└─────────────────────────┴──────────┴────────────┴──────────────────┘
```
When scanning directories, tcount respects `.gitignore` rules, skips binary files and `.git` directories, and aggregates all text files into a combined count. Use `--verbose` to see file and skip statistics.
### JSON output
```
$ tcount --json --model gpt-5 document.md
{
  "file_path": "document.md",
  "file_size": 5451,
  "characters": 5451,
  "words": 662,
  "lines": 222,
  "methods": [
    {
      "name": "tiktoken_gpt_5",
      "display_name": "GPT (gpt-5)",
      "tokens": 1445,
      "is_exact": true,
      "context_window": 200000
    }
  ]
}
```
```bash
# Extract a specific count
tcount --json myfile.txt | jq '.methods[] | select(.name == "tiktoken_gpt_5") | .tokens'
# Batch count all markdown files
for f in docs/*.md; do tcount --json "$f"; done | jq -s '.'
```
## Library Usage
tcount can be used as a Go library in your own projects.
### Installation
```bash
go get github.com/lancekrogers/tcount/tokenizer
```
### Basic Token Counting
```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/lancekrogers/tcount/tokenizer"
)

func main() {
	counter, err := tokenizer.NewCounter(tokenizer.CounterOptions{})
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	result, err := counter.Count(ctx, "Hello, world!", "gpt-4o", false)
	if err != nil {
		log.Fatal(err)
	}

	for _, m := range result.Methods {
		if m.IsExact {
			fmt.Printf("Tokens: %d (exact, %s)\n", m.Tokens, m.DisplayName)
		}
	}
}
```
### File and Directory Counting
```go
ctx := context.Background()
// Count tokens in a single file
fileResult, err := counter.CountFile(ctx, "document.md", "gpt-4o", false)
if err != nil {
	log.Fatal(err)
}
// Count tokens across a directory (respects .gitignore, skips binaries)
dirResult, err := counter.CountDirectory(ctx, "./src", "", true)
if err != nil {
	log.Fatal(err)
}
fmt.Printf("File: %d tokens; directory: %d files, %d tokens\n", fileResult.Methods[0].Tokens, dirResult.FileCount, dirResult.Methods[0].Tokens)
```
### Direct BPE Tokenizer Access
```go
tok, err := tokenizer.NewBPETokenizer("gpt-4o")
if err != nil {
	log.Fatal(err)
}
count, _ := tok.CountTokens("Hello, world!")
fmt.Printf("Tokens: %d, Exact: %v\n", count, tok.IsExact())
```
### Model Discovery
```go
// Get metadata for a specific model
meta := tokenizer.GetModelMetadata("gpt-4o")
fmt.Printf("Encoding: %s, Context: %d\n", meta.Encoding, meta.ContextWindow)
// List all registered models
models := tokenizer.ListModels()
// List models by provider
openaiModels := tokenizer.ListModelsByProvider(tokenizer.ProviderOpenAI)
```
### Cost Estimation
```go
ctx := context.Background()
result, _ := counter.Count(ctx, text, "gpt-4o", false)
costs := tokenizer.CalculateCosts(result.Methods)
for _, c := range costs {
	fmt.Printf("%s: $%.4f\n", c.Model, c.Cost)
}
```
## Development
Requires [just](https://github.com/casey/just) for the build system.
```bash
just # List all recipes
just build # Build (with fmt + vet)
just test all # Run all tests
just test unit # Unit tests only
just test integration # Integration tests only
just test coverage # Coverage report
just test bench # Benchmarks
just release all # Cross-compile for all platforms
```
## License
MIT License. See [LICENSE](LICENSE) for details.