https://github.com/lancekrogers/tcount

Count tokens of files and directories
ai-tools counter developer-tools llms token-optimization tokens
Last synced: about 2 months ago
Host: GitHub
URL: https://github.com/lancekrogers/tcount
Owner: lancekrogers
License: mit
Created: 2026-01-28T22:32:12.000Z (6 months ago)
Default Branch: main
Last Pushed: 2026-04-19T06:48:28.000Z (3 months ago)
Last Synced: 2026-04-19T08:33:39.113Z (3 months ago)
Topics: ai-tools, counter, developer-tools, llms, token-optimization, tokens
Language: Go
Homepage:
Size: 2.86 MB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
README

# tcount

A fast, zero-network token counter for LLM workflows. Count tokens in files and directories using exact OpenAI tokenizers, Claude approximations, SentencePiece vocabularies, and generic estimation — all from a single CLI.

## Features

- **Exact BPE tokenization** — offline, no network calls. Supports GPT-5, GPT-4.1, GPT-4o, o-series, and legacy GPT-4/3.5.

- **Claude approximation** calibrated for Anthropic models

- **SentencePiece** exact tokenization for Llama and other open-source models (bring your own `.model` file)

- **Context window usage** — see what percentage of a model's context you're consuming

- **Cost estimates** with per-1M-token pricing via `--cost`

- **Provider filtering** — compare models from a specific provider

- **Directory scanning** with `.gitignore` support and binary file detection

- **JSON output** for scripting and pipelines

## Install

### npm / pnpm / bun (macOS & Linux)

```bash
npm install -g @obedience-corp/tcount
# or
pnpm add -g @obedience-corp/tcount
# or
bun add -g @obedience-corp/tcount
```

The npm package downloads the official release binary for your platform (with checksum verification) on first install.

### Homebrew (macOS & Linux)

```bash
brew install lancekrogers/tap/tcount
```

### Go

```bash
go install github.com/lancekrogers/tcount/cmd/tcount@latest
```

### From source

```bash
git clone https://github.com/lancekrogers/tcount.git
cd tcount
go build -o bin/tcount ./cmd/tcount
```

### Binary releases

Pre-built binaries for macOS, Linux, and Windows are available on the [releases page](https://github.com/lancekrogers/tcount/releases).

## Quick Start

```bash
# Count tokens in a file
tcount myfile.txt

# Specific model
tcount --model gpt-5 prompt.md

# All methods with cost estimates
tcount --all --cost prompt.md

# Filter by provider
tcount --provider openai prompt.md

# Recursive directory count
tcount -r ./src

# JSON output for scripting
tcount --json document.md
```

## Supported Models

### OpenAI

| Model | Encoding | Context |
|-------|----------|---------|
| `gpt-5`, `gpt-5-mini`, `gpt-5-nano` | o200k_base | 400K |
| `gpt-5.1`, `gpt-5.2` | o200k_base | 400K |
| `gpt-4.1`, `gpt-4.1-mini`, `gpt-4.1-nano` | o200k_base | 1M |
| `gpt-4o`, `gpt-4o-mini` | o200k_base | 128K |
| `o3`, `o3-mini`, `o4-mini` | o200k_base | 200K |
| `gpt-4`, `gpt-4-turbo` | cl100k_base | 8K–128K |
| `gpt-3.5-turbo` | cl100k_base | 16K |

### Anthropic

| Model | Method | Context |
|-------|--------|---------|
| `claude-opus-4.6`, `claude-opus-4.5` | Approximation | 200K |
| `claude-opus-4.1`, `claude-opus-4` | Approximation | 200K |
| `claude-sonnet-4.6`, `claude-sonnet-4.5`, `claude-sonnet-4` | Approximation | 200K |
| `claude-haiku-4.5`, `claude-haiku-3.5`, `claude-haiku-3` | Approximation | 200K |
| `claude-opus-3` (deprecated) | Approximation | 200K |

### Meta (Llama)

| Model | Method | Context |
|-------|--------|---------|
| `llama-4-scout`, `llama-4-maverick` | tiktoken approx / SentencePiece | 128K |
| `llama-3.1-8b`, `llama-3.1-70b`, `llama-3.1-405b` | tiktoken approx / SentencePiece | 128K |

### DeepSeek

| Model | Method | Context |
|-------|--------|---------|
| `deepseek-v2`, `deepseek-v3`, `deepseek-coder-v2` | tiktoken approx | 128K |

### Alibaba (Qwen)

| Model | Method | Context |
|-------|--------|---------|
| `qwen-2.5-7b`, `qwen-2.5-14b`, `qwen-2.5-72b` | tiktoken approx | 32K |
| `qwen-3-72b` | tiktoken approx | 32K |

### Microsoft (Phi)

| Model | Method | Context |
|-------|--------|---------|
| `phi-3-mini`, `phi-3-small`, `phi-3-medium` | tiktoken approx | 128K |

## Tokenization Methods

| Method | Accuracy | When Used |
|--------|----------|-----------|
| tiktoken (o200k_base) | Exact | GPT-5.x, GPT-4.1, GPT-4o, o3, o4-mini |
| tiktoken (cl100k_base) | Exact | GPT-4, GPT-3.5 |
| Claude approximation | Estimated | All Claude models (chars ÷ 3.8) |
| SentencePiece | Exact | Llama with `--vocab-file` |
| tiktoken approximation | Approximate | Llama, DeepSeek, Qwen, Phi (no vocab file) |
| Character-based | Approximate | Any (chars ÷ configurable ratio, default 4.0) |
| Word-based | Approximate | Any (words × configurable multiplier, default 1.33) |
| Whitespace split | Approximate | Any (raw word count as lower bound) |
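
The three generic estimators in the table reduce to simple arithmetic. The following is a minimal sketch, not tcount's actual implementation: the function names are hypothetical, characters are assumed to be counted as Unicode code points, and truncating the result toward zero is an assumption inferred from the sample reports in this README (for example, 5451 ÷ 4.0 shown as 1362).

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// estimateByChars divides the character (rune) count by charsPerToken.
// Truncation toward zero is an assumption, not confirmed tcount behavior.
func estimateByChars(text string, charsPerToken float64) int {
	return int(float64(utf8.RuneCountInString(text)) / charsPerToken)
}

// estimateByWords divides the whitespace-separated word count by
// wordsPerToken; 0.75 words per token is the same as a 1.33 multiplier.
func estimateByWords(text string, wordsPerToken float64) int {
	return int(float64(len(strings.Fields(text))) / wordsPerToken)
}

// whitespaceSplit returns the raw word count.
func whitespaceSplit(text string) int {
	return len(strings.Fields(text))
}

func main() {
	text := "Count tokens in files and directories from a single CLI."
	fmt.Println("character-based (÷4.0):", estimateByChars(text, 4.0))
	fmt.Println("claude-style    (÷3.8):", estimateByChars(text, 3.8))
	fmt.Println("word-based     (÷0.75):", estimateByWords(text, 0.75))
	fmt.Println("whitespace split:      ", whitespaceSplit(text))
}
```

Passing 3.8 instead of 4.0 to the character-based estimator gives the ratio the table lists for the Claude approximation.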

## Usage

```
tcount [file|directory] [flags]
```

### Flags

| Flag | Short | Description |
|------|-------|-------------|
| `--model` | | Specific model tokenizer |
| `--models` | `-m` | Show encoding-to-model lookup table |
| `--provider` | | Filter by provider: `openai`, `anthropic`, `meta`, `deepseek`, `alibaba`, `microsoft`, `all` |
| `--vocab-file` | | Path to SentencePiece `.model` file for exact Llama tokenization |
| `--all` | | Show all counting methods |
| `--json` | | JSON output |
| `--cost` | | Include cost estimates (per 1M tokens) |
| `--recursive` | `-r` | Recursively count files in a directory |
| `--directory` | `-d` | Alias for `--recursive` |
| `--chars-per-token` | | Character/token ratio for approximation (default: 4.0) |
| `--words-per-token` | | Words/token ratio for approximation (default: 0.75, equivalent to the ×1.33 word multiplier) |
| `--verbose` | | Show additional details |
| `--no-color` | | Disable color output |

## Examples

### Single model

```
$ tcount --model gpt-5 document.md

Token Count Report for: document.md
═══════════════════════════════════════════════════════

Basic Statistics:
  Characters:     5451
  Words:          662
  Lines:          222

Token Counts by Method:
  ┌─────────────────────────┬──────────┬────────────┬──────────────────┐
  │ Method                  │ Tokens   │ Accuracy   │ Context Usage    │
  ├─────────────────────────┼──────────┼────────────┼──────────────────┤
  │ GPT (gpt-5)             │ 1445     │ Exact      │ 0.7% of 200K     │
  └─────────────────────────┴──────────┴────────────┴──────────────────┘
```
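
The Context Usage column is the token count expressed as a percentage of the model's context window. A minimal sketch of that calculation, using the numbers from the sample report above; the function names are hypothetical and the one-decimal formatting is an assumption based on the sample output:

```go
package main

import "fmt"

// contextUsage returns the share of a context window, in percent,
// consumed by the given number of tokens.
func contextUsage(tokens, window int) float64 {
	if window <= 0 {
		return 0
	}
	return float64(tokens) / float64(window) * 100
}

// formatUsage renders the value in the style of the report column,
// e.g. "0.7% of 200K". The format string is an assumption.
func formatUsage(tokens, window int) string {
	return fmt.Sprintf("%.1f%% of %dK", contextUsage(tokens, window), window/1000)
}

func main() {
	fmt.Println(formatUsage(1445, 200000)) // prints "0.7% of 200K"
	fmt.Println(formatUsage(1445, 128000)) // prints "1.1% of 128K"
}
```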

### All methods with costs

```
$ tcount --all --cost document.md

Token Count Report for: document.md
═══════════════════════════════════════════════════════

Basic Statistics:
  Characters:     5451
  Words:          662
  Lines:          222

Token Counts by Method:
  ┌─────────────────────────┬──────────┬────────────┬──────────────────┐
  │ Method                  │ Tokens   │ Accuracy   │ Context Usage    │
  ├─────────────────────────┼──────────┼────────────┼──────────────────┤
  │ GPT (gpt-5)             │ 1445     │ Exact      │ 0.7% of 200K     │
  │ GPT (gpt-4o)            │ 1445     │ Exact      │ 1.1% of 128K     │
  │ Claude (approx)         │ 1434     │ Estimated  │ 0.7% of 200K     │
  │ Llama (llama-3.1-8b)    │ 1445     │ Exact      │ 1.1% of 128K     │
  │ Character-based (÷4.0)  │ 1362     │ Approx     │                  │
  │ Word-based (×1.33)      │ 882      │ Approx     │                  │
  │ Whitespace split        │ 662      │ Approx     │                  │
  └─────────────────────────┴──────────┴────────────┴──────────────────┘

Cost Estimates (Input):
  gpt-5:           $0.0018 ($1.25/1M tokens)
  gpt-4o:          $0.0036 ($2.50/1M tokens)
  claude-sonnet-4.6: $0.0043 ($3.00/1M tokens)
  claude-sonnet-4.5: $0.0043 ($3.00/1M tokens)
```
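
Each cost line is the token count multiplied by the per-1M-token input price. The sketch below reproduces the figures from the report above; the function names are hypothetical, and the prices are copied from the sample output rather than from any pricing table inside tcount:

```go
package main

import "fmt"

// inputCost converts a token count and a USD price per 1M tokens into a cost.
func inputCost(tokens int, pricePerMillion float64) float64 {
	return float64(tokens) * pricePerMillion / 1_000_000
}

// formatCost renders the cost with four decimals, as in the sample report.
func formatCost(tokens int, pricePerMillion float64) string {
	return fmt.Sprintf("$%.4f", inputCost(tokens, pricePerMillion))
}

func main() {
	fmt.Println("gpt-5:            ", formatCost(1445, 1.25)) // $0.0018
	fmt.Println("gpt-4o:           ", formatCost(1445, 2.50)) // $0.0036
	fmt.Println("claude-sonnet-4.6:", formatCost(1434, 3.00)) // $0.0043
}
```

The Claude line uses the approximate count (1434) rather than the GPT count, which is why it is not simply a multiple of the gpt-5 cost.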

### SentencePiece for exact Llama tokenization

```bash
# Download tokenizer.model from HuggingFace (requires auth):
# https://huggingface.co/meta-llama/Llama-3.1-8B/blob/main/original/tokenizer.model

tcount --model llama-3.1-8b --vocab-file /path/to/tokenizer.model document.md
```

Without `--vocab-file`, Llama models use a tiktoken-based approximation.

### Directory scanning

```
$ tcount -r --verbose tokenizer/

Found 4 text files (skipped 0 binary, 0 ignored)

Token Count Report for: tokenizer/ (directory)
═══════════════════════════════════════════════════════

Basic Statistics:
  Files:          4
  Characters:     14929
  Words:          1906
  Lines:          612

Token Counts by Method:
  ┌─────────────────────────┬──────────┬────────────┬──────────────────┐
  │ Method                  │ Tokens   │ Accuracy   │ Context Usage    │
  ├─────────────────────────┼──────────┼────────────┼──────────────────┤
  │ GPT (gpt-5)             │ 4206     │ Exact      │ 2.1% of 200K     │
  │ Claude (approx)         │ 3928     │ Estimated  │ 2.0% of 200K     │
  │ Character-based (÷4.0)  │ 3732     │ Approx     │                  │
  │ Word-based (×1.33)      │ 2541     │ Approx     │                  │
  │ Whitespace split        │ 1906     │ Approx     │                  │
  └─────────────────────────┴──────────┴────────────┴──────────────────┘
```

When scanning directories, tcount respects `.gitignore` rules, skips binary files and `.git` directories, and aggregates all text files into a combined count. Use `--verbose` to see file and skip statistics.
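
To illustrate the kind of walk described above, here is a rough sketch using only the Go standard library. It is not tcount's code: `scanDir`, `scanStats`, and `looksBinary` are hypothetical names, the NUL-byte check is an assumed heuristic (tcount's actual binary detection may differ), and `.gitignore` handling is left out entirely.

```go
package main

import (
	"bytes"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// scanStats holds counters similar to those printed with --verbose.
type scanStats struct {
	TextFiles int
	Binary    int
}

// looksBinary is an assumed heuristic: a NUL byte within the first
// 8000 bytes marks the content as binary.
func looksBinary(data []byte) bool {
	if len(data) > 8000 {
		data = data[:8000]
	}
	return bytes.IndexByte(data, 0) >= 0
}

// scanDir walks root, skips .git directories, and classifies every
// regular file as text or binary. It does not read .gitignore files.
func scanDir(root string) (scanStats, error) {
	var stats scanStats
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			if d.Name() == ".git" {
				return filepath.SkipDir
			}
			return nil
		}
		if !d.Type().IsRegular() {
			return nil
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		if looksBinary(data) {
			stats.Binary++
		} else {
			stats.TextFiles++
		}
		return nil
	})
	return stats, err
}

func main() {
	stats, err := scanDir(".")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("Found %d text files (skipped %d binary)\n", stats.TextFiles, stats.Binary)
}
```

Reading whole files keeps the sketch short; a real scanner would more likely read only a leading chunk before deciding whether to tokenize the file.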

### JSON output

```
$ tcount --json --model gpt-5 document.md
{
  "file_path": "document.md",
  "file_size": 5451,
  "characters": 5451,
  "words": 662,
  "lines": 222,
  "methods": [
    {
      "name": "tiktoken_gpt_5",
      "display_name": "GPT (gpt-5)",
      "tokens": 1445,
      "is_exact": true,
      "context_window": 200000
    }
  ]
}
```

```bash
# Extract a specific count
tcount --json myfile.txt | jq '.methods[] | select(.name == "tiktoken_gpt_5") | .tokens'

# Batch count all markdown files
for f in docs/*.md; do tcount --json "$f"; done | jq -s '.'
```
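
The same JSON can be consumed from Go without `jq`. The sketch below decodes the shape shown in the sample output above using `encoding/json`; the `Report` and `Method` types and the `tokensFor` helper are hypothetical names that mirror only the fields visible in that sample, not types exported by tcount.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// Method mirrors one entry of the "methods" array in the sample output.
type Method struct {
	Name          string `json:"name"`
	DisplayName   string `json:"display_name"`
	Tokens        int    `json:"tokens"`
	IsExact       bool   `json:"is_exact"`
	ContextWindow int    `json:"context_window"`
}

// Report mirrors the top-level fields in the sample output.
type Report struct {
	FilePath   string   `json:"file_path"`
	FileSize   int      `json:"file_size"`
	Characters int      `json:"characters"`
	Words      int      `json:"words"`
	Lines      int      `json:"lines"`
	Methods    []Method `json:"methods"`
}

// tokensFor decodes a report and returns the token count for one method.
func tokensFor(raw []byte, name string) (int, error) {
	var r Report
	if err := json.Unmarshal(raw, &r); err != nil {
		return 0, err
	}
	for _, m := range r.Methods {
		if m.Name == name {
			return m.Tokens, nil
		}
	}
	return 0, fmt.Errorf("method %q not found", name)
}

func main() {
	// Abbreviated copy of the sample output shown above.
	sample := []byte(`{
	  "file_path": "document.md",
	  "methods": [
	    {"name": "tiktoken_gpt_5", "display_name": "GPT (gpt-5)",
	     "tokens": 1445, "is_exact": true, "context_window": 200000}
	  ]
	}`)
	tokens, err := tokensFor(sample, "tiktoken_gpt_5")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("tokens:", tokens) // prints "tokens: 1445"
}
```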

## Library Usage

tcount can be used as a Go library in your own projects.

### Installation

```bash
go get github.com/lancekrogers/tcount/tokenizer
```

### Basic Token Counting

```go
package main

import (
    "context"
    "fmt"
    "log"

    "github.com/lancekrogers/tcount/tokenizer"
)

func main() {
    counter, err := tokenizer.NewCounter(tokenizer.CounterOptions{})
    if err != nil {
        log.Fatal(err)
    }

    ctx := context.Background()
    result, err := counter.Count(ctx, "Hello, world!", "gpt-4o", false)
    if err != nil {
        log.Fatal(err)
    }

    for _, m := range result.Methods {
        if m.IsExact {
            fmt.Printf("Tokens: %d (exact, %s)\n", m.Tokens, m.DisplayName)
        }
    }
}
```

### File and Directory Counting

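```go
// counter comes from tokenizer.NewCounter, as in the previous example.
ctx := context.Background()

// Count tokens in a single file
fileResult, err := counter.CountFile(ctx, "document.md", "gpt-4o", false)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Tokens: %d\n", fileResult.Methods[0].Tokens)

// Count tokens across a directory (respects .gitignore, skips binaries)
dirResult, err := counter.CountDirectory(ctx, "./src", "", true)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Files: %d, Tokens: %d\n", dirResult.FileCount, dirResult.Methods[0].Tokens)
```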

### Direct BPE Tokenizer Access

```go
tok, err := tokenizer.NewBPETokenizer("gpt-4o")
if err != nil {
    log.Fatal(err)
}

count, _ := tok.CountTokens("Hello, world!")
fmt.Printf("Tokens: %d, Exact: %v\n", count, tok.IsExact())
```

### Model Discovery

```go
// Get metadata for a specific model
meta := tokenizer.GetModelMetadata("gpt-4o")
fmt.Printf("Encoding: %s, Context: %d\n", meta.Encoding, meta.ContextWindow)

// List all registered models
models := tokenizer.ListModels()

// List models by provider
openaiModels := tokenizer.ListModelsByProvider(tokenizer.ProviderOpenAI)
```

### Cost Estimation

```go
ctx := context.Background()
result, _ := counter.Count(ctx, text, "gpt-4o", false)

costs := tokenizer.CalculateCosts(result.Methods)
for _, c := range costs {
    fmt.Printf("%s: $%.4f\n", c.Model, c.Cost)
}
```

## Development

Requires [just](https://github.com/casey/just) for the build system.

```bash
just                       # List all recipes
just build                 # Build (with fmt + vet)
just test all              # Run all tests
just test unit             # Unit tests only
just test integration      # Integration tests only
just test coverage         # Coverage report
just test bench            # Benchmarks
just release all           # Cross-compile for all platforms
```

## License

MIT License. See [LICENSE](LICENSE) for details.