https://github.com/nathom/token-efficiency
Measuring token efficiency across structured data formats
- Host: GitHub
- URL: https://github.com/nathom/token-efficiency
- Owner: nathom
- Created: 2025-10-10T18:46:53.000Z
- Default Branch: main
- Last Pushed: 2025-10-18T22:56:19.000Z
- Last Synced: 2025-10-19T13:34:36.268Z
- Language: Python
- Size: 943 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
# Code for "Comparing Structured Data Formats for LLMs"
Utilities for measuring how efficiently different tokenizers encode structured data and how legible those structures are to large language models.
## Prerequisites
- Install [uv](https://docs.astral.sh/uv/) and ensure it is on your `PATH`.
All subsequent Python invocations should go through `uv run`.
## Command-Line Interface
The CLI is exposed via the `token-efficiency` console script. View the available subcommands with:
```bash
uv run token-efficiency --help
```
### Generate Token Efficiency Data
Create a dataset that compares tokens-per-node across random data shapes, serialization formats, and tokenizers:
```bash
uv run token-efficiency generate \
--output data/token_efficiency.json \
--samples-per-size 4 \
--size 31 --size 63 --size 127
```
Key options:
- `--tokenizer NAME=repo[@revision]` adds or overrides tokenizer definitions.
- `--size N` sets a target node count; repeat the flag to target multiple node counts.
- `--force` discards cached results and regenerates everything.
Generated metadata and samples are stored under `data/` (with cached raw artifacts in `data/cache`).
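To make the tokens-per-node metric concrete, here is a minimal sketch of how such a ratio could be computed for one sample. It is not the repository's implementation: the node definition (every container and every leaf counts as one node) and the regex-based `crude_token_count` stand-in tokenizer are assumptions chosen so the example runs without downloading a real tokenizer.

```python
import json
import re


def count_nodes(value):
    """Count nodes, assuming each container and each leaf value is one node."""
    if isinstance(value, dict):
        return 1 + sum(count_nodes(child) for child in value.values())
    if isinstance(value, list):
        return 1 + sum(count_nodes(child) for child in value)
    return 1


def crude_token_count(text):
    """Hypothetical stand-in tokenizer: one token per word run or punctuation mark."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def tokens_per_node(value, serialize):
    """Serialize the value, count tokens, and divide by the node count."""
    return crude_token_count(serialize(value)) / count_nodes(value)


def minified(value):
    """Serialize as JSON without optional whitespace."""
    return json.dumps(value, separators=(",", ":"))


sample = {"name": "ada", "tags": ["x", "y"], "meta": {"id": 7}}
print(count_nodes(sample), tokens_per_node(sample, minified))
```

Swapping `crude_token_count` for a real tokenizer's encode-and-count call, and `minified` for another serializer, would give one cell of the shape, format, and tokenizer comparison described above.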
### Plot Existing Results
Render static plots for both token efficiency and legibility datasets:
```bash
uv run token-efficiency plot \
--token-efficiency-data data/token_efficiency.json \
--legibility-data data/legibility.json \
--output-dir plots
```
This produces heatmaps and comparison bar charts under `plots/token_efficiency/` and `plots/legibility/`.
### Generate Data And Plot In One Step
If you already have legibility results on disk, generate the token efficiency data and render all plots with a single command:
```bash
uv run token-efficiency generate-and-plot \
--legibility-data data/legibility.json \
--output-dir plots
```
### Run The Legibility Benchmark
Evaluate how well a model reproduces structured outputs that were generated with known node counts:
```bash
export OPENROUTER_API_KEY=...
uv run token-efficiency legibility \
--output data/legibility.json \
--model deepseek/deepseek-chat \
--num-trials 25
```
Additional environment variables:
- `OPENROUTER_HTTP_REFERER` (required by OpenRouter usage policy).
- `OPENROUTER_X_TITLE` (recommended to label your traffic).
CLI flags let you adjust input/output node targets, serialization formats, temperature, timeout, and whether to restrict generated data to terminal values (`--terminals-only`).
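As an illustration of how these environment variables might reach the API, the sketch below builds request headers from them. The mapping is an assumption, not taken from this repository: OpenRouter documents `HTTP-Referer` and `X-Title` request headers, and the hypothetical helper `openrouter_headers` simply guesses that the two optional variables feed those headers.

```python
import os


def openrouter_headers(env=None):
    """Build HTTP headers from the environment (hypothetical helper, assumed mapping)."""
    env = os.environ if env is None else env
    api_key = env.get("OPENROUTER_API_KEY")
    if not api_key:
        raise RuntimeError("OPENROUTER_API_KEY is not set")
    headers = {"Authorization": f"Bearer {api_key}"}
    # Attribution headers are only added when the variables are present.
    referer = env.get("OPENROUTER_HTTP_REFERER")
    if referer:
        headers["HTTP-Referer"] = referer
    title = env.get("OPENROUTER_X_TITLE")
    if title:
        headers["X-Title"] = title
    return headers
```

Calling `openrouter_headers()` before a long benchmark run is a cheap way to fail fast when the API key is missing, rather than discovering it after the first trial.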
### Preview A Benchmark Prompt
Inspect the exact prompt sent to the evaluation model:
```bash
uv run token-efficiency sample-prompt 63 5 --format json_min
```
## Data Layout
- `data/token_efficiency.json`: Measurements aggregated by shape, format, tokenizer, and node count.
- `data/legibility.json`: Accuracy metrics returned from the benchmark runner.
- `plots/`: Exported PNG and SVG visualizations, separated into `token_efficiency/` and `legibility/` folders.
- `resources/system_words.txt`: Word list used to synthesize readable identifiers.
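The schema of the JSON files is not documented here, so the sketch below makes no assumptions about field names. It outlines the shape of whatever a file contains; `describe` and `describe_file` are hypothetical helpers, not part of the package.

```python
import json
from pathlib import Path


def describe(value, depth=0, max_depth=2):
    """Return indented lines outlining the shape of a decoded JSON value."""
    pad = "  " * depth
    if isinstance(value, dict):
        lines = [f"{pad}object with {len(value)} keys"]
        if depth < max_depth:
            for key, child in value.items():
                lines.append(f"{pad}  {key!r}:")
                lines.extend(describe(child, depth + 2, max_depth))
        return lines
    if isinstance(value, list):
        lines = [f"{pad}array with {len(value)} items"]
        # Only the first item is outlined, as a representative element.
        if value and depth < max_depth:
            lines.extend(describe(value[0], depth + 1, max_depth))
        return lines
    return [f"{pad}{type(value).__name__}"]


def describe_file(path):
    """Load a JSON file and outline its structure."""
    return describe(json.loads(Path(path).read_text(encoding="utf-8")))


if __name__ == "__main__":
    target = Path("data/token_efficiency.json")
    if target.exists():
        print("\n".join(describe_file(target)))
```

Run from the repository root after `generate` has produced `data/token_efficiency.json`; if the file is absent the script prints nothing.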