{"id":39925081,"url":"https://github.com/jserv/cjk-token-reducer","last_synced_at":"2026-01-22T21:02:10.872Z","repository":{"id":333121326,"uuid":"1136268551","full_name":"jserv/cjk-token-reducer","owner":"jserv","description":"Reduce Claude Code token usage by 35-50% when using CJK (Chinese, Japanese, and Korean)","archived":false,"fork":false,"pushed_at":"2026-01-18T06:25:22.000Z","size":109,"stargazers_count":9,"open_issues_count":0,"forks_count":3,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-01-22T06:01:54.204Z","etag":null,"topics":["cjk-tokenizer","claude-code","llm-inference"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jserv.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-17T11:30:57.000Z","updated_at":"2026-01-20T08:07:17.000Z","dependencies_parsed_at":"2026-01-21T20:01:15.403Z","dependency_job_id":null,"html_url":"https://github.com/jserv/cjk-token-reducer","commit_stats":null,"previous_names":["jserv/cjk-token-reducer"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/jserv/cjk-token-reducer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jserv%2Fcjk-token-reducer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jserv%2Fcjk-token-reducer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jserv%2Fcjk-token-reducer/releases"
,"manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jserv%2Fcjk-token-reducer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jserv","download_url":"https://codeload.github.com/jserv/cjk-token-reducer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jserv%2Fcjk-token-reducer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28671179,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-22T20:48:19.482Z","status":"ssl_error","status_checked_at":"2026-01-22T20:48:14.968Z","response_time":144,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cjk-tokenizer","claude-code","llm-inference"],"created_at":"2026-01-18T17:37:24.076Z","updated_at":"2026-01-22T21:02:10.818Z","avatar_url":"https://github.com/jserv.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# cjk-token-reducer\nReduce Claude Code token usage by 35-50% when using CJK languages.\n\n## The Problem\nCJK (Chinese, Japanese, Korean) languages consume 2-4x more tokens than English for the same semantic content.\nThis discrepancy leads to higher costs, faster context exhaustion, and reduced context windows for RAG/agent workflows.\n\n| Language | Avg Token Ratio | Typical Range | Notes 
|\n|----------|-----------------|---------------|-------|\n| Chinese | ~2.0-3.0x | 1.5-4.0x | Rare characters may split into 3-4 tokens |\n| Japanese | ~2.12x | 1.5-8.0x | Mixed Kanji/Kana creates segmentation challenges |\n| Korean | ~2.36x | 2.0-3.0x | Agglutinative nature compounds inefficiency |\n\n*Token ratios based on BPE tokenizer analysis. Actual savings depend on text complexity and technical term density.*\n\n### Why Does This Happen?\nThe inefficiency stems from the mechanics of Byte-Pair Encoding (BPE) and training data distribution:\n1. Vocabulary Bias: Modern tokenizers train primarily on English corpora.\n   Common English words merge into single tokens.\n   CJK characters, occurring less frequently in training data,\n   often fail to merge into \"words\" and split into individual character tokens or raw bytes.\n2. UTF-8 Byte Fallback: A common cause of token expansion.\n   - Many LLM tokenizers process text as UTF-8 bytes.\n   - An English character is 1 byte.\n   - A CJK character is typically 3 bytes in UTF-8.\n   - If a CJK character is absent from the tokenizer's vocabulary,\n     byte-level tokenizers may expand it into multiple tokens.\n     The exact expansion depends on the tokenizer's merge rules.\n3. 
Lack of Delimiters: English uses spaces as natural word boundaries,\n   aiding the tokenizer in identifying mergeable units.\n   CJK languages lack these delimiters,\n   forcing the tokenizer to rely purely on statistical frequency, which is lower for CJK sequences.\n\nThe consequence: API billing and model context limits measure tokens, not meaning.\nWriting in CJK incurs a \"tax\" on both cost and memory.\n\n## The Solution\nCJK Token Reducer translates your CJK input to English before sending it to Claude.\nEnglish is the \"native language\" of LLM tokenizers, so translation acts as a compression layer.\n\n### Key Features\n- Reduces input token count by 35-50% (up to 2x effective context window)\n- Preserves code blocks, file paths, and URLs (not sent for translation)\n- Auto-detects English technical terms (camelCase, PascalCase, SCREAMING_SNAKE_CASE)\n- macOS: Uses Apple NaturalLanguage framework for intelligent named entity recognition\n- Caches translations locally to eliminate redundant API calls\n- Uses free Google Translate API (no API key required)\n- Sends only prompt text for translation; code artifacts stay local\n- Adds 100-300ms latency per translation\n\n### Trade-offs and Limitations\nThis tool implements a \"Translate-Compute-Translate\" (TCT) pattern.\nWhile effective, it has inherent trade-offs:\n\n| Aspect | Impact |\n|--------|--------|\n| Semantic Fidelity | Translation is lossy. Technical terms may shift meaning. Use `[[markers]]` to preserve critical terms. |\n| Cultural Nuance | High-context CJK expressions may lose nuance when converted to English. |\n| Latency | Adds 1-3 API calls. Suitable for async/batch workflows; less ideal for real-time chat. |\n| Back-translation | Output translated back to CJK may sound unnatural (\"translationese\"). 
|\n\nWhen NOT to use this tool:\n- Precision-critical applications (legal, medical) where nuance matters\n- Real-time chat requiring minimal latency\n- When using native CJK-optimized models (DeepSeek V3, Qwen 2.5) which have efficient CJK tokenizers\n\nMitigation strategies:\n- Use `[[term]]` markers to preserve technical terms from translation\n- Enable `englishTerms` detection to auto-preserve English words in CJK text\n- Create custom glossaries for domain-specific terminology (planned feature)\n\n## Installation\n\n### Option 1: Cargo Install (Recommended)\n```shell\n# Linux/Windows\ncargo install --git https://github.com/jserv/cjk-token-reducer\n\n# macOS (with NLP support)\ncargo install --git https://github.com/jserv/cjk-token-reducer --features macos-nlp\n```\n\n### Option 2: Build from Source\n```shell\ngit clone https://github.com/jserv/cjk-token-reducer\ncd cjk-token-reducer\n\n# Linux/Windows\ncargo build --release\n\n# macOS (with NLP support)\ncargo build --release --features macos-nlp\n\n# Install (builds if needed, installs binary, configures Claude hook)\nmake install\n\n# Uninstall (removes binary and hook)\nmake uninstall\n```\n\nIf you prefer manual installation:\n```shell\ncp target/release/cjk-token-reducer ~/.local/bin/\nexport PATH=\"$HOME/.local/bin:$PATH\"\n```\n\n## Setup\n\n### 1. 
Configure Claude Code Hook\nAdd the following to your Claude Code settings file (usually `~/.claude/settings.json`).\nThis hook intercepts your prompt before submission.\n\n```json\n{\n  \"hooks\": {\n    \"UserPromptSubmit\": [\n      {\n        \"hooks\": [\n          {\n            \"type\": \"command\",\n            \"command\": \"cjk-token-reducer\"\n          }\n        ]\n      }\n    ]\n  }\n}\n```\n\nThe tool accepts JSON input `{\"prompt\": \"...\"}` on stdin and outputs modified JSON.\n\n#### How It Works\nThe hook intercepts at `UserPromptSubmit`, translating CJK prompts before Claude processes them:\n\n```\n┌──────────────────────────────────────────────────────────────┐\n│                      Claude Code Session                     │\n├──────────────────────────────────────────────────────────────┤\n│  SessionStart ─────► User types prompt (CJK)                 │\n│                           │                                  │\n│                           ▼                                  │\n│              ┌────────────────────────────┐                  │\n│              │    UserPromptSubmit        │                  │\n│              │  ┌──────────────────────┐  │                  │\n│              │  │  cjk-token-reducer   │  │ ◄─ Intercept     │\n│              │  │  - Detect CJK        │  │                  │\n│              │  │  - Check cache       │  │                  │\n│              │  │  - Translate → EN    │  │                  │\n│              │  │  - Preserve code     │  │                  │\n│              │  └──────────────────────┘  │                  │\n│              └────────────────────────────┘                  │\n│                           │                                  │\n│                           ▼                                  │\n│                   Claude processes (English prompt)          │\n│                           │                                  │\n│                   ┌───────┴───────┐               
           │\n│                   ▼               ▼                          │\n│              PreToolUse      (No tools)                      │\n│                   │               │                          │\n│                   ▼               │                          │\n│              Tool executes        │                          │\n│                   │               │                          │\n│                   ▼               │                          │\n│              PostToolUse          │                          │\n│                   │               │                          │\n│                   └───────┬───────┘                          │\n│                           ▼                                  │\n│                        Stop                                  │\n└──────────────────────────────────────────────────────────────┘\n```\n\n### 2. Configuration (Optional)\nCreate a `.cjk-token.json` file to customize behavior.\nThe tool searches these locations in order:\n\n1. Current directory: `./.cjk-token.json`\n2. Home directory: `~/.cjk-token.json`\n3. Config directory: `~/.config/cjk-token-reducer/.cjk-token.json`\n\n```json\n{\n  \"outputLanguage\": \"en\",\n  \"threshold\": 0.1,\n  \"enableStats\": true,\n  \"cache\": {\n    \"enabled\": true,\n    \"ttlDays\": 30,\n    \"maxSizeMb\": 10\n  },\n  \"preserve\": {\n    \"englishTerms\": true,\n    \"useNlp\": true\n  }\n}\n```\n\n#### Configuration Options\n| Option | Type | Default | Description |\n|--------|------|---------|-------------|\n| `outputLanguage` | string | `\"en\"` | Desired response language from Claude. See below. |\n| `threshold` | number | `0.1` | Ratio of CJK characters required to trigger translation (0.1 = 10%). |\n| `enableStats` | boolean | `true` | Track and save token usage statistics. |\n| `cache.enabled` | boolean | `true` | Enable translation caching to reduce API calls. |\n| `cache.ttlDays` | number | `30` | Cache entry time-to-live in days. 
|\n| `cache.maxSizeMb` | number | `10` | Maximum cache size in megabytes. |\n| `preserve.englishTerms` | boolean | `true` | Auto-detect and preserve English technical terms in CJK text. |\n| `preserve.useNlp` | boolean | `true` | Use macOS NLP for named entity detection (macOS only, falls back to regex). |\n\n#### Data Storage Locations\nThe tool stores translation cache and statistics in platform-specific directories:\n\n| Platform | Cache Directory | Statistics Directory |\n|----------|-----------------|---------------------|\n| Linux | `~/.cache/cjk-token-reducer/` | `~/.config/cjk-token-reducer/` |\n| macOS | `~/Library/Caches/cjk-token-reducer/` | `~/Library/Application Support/cjk-token-reducer/` |\n| Windows | `%LOCALAPPDATA%\\cjk-token-reducer\\` | `%APPDATA%\\cjk-token-reducer\\` |\n\nFiles within these directories:\n- `translations.db/` — sled embedded database for translation cache\n- `stats.json` — token usage statistics\n\n#### Output Language Settings\n- `\"en\"` (default): Claude responds in English.\n  This yields maximum token savings for both input and output.\n- `\"zh\"`, `\"ja\"`, `\"ko\"`: Instructs Claude to reply in the specified language.\n  Saves input tokens, but output remains in CJK and consumes more tokens than English output.\n\n#### Platform-Specific Features\n\n**macOS NLP Integration**\n\nOn macOS, the tool leverages Apple's NaturalLanguage framework for intelligent named entity recognition.\nThis provides ML-based detection of:\n\n| Entity Type | Examples | Benefit |\n|-------------|----------|---------|\n| Personal Names | Tim Cook, Elon Musk, Satya Nadella | Preserved without translation corruption |\n| Place Names | Silicon Valley, Tokyo, Seoul | Geographic terms stay intact |\n| Organization Names | Apple, Microsoft, Google | Company names remain recognizable |\n\nThe NLP detector also supports extended Latin characters (e.g., \"Rene\", \"Munchen\", \"Francois\")\nwhile correctly filtering out CJK names that should be 
translated.\n\n**Comparison: NLP vs Regex Detection**\n\n| Aspect | Regex (All Platforms) | NLP (macOS) |\n|--------|----------------------|-------------|\n| Technical Terms | camelCase, PascalCase, SNAKE_CASE | Same + named entities |\n| Proper Names | Only if capitalized patterns match | ML-based recognition |\n| Context Awareness | Pattern-based only | Semantic understanding |\n| Performance | Faster | ~10-50ms overhead per call |\n\nTo disable NLP and use regex-only detection on macOS:\n```json\n{\n  \"preserve\": {\n    \"useNlp\": false\n  }\n}\n```\n\n## Usage\nOnce installed and configured, use Claude Code normally.\n\n```shell\nclaude\n❯ 重構這個函式\n# Automatically translated to: \"Refactor this function\"\n\n❯ この関数をリファクタリングしてください\n# Automatically translated to: \"Please refactor this function\"\n\n❯ 이 함수 리팩토링 해줘\n# Automatically translated to: \"Refactor this function\"\n```\n\n### CLI Commands\n```shell\n# View token savings statistics\ncjk-token-reducer --stats\n\n# View cache statistics\ncjk-token-reducer --cache-stats\n\n# Clear translation cache\ncjk-token-reducer --clear-cache\n\n# Preview translation without sending (dry run)\ncjk-token-reducer --dry-run\n\n# Bypass cache for single translation\ncjk-token-reducer --no-cache\n```\n\n### Viewing Statistics\nTrack your token savings over time:\n\n```shell\ncjk-token-reducer --stats\n```\n\nOutput example:\n```text\n╔══════════════════════════════════════════════════════════╗\n║           CJK Token Reducer Statistics                   ║\n╠══════════════════════════════════════════════════════════╣\n║  Total Translations:            150                      ║\n║  Translation Tokens:           3200                      ║\n║  Estimated Saved:              8500                      ║\n╚══════════════════════════════════════════════════════════╝\n```\n\n## Privacy \u0026 Security\n- Translation Service: This tool uses the public Google Translate API.\n  Your text prompts are sent to Google's servers.\n- Code 
Security: The tool preserves code blocks and file paths locally,\n  preventing them from being sent to the translation service.\n- Data Handling: No data is stored by this tool other than local usage statistics (if enabled) and translation cache.\n\n## Development\n```shell\n# Build (Linux/Windows)\ncargo build\n\n# Build with NLP support (macOS only)\ncargo build --features macos-nlp\n\n# Run tests\ncargo test\n\n# Run tests with NLP (macOS only)\ncargo test --features macos-nlp\n\n# Build for release (macOS with NLP)\ncargo build --release --features macos-nlp\n```\n\n## Alternatives\nFor preserving original language while reducing tokens,\nconsider [LLMLingua](https://github.com/microsoft/LLMLingua), Microsoft's perplexity-based compression toolkit.\n\nHow LLMLingua Works:\n1. Uses a small LM (GPT-2 or LLaMA-7B) to compute token perplexity\n2. Removes tokens with low information content (high predictability)\n3. Preserves critical tokens that carry semantic weight\n\nWhen to Choose LLMLingua:\n- You need to preserve the original CJK language in prompts\n- Working with very long contexts (RAG, document Q\u0026A)\n- Compression ratio is more important than perfect fidelity\n- Integrated workflows (LangChain, LlamaIndex, Prompt Flow)\n\nWhen to Choose cjk-token-reducer:\n- Daily Claude Code usage with CJK input\n- Simplicity over maximum compression\n- No additional model inference overhead\n- Code-heavy prompts (LLMLingua may corrupt code blocks)\n\n## License\n`cjk-token-reducer` is available under a permissive MIT-style license.\nUse of this source code is governed by an MIT license that can be found in the [LICENSE](LICENSE) file.\n\n## References\n* Petrov et al. (2023): [Language Model Tokenizers Introduce Unfairness Between Languages](https://arxiv.org/abs/2305.15425) - Analysis of tokenization disparity, showing up to 15x inefficiency for some languages.\n* Ahia et al. (2023): [Do All Languages Cost the Same? 
Tokenization in the Era of Commercial Language Models](https://arxiv.org/abs/2305.13704) - Examines cost implications of tokenizer design on non-English languages.\n* Yennie Jun: [All Languages Are NOT Created (Tokenized) Equal](https://www.artfish.ai/p/all-languages-are-not-created-tokenized) - Visualizations and statistics on cross-lingual tokenization efficiency.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjserv%2Fcjk-token-reducer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjserv%2Fcjk-token-reducer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjserv%2Fcjk-token-reducer/lists"}