{"id":27147339,"url":"https://github.com/deltartificial/tokenizer","last_synced_at":"2025-04-10T01:13:18.656Z","repository":{"id":286498851,"uuid":"961580222","full_name":"deltartificial/tokenizer","owner":"deltartificial","description":"Compute long files token lengths for different LLM models, built in Rust.","archived":false,"fork":false,"pushed_at":"2025-04-06T20:27:43.000Z","size":45,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-10T01:13:14.962Z","etag":null,"topics":["context","llm","rust","token","tokenizer"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/deltartificial.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-04-06T19:56:59.000Z","updated_at":"2025-04-06T21:36:25.000Z","dependencies_parsed_at":"2025-04-06T21:24:35.489Z","dependency_job_id":null,"html_url":"https://github.com/deltartificial/tokenizer","commit_stats":null,"previous_names":["deltartificial/tokenizer"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deltartificial%2Ftokenizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deltartificial%2Ftokenizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deltartificial%2Ftokenizer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deltartificial%2Ftokenizer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/deltartificial","download_url":"https://codeload.github.com/deltartificial/tokenizer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248137891,"owners_count":21053775,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["context","llm","rust","token","tokenizer"],"created_at":"2025-04-08T11:25:51.338Z","updated_at":"2025-04-10T01:13:18.629Z","avatar_url":"https://github.com/deltartificial.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Tokenizer\n\nA CLI tool to compute token lengths of various file types (txt, md, pdf, html) for different LLM models.\n\n## Features\n\n- Calculate token counts for various file types (Text, Markdown, PDF, HTML)\n- Support for multiple LLM models (configurable via config.json)\n- Display token usage as percentage of context window\n- Powered by HuggingFace tokenizers library\n\n## Installation\n\nClone the repository and build the project:\n\n```bash\ngit clone https://github.com/deltartificial/tokenizer.git\ncd tokenizer\ncargo build --release\n```\n\n## Usage\n\n```bash\n# Count tokens in a file using the default config.json\n./target/release/tokenizer count path/to/your/file.txt\n\n# Count tokens using a custom config file\n./target/release/tokenizer count path/to/your/file.txt -c custom-config.json\n\n# Count tokens using a specific tokenizer model\n./target/release/tokenizer count path/to/your/file.html -t roberta-base\n```\n\n## Configuration\n\nThe tool uses a `config.json` file to define models and their context lengths. The default file includes configurations for various models:\n\n```json\n{\n  \"models\": [\n    {\n      \"name\": \"gpt-3.5-turbo\",\n      \"context_length\": 16385,\n      \"encoding\": \"tiktoken\"\n    },\n    {\n      \"name\": \"gpt-4\",\n      \"context_length\": 8192,\n      \"encoding\": \"tiktoken\"\n    },\n    {\n      \"name\": \"bert-base\",\n      \"context_length\": 512,\n      \"encoding\": \"bert\"\n    },\n    ...\n  ]\n}\n```\n\nYou can customize this file to add or modify models as needed.\n\n## Tokenization\n\nThis tool uses HuggingFace's tokenizers library, which provides high-performance implementations of various tokenization algorithms. The default tokenizer used is BERT, but the architecture is designed to be easily extended to support different tokenizers.\n\n## Supported File Types\n\n- `.txt` - Plain text files\n- `.md` - Markdown files\n- `.pdf` - PDF documents (basic implementation)\n- `.html`/`.htm` - HTML files (tags are stripped for token counting)\n\n## Project Structure\n\nThe project follows a clean architecture approach:\n\n- `domain`: Core business logic and entities\n- `application`: Use cases that orchestrate the domain logic\n- `infrastructure`: External services implementation (file reading, tokenization)\n- `presentation`: User interface (CLI)\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeltartificial%2Ftokenizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdeltartificial%2Ftokenizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeltartificial%2Ftokenizer/lists"}