https://github.com/endlessreform/token-counter
`wc` for tokens, using HuggingFace Tokenizers in Rust
- Host: GitHub
- URL: https://github.com/endlessreform/token-counter
- Owner: EndlessReform
- License: MIT
- Created: 2024-07-04T21:25:26.000Z (4 months ago)
- Default Branch: master
- Last Pushed: 2024-07-04T21:38:27.000Z (4 months ago)
- Last Synced: 2024-10-07T14:18:46.096Z (about 1 month ago)
- Language: Rust
- Size: 12.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# tc - Token Count
`tc` is a CLI tool for counting tokens in text files, as a lightweight wrapper around the HuggingFace [Tokenizers](https://docs.rs/tokenizers/latest/tokenizers/) crate. It's like the Unix `wc` command, but for tokens instead of words.
## Features
- Count tokens in files or from stdin
- Support for multiple files and glob patterns
- Uses any tokenizer in HuggingFace Tokenizers
## Installation
```
cargo install token-counter
```
### Usage
Using the default tokenizer ([cl100k](https://huggingface.co/DWDMaiMai/tiktoken_cl100k_base), the tokenizer for GPT-3.5 and GPT-4):
```
tc file1.md file2.md
```
Using globs:
```
tc *.md
```
Arguments:
- `-m`, `--model`: HuggingFace ID of the model whose tokenizer should be used (e.g. `google-bert/bert-base-uncased`)
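To illustrate the `wc`-style idea of per-file counts, here is a minimal standard-library sketch. It is not `tc`'s actual implementation: `count_tokens` and `report` are hypothetical names, whitespace splitting stands in for a real HuggingFace tokenizer (so the numbers will differ from cl100k or BERT counts), and the trailing "total" row is assumed from `wc`'s convention rather than confirmed for `tc`.

```rust
/// Hypothetical stand-in for a real tokenizer: counts whitespace-separated
/// chunks. A real tokenizer such as cl100k would give different numbers.
fn count_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Build `wc`-style rows: one (count, name) pair per input, plus a
/// "total" row when more than one input is given (assumed from `wc`).
fn report(inputs: &[(&str, &str)]) -> Vec<(usize, String)> {
    let mut rows: Vec<(usize, String)> = inputs
        .iter()
        .map(|(name, text)| (count_tokens(text), name.to_string()))
        .collect();
    if rows.len() > 1 {
        let total: usize = rows.iter().map(|(n, _)| *n).sum();
        rows.push((total, "total".to_string()));
    }
    rows
}

fn main() {
    // In-memory samples keep the sketch self-contained; a real tool
    // would read the named files or stdin instead.
    let inputs = [
        ("file1.md", "# Title\nSome text here."),
        ("file2.md", "Another short file."),
    ];
    for (count, name) in report(&inputs) {
        println!("{count:>8} {name}");
    }
}
```

Swapping `count_tokens` for a call into a real tokenizer would leave the reporting logic unchanged, which is roughly the sense in which a tool like this can be a thin wrapper.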