https://github.com/aniketpatidar/tokenizer
- Host: GitHub
- URL: https://github.com/aniketpatidar/tokenizer
- Owner: aniketpatidar
- Created: 2025-08-12T15:13:12.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2025-08-12T18:17:28.000Z (about 2 months ago)
- Last Synced: 2025-08-12T20:22:16.714Z (about 2 months ago)
- Language: JavaScript
- Size: 1.95 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# tinyTokenizer()
A tiny, character-level tokenizer in JavaScript.
## Features
- Deterministic special tokens mapped to fixed IDs 0, 1, 2, and 3
- Simple vocab builder: `buildVocab(text)` adds all characters present in `text`
- `encode`/`decode` are straightforward and predictable

## Requirements
- Node.js
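
Before the CLI examples below, the `buildVocab(text)` behaviour listed under Features can be sketched roughly like this (an illustrative guess, not the repository's actual code; the special-token names `<pad>`, `<unk>`, `<bos>`, and `<eos>` are assumptions, only their fixed IDs 0–3 come from the README):

```javascript
// Illustrative sketch of a character-level vocab builder.
// IDs 0-3 are reserved for special tokens (names here are hypothetical);
// every distinct character in `text` gets the next free ID.
function buildVocab(text) {
  const vocab = { "<pad>": 0, "<unk>": 1, "<bos>": 2, "<eos>": 3 };
  let nextId = 4;
  for (const ch of text) {
    if (!(ch in vocab)) vocab[ch] = nextId++;
  }
  return vocab;
}
```

Because insertion order is deterministic, training on the same corpus always produces the same vocab, which is what makes the encode/decode round trip predictable.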
## Usage
### 1) Build vocab from a corpus
Create `corpus.txt`, then run:
```bash
node tokenizer.js train corpus.txt vocab.json
```

Example output:
```
Trained. Vocab size: 128. Saved -> vocab.json
```

### 2) Encode text (using saved vocab)
```bash
node tokenizer.js encode vocab.json "hello world"
```

Example output (IDs are illustrative):
```
12,5,7,7,11,1,24,11,14,7,3
```

### 3) Decode IDs
```bash
node tokenizer.js decode vocab.json "12,5,7,7,11"
```

Example output:
```
hello
```