https://github.com/aniketpatidar/tokenizer
- Host: GitHub
- URL: https://github.com/aniketpatidar/tokenizer
- Owner: aniketpatidar
- Created: 2025-08-12T15:13:12.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2025-08-12T18:17:28.000Z (about 2 months ago)
- Last Synced: 2025-08-12T20:22:16.714Z (about 2 months ago)
- Language: JavaScript
- Size: 1.95 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# tinyTokenizer()
A tiny, character-level tokenizer in JavaScript.
## Features
- Deterministic special tokens mapped to fixed IDs 0, 1, 2, and 3
- Simple vocab builder: `buildVocab(text)` adds all characters present in `text`
- `encode`/`decode` are straightforward and predictable

## Requirements
- Node.js
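
Before the CLI examples below, the `buildVocab(text)` behaviour listed under Features can be sketched roughly like this (an illustrative guess, not the repository's actual code; the special-token names `<pad>`, `<unk>`, `<bos>`, and `<eos>` are assumptions, only their fixed IDs 0–3 come from the README):

```javascript
// Illustrative sketch of a character-level vocab builder.
// IDs 0-3 are reserved for special tokens (names here are hypothetical);
// every distinct character in `text` gets the next free ID.
function buildVocab(text) {
  const vocab = { "<pad>": 0, "<unk>": 1, "<bos>": 2, "<eos>": 3 };
  let nextId = 4;
  for (const ch of text) {
    if (!(ch in vocab)) vocab[ch] = nextId++;
  }
  return vocab;
}
```

Because insertion order is deterministic, training on the same corpus always produces the same vocab, which is what makes the encode/decode round trip predictable.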
## Usage
### 1) Build vocab from a corpus
Create `corpus.txt`, then run:
```bash
node tokenizer.js train corpus.txt vocab.json
```

Example output:
```
Trained. Vocab size: 128. Saved -> vocab.json
```

### 2) Encode text (using saved vocab)
```bash
node tokenizer.js encode vocab.json "hello world"
```

Example output (IDs are illustrative):
```
12,5,7,7,11,1,24,11,14,7,3
```

### 3) Decode IDs
```bash
node tokenizer.js decode vocab.json "12,5,7,7,11"
```

Example output:
```
hello
```