https://github.com/hectorifc/tessera

Tessera - A byte-level BPE tokenizer library in pure Kotlin. Educational, dependency-free, GPT-4 style pre-tokenization.
bpe byte-pair-encoding educational from-scratch gpt gradle jvm kotlin kotlin-jvm kotlin-library llm machine-learning natural-language-processing nlp tokenization tokenizer


Host: GitHub
URL: https://github.com/hectorifc/tessera
Owner: HectorIFC
License: mit
Created: 2026-05-20T00:51:39.000Z (about 1 month ago)
Default Branch: main
Last Pushed: 2026-05-20T19:29:06.000Z (about 1 month ago)
Last Synced: 2026-05-20T19:36:25.956Z (about 1 month ago)
Topics: bpe, byte-pair-encoding, educational, from-scratch, gpt, gradle, jvm, kotlin, kotlin-jvm, kotlin-library, llm, machine-learning, natural-language-processing, nlp, tokenization, tokenizer
Language: Kotlin
Homepage: https://hectorifc.github.io/tessera/
Size: 256 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

Tessera




  A byte-level BPE tokenizer library in pure Kotlin.


  Tessera — from Latin, a piece of mosaic. Each token is a tessera; together they form the mosaic of language.



## Status

See [ARCHITECTURE.md](./ARCHITECTURE.md) for internals and [BENCHMARKS.md](./BENCHMARKS.md) for test results.

## About

Tessera is a **Kotlin library** that implements a byte-level **Byte-Pair Encoding** (BPE) tokenizer in the style of GPT-4's `cl100k_base`. Built from scratch in **pure Kotlin**, with no ML framework dependencies, it is designed for developers who want to understand how modern tokenizers work and for Kotlin/JVM projects that need a lean, readable tokenization library.

### Principles

- **Library, not application** — designed to be consumed by other Kotlin projects

- **Pure Kotlin** — no DJL, no KInference, no ML frameworks

- **Standard library only** for tokenization logic

- **Byte-level** — base vocabulary of 256 bytes, supports any UTF-8 input

- **GPT-4 compatible approach** — pre-tokenization with `cl100k_base` regex

- **Minimal public API** — only what is necessary, explicitly marked
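To make the byte-level and pre-tokenization principles concrete, here is a minimal sketch using only the Kotlin standard library. The function names (`preTokenize`, `toByteIds`, `fromByteIds`) are hypothetical and not part of Tessera's public API, and the pattern is a `cl100k_base`-style regex written from the commonly published form; the regex Tessera actually ships may differ.

```kotlin
// Illustrative sketch only: every name below is hypothetical, not Tessera's API.

// Assumed cl100k_base-style split pattern; Tessera's exact regex may differ.
val preTokenizePattern = Regex(
    """(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""
)

/** Splits text into chunks (words, contractions, digit groups, punctuation, whitespace). */
fun preTokenize(text: String): List<String> =
    preTokenizePattern.findAll(text).map { it.value }.toList()

/** Maps a chunk to base token ids, one per UTF-8 byte (0..255). */
fun toByteIds(chunk: String): List<Int> =
    chunk.toByteArray(Charsets.UTF_8).map { it.toInt() and 0xFF }

/** Inverse of toByteIds for ids in 0..255. */
fun fromByteIds(ids: List<Int>): String =
    String(ByteArray(ids.size) { ids[it].toByte() }, Charsets.UTF_8)

fun main() {
    println(preTokenize("Hello, world!").joinToString("|")) // Hello|,| world|!
    println(toByteIds("é"))                                  // [195, 169]
    println(fromByteIds(toByteIds("héllo")) == "héllo")      // true
}
```

Because every possible byte already has an id, any UTF-8 input can be represented before a single merge is learned, which is what makes a lossless round trip possible.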

## Installation

### Gradle (Kotlin DSL)

```kotlin
// settings.gradle.kts
dependencyResolutionManagement {
    repositories {
        mavenCentral() // resolves the Kotlin stdlib and other transitive dependencies
        maven { url = uri("https://jitpack.io") }
    }
}

// build.gradle.kts
dependencies {
    implementation("com.github.HectorIFC:tessera:v0.0.7")
}
```

## Quick start

```kotlin
import dev.tessera.BpeTokenizer
import dev.tessera.Trainer
import dev.tessera.TrainingConfig

fun main() {
    // 1. Train a tokenizer from a corpus
    val tokenizer = Trainer(TrainingConfig(numMerges = 5000))
        .trainFromFile("corpus/text.txt")

    // 2. Save for later reuse
    tokenizer.save("tessera.json")

    // 3. Load and use
    val loaded = BpeTokenizer.load("tessera.json")
    val ids = loaded.encode("Hello, world!")
    val text = loaded.decode(ids)
    println("$ids → $text")
}
```

More examples in the [`tessera-samples`](./tessera-samples/) module.

## Project structure

This is a **Gradle multi-module** project:

```
tessera/
├── tessera-core/      ← The library (published artifact)
├── tessera-cli/       ← CLI application consuming the library
└── tessera-samples/   ← Usage examples
```

- **`tessera-core`**: the consumable JAR. Minimal public API, no runtime dependencies beyond Kotlin stdlib and kotlinx-serialization.

- **`tessera-cli`**: runnable application (`./gradlew :tessera-cli:run`) demonstrating the library in use.

- **`tessera-samples`**: small Kotlin programs with `main()` showing usage patterns.

## Running locally

```bash
# Build everything
./gradlew build

# Run tests
./gradlew test

# Run the full quality pipeline
./gradlew test koverVerify ktlintCheck detekt

# Install the library in Maven Local for testing in other projects
./gradlew publishToMavenLocal

# Run the CLI
./gradlew :tessera-cli:run --args="train --corpus corpus/text.txt --merges 5000 --output tessera.json"

# Run a sample
./gradlew :tessera-samples:run -PmainClass=dev.tessera.samples.QuickStartSampleKt
```

## Architecture

At a high level:

1. **Pre-tokenization**: text is split into logical chunks (words, contractions, numbers, punctuation) by a GPT-4-style regex.

2. **Byte conversion**: each chunk becomes a sequence of UTF-8 bytes (0–255).

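3. **BPE**: during training, the most frequent adjacent pair of tokens is merged into a new composite token, and the process repeats until the configured number of merges is reached.

4. **Greedy encode**: at encoding time, repeatedly apply the applicable merge with the lowest rank (the one learned earliest) until no learned merge applies, reproducing GPT behaviour.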
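The training and encoding steps above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not Tessera's implementation: all function names are hypothetical, merged token ids are assumed to be `256 + rank`, and ties between equally frequent pairs go to the pair seen first. Pre-tokenization is left out, so the whole input is treated as a single chunk.

```kotlin
// Illustrative sketch only: every name here is hypothetical, not Tessera's API.

/** UTF-8 bytes of a chunk as base token ids in 0..255. */
fun byteIds(chunk: String): List<Int> =
    chunk.toByteArray(Charsets.UTF_8).map { it.toInt() and 0xFF }

/** Replaces each non-overlapping occurrence of [pair] with [newId], scanning left to right. */
fun mergePair(ids: List<Int>, pair: Pair<Int, Int>, newId: Int): List<Int> {
    val out = ArrayList<Int>(ids.size)
    var i = 0
    while (i < ids.size) {
        if (i + 1 < ids.size && ids[i] == pair.first && ids[i + 1] == pair.second) {
            out.add(newId)
            i += 2
        } else {
            out.add(ids[i])
            i += 1
        }
    }
    return out
}

/** Learns up to [numMerges] merges. Map value = rank (0 = learned first); merged id = 256 + rank. */
fun train(corpusChunks: List<List<Int>>, numMerges: Int): Map<Pair<Int, Int>, Int> {
    var chunks = corpusChunks
    val ranks = LinkedHashMap<Pair<Int, Int>, Int>()
    for (rank in 0 until numMerges) {
        // Count adjacent pairs inside each chunk; pairs never span two chunks.
        val counts = LinkedHashMap<Pair<Int, Int>, Int>()
        for (chunk in chunks) for (p in chunk.zipWithNext()) counts[p] = (counts[p] ?: 0) + 1
        // Most frequent pair; ties go to the pair seen first (a choice made for this sketch).
        val best = counts.maxByOrNull { it.value }?.key ?: break
        ranks[best] = rank
        chunks = chunks.map { mergePair(it, best, 256 + rank) }
    }
    return ranks
}

/** Greedy encode: keep applying the lowest-ranked merge present until none applies. */
fun encodeChunk(chunk: String, ranks: Map<Pair<Int, Int>, Int>): List<Int> {
    var ids = byteIds(chunk)
    while (ids.size >= 2) {
        val best = ids.zipWithNext().filter { it in ranks }.minByOrNull { ranks.getValue(it) } ?: break
        ids = mergePair(ids, best, 256 + ranks.getValue(best))
    }
    return ids
}

/** Decode: expand every id back to its bytes, then read the bytes as UTF-8. */
fun decode(ids: List<Int>, ranks: Map<Pair<Int, Int>, Int>): String {
    val vocab = HashMap<Int, ByteArray>()
    for (b in 0..255) vocab[b] = byteArrayOf(b.toByte())
    // Rebuild merged tokens in rank order so both halves already exist.
    for ((pair, rank) in ranks.entries.sortedBy { it.value }) {
        vocab[256 + rank] = vocab.getValue(pair.first) + vocab.getValue(pair.second)
    }
    val bytes = ids.flatMap { vocab.getValue(it).asList() }.toByteArray()
    return String(bytes, Charsets.UTF_8)
}

fun main() {
    val ranks = train(listOf(byteIds("aaabdaaabac")), numMerges = 3)
    println(ranks)                        // {(97, 97)=0, (256, 97)=1, (257, 98)=2}
    val ids = encodeChunk("aaabdaaabac", ranks)
    println(ids)                          // [258, 100, 258, 97, 99]
    println(decode(ids, ranks))           // aaabdaaabac
}
```

In a full pipeline, the text would first be split by the pre-tokenization regex, each chunk would be encoded separately, and the resulting id lists would be concatenated. This sketch recounts every pair on each merge to keep the logic readable; that is far slower than an incremental approach on a real corpus.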

See [ARCHITECTURE.md](./ARCHITECTURE.md) (created in Phase 5) for technical details.

## Roadmap

- [x] Define scope and architecture

- [x] **Phase 0**: Gradle multi-module setup

- [x] **Phase 1**: Core library with round-trip guarantee and stable public API

- [x] **Phase 2**: Sample apps consuming the library

- [x] **Phase 3**: CLI consuming the library

- [x] **Phase 4**: Validation against tiktoken, fuzz tests, coverage ≥ 80%

- [x] **Phase 5**: JitPack publication, ARCHITECTURE.md, full KDoc

## Sister project

Once Tessera is complete, the next step is a **separate codebase** for embeddings that will consume `tessera-core` as a Gradle dependency.

## License

MIT — see [LICENSE](./LICENSE).
