{"id":51038879,"url":"https://github.com/hectorifc/tessera","last_synced_at":"2026-06-22T09:01:04.544Z","repository":{"id":359138947,"uuid":"1244096788","full_name":"HectorIFC/tessera","owner":"HectorIFC","description":"Tessera - A byte-level BPE tokenizer library in pure Kotlin. Educational, dependency-free, GPT-4 style pre-tokenization.","archived":false,"fork":false,"pushed_at":"2026-05-20T19:29:06.000Z","size":262,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-20T19:36:25.956Z","etag":null,"topics":["bpe","byte-pair-encoding","educational","from-scratch","gpt","gradle","jvm","kotlin","kotlin-jvm","kotlin-library","llm","machine-learning","natural-language-processing","nlp","tokenization","tokenizer"],"latest_commit_sha":null,"homepage":"https://hectorifc.github.io/tessera/","language":"Kotlin","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/HectorIFC.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-20T00:51:39.000Z","updated_at":"2026-05-20T19:29:10.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/HectorIFC/tessera","commit_stats":null,"previous_names":["hectorifc/tessera"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/HectorIFC/tessera","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HectorIFC%2Ftessera","tags_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/HectorIFC%2Ftessera/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HectorIFC%2Ftessera/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HectorIFC%2Ftessera/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/HectorIFC","download_url":"https://codeload.github.com/HectorIFC/tessera/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HectorIFC%2Ftessera/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34641636,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-22T02:00:06.391Z","response_time":106,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bpe","byte-pair-encoding","educational","from-scratch","gpt","gradle","jvm","kotlin","kotlin-jvm","kotlin-library","llm","machine-learning","natural-language-processing","nlp","tokenization","tokenizer"],"created_at":"2026-06-22T09:01:03.336Z","updated_at":"2026-06-22T09:01:04.530Z","avatar_url":"https://github.com/HectorIFC.png","language":"Kotlin","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/logo/logo-512.png\" alt=\"Tessera\" width=\"160\" /\u003e\n\u003c/p\u003e\n\n\u003ch1 
align=\"center\"\u003eTessera\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\n  A byte-level BPE tokenizer \u003cstrong\u003elibrary\u003c/strong\u003e in pure Kotlin.\u003cbr/\u003e\n  \u003cem\u003eTessera\u003c/em\u003e — from Latin, a piece of mosaic. Each token is a tessera; together they form the mosaic of language.\n\u003c/p\u003e\n\n## Status\n\nSee [ARCHITECTURE.md](./ARCHITECTURE.md) for internals and [BENCHMARKS.md](./BENCHMARKS.md) for test results.\n\n## About\n\nTessera is a **Kotlin library** that implements a byte-level **Byte-Pair Encoding** (BPE) tokenizer in the style of GPT-4's `cl100k_base`. Built from scratch in **pure Kotlin**, with no ML framework dependencies, it is designed for developers who want to understand how modern tokenizers work and for Kotlin/JVM projects that need a lean, readable tokenization library.\n\n### Principles\n\n- **Library, not application** — designed to be consumed by other Kotlin projects\n- **Pure Kotlin** — no DJL, no KInference, no ML frameworks\n- **Standard library only** for tokenization logic\n- **Byte-level** — base vocabulary of 256 bytes, supports any UTF-8 input\n- **GPT-4 compatible approach** — pre-tokenization with `cl100k_base` regex\n- **Minimal public API** — only what is necessary, explicitly marked\n\n## Installation (after v0.0.1 release)\n\n### Gradle (Kotlin DSL)\n\n```kotlin\n// settings.gradle.kts\ndependencyResolutionManagement {\n    repositories {\n        maven { url = uri(\"https://jitpack.io\") }\n    }\n}\n\n// build.gradle.kts\ndependencies {\n    implementation(\"com.github.HectorIFC:tessera:v0.0.7\")\n}\n```\n\n## Quick start\n\n```kotlin\nimport dev.tessera.BpeTokenizer\nimport dev.tessera.Trainer\nimport dev.tessera.TrainingConfig\n\nfun main() {\n    // 1. Train a tokenizer from a corpus\n    val tokenizer = Trainer(TrainingConfig(numMerges = 5000))\n        .trainFromFile(\"corpus/text.txt\")\n\n    // 2. 
Save for later reuse\n    tokenizer.save(\"tessera.json\")\n\n    // 3. Load and use\n    val loaded = BpeTokenizer.load(\"tessera.json\")\n    val ids = loaded.encode(\"Hello, world!\")\n    val text = loaded.decode(ids)\n    println(\"$ids → $text\")\n}\n```\n\nMore examples in the [`tessera-samples`](./tessera-samples/) module.\n\n## Project structure\n\nThis is a **Gradle multi-module** project:\n\n```\ntessera/\n├── tessera-core/      ← The library (published artifact)\n├── tessera-cli/       ← CLI application consuming the library\n└── tessera-samples/   ← Usage examples\n```\n\n- **`tessera-core`**: the consumable JAR. Minimal public API, no runtime dependencies beyond Kotlin stdlib and kotlinx-serialization.\n- **`tessera-cli`**: runnable application (`./gradlew :tessera-cli:run`) demonstrating the library in use.\n- **`tessera-samples`**: small Kotlin programs with `main()` showing usage patterns.\n\n## Running locally\n\n```bash\n# Build everything\n./gradlew build\n\n# Run tests\n./gradlew test\n\n# Run the full quality pipeline\n./gradlew test koverVerify ktlintCheck detekt\n\n# Install the library in Maven Local for testing in other projects\n./gradlew publishToMavenLocal\n\n# Run the CLI\n./gradlew :tessera-cli:run --args=\"train --corpus corpus/text.txt --merges 5000 --output tessera.json\"\n\n# Run a sample\n./gradlew :tessera-samples:run -PmainClass=dev.tessera.samples.QuickStartSampleKt\n```\n\n## Architecture\n\nAt a high level:\n\n1. **Pre-tokenization**: text is split into logical chunks (words, contractions, numbers, punctuation) by a GPT-4-style regex.\n2. **Byte conversion**: each chunk becomes a sequence of UTF-8 bytes (0–255).\n3. **BPE**: training iteratively merges the most frequent adjacent pair of tokens (initially single bytes), recording each merge in order and building composite tokens.\n4. 
**Greedy encode**: at inference time, repeatedly apply the applicable merge with the lowest rank (the one learned earliest) until no merge applies, reproducing GPT behaviour.\n\nSee [ARCHITECTURE.md](./ARCHITECTURE.md) (created in Phase 5) for technical details.\n\n## Roadmap\n\n- [x] Define scope and architecture\n- [x] **Phase 0**: Gradle multi-module setup\n- [x] **Phase 1**: Core library with round-trip guarantee and stable public API\n- [x] **Phase 2**: Sample apps consuming the library\n- [x] **Phase 3**: CLI consuming the library\n- [x] **Phase 4**: Validation against tiktoken, fuzz tests, coverage ≥ 80%\n- [x] **Phase 5**: JitPack publication, ARCHITECTURE.md, full KDoc\n\n## Sister project\n\nOnce Tessera is complete, the next step is a **separate codebase** for embeddings that will consume `tessera-core` as a Gradle dependency.\n\n## License\n\nMIT — see [LICENSE](./LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhectorifc%2Ftessera","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhectorifc%2Ftessera","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhectorifc%2Ftessera/lists"}