{"id":26590455,"url":"https://github.com/knuddelsgmbh/jtokkit","last_synced_at":"2025-03-23T13:39:47.478Z","repository":{"id":149988362,"uuid":"616182479","full_name":"knuddelsgmbh/jtokkit","owner":"knuddelsgmbh","description":"JTokkit is a Java tokenizer library designed for use with OpenAI models.","archived":false,"fork":false,"pushed_at":"2025-03-14T15:45:30.000Z","size":4577,"stargazers_count":629,"open_issues_count":13,"forks_count":44,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-03-14T16:37:08.288Z","etag":null,"topics":["java","openai"],"latest_commit_sha":null,"homepage":"https://jtokkit.knuddels.de/","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/knuddelsgmbh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-19T20:52:37.000Z","updated_at":"2025-03-13T15:23:48.000Z","dependencies_parsed_at":"2023-04-15T16:03:11.342Z","dependency_job_id":"43fe229c-212e-4e24-9372-3fde09a410ba","html_url":"https://github.com/knuddelsgmbh/jtokkit","commit_stats":null,"previous_names":[],"tags_count":10,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/knuddelsgmbh%2Fjtokkit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/knuddelsgmbh%2Fjtokkit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/knuddelsgmbh%2Fjtokkit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/knuddelsgmbh%2Fjtokkit/manifests","owner_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/owners/knuddelsgmbh","download_url":"https://codeload.github.com/knuddelsgmbh/jtokkit/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245111325,"owners_count":20562508,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["java","openai"],"created_at":"2025-03-23T13:39:45.624Z","updated_at":"2025-03-23T13:39:47.469Z","avatar_url":"https://github.com/knuddelsgmbh.png","language":"Java","funding_links":[],"categories":["Utility","人工智能","Calculators and Estimators"],"sub_categories":["Tokenizers"],"readme":"# 🚀 JTokkit - Java Tokenizer Kit\n\n[![License: MIT](https://img.shields.io/github/license/knuddelsgmbh/jtokkit)](https://opensource.org/license/mit/)\n![GitHub Workflow Status](https://img.shields.io/github/actions/workflow/status/knuddelsgmbh/jtokkit/build-publish.yml)\n![Maven Central](https://img.shields.io/maven-central/v/com.knuddels/jtokkit)\n[![javadoc](https://javadoc.io/badge2/com.knuddels/jtokkit/javadoc.svg)](https://javadoc.io/doc/com.knuddels/jtokkit)\n\nWelcome to JTokkit, a Java tokenizer library designed for use with OpenAI models.\n```java\nEncodingRegistry registry = Encodings.newDefaultEncodingRegistry();\nEncoding enc = registry.getEncoding(EncodingType.CL100K_BASE);\nassertEquals(\"hello world\", enc.decode(enc.encode(\"hello world\")));\n\n// Or get the tokenizer corresponding to a specific OpenAI model\nenc = registry.getEncodingForModel(ModelType.TEXT_EMBEDDING_ADA_002);\n```\n\n## 💡 Quickstart\n\nTo get started quickly, see our\n
[documentation](https://jtokkit.knuddels.de/).\n\n## 📖 Introduction\nJTokkit aims to be a fast and efficient tokenizer designed for use in natural\nlanguage processing tasks with OpenAI models. It provides an easy-to-use\ninterface for tokenizing input text, for example for counting required tokens\nin preparation for requests to the GPT-3.5 model. This library grew out of\nthe need for capabilities in the JVM ecosystem similar to those that\n[tiktoken](https://github.com/openai/tiktoken) provides for Python.\n\n## 🤖 Features\n\n✅ Implements encoding and decoding via `r50k_base`, `p50k_base`, `p50k_edit`,\n`cl100k_base` and `o200k_base`\n\n✅ Easy-to-use API\n\n✅ Easy extensibility for custom encoding algorithms\n\n✅ **Zero** Dependencies\n\n✅ Supports Java 8 and above\n\n✅ Fast and efficient performance\n\n## 📊 Performance\n\nJTokkit is 2 to 3 times faster than a comparable tokenizer.\n\n![benchmark](benchmark/reports/benchmark.svg)\n\nFor details on the benchmark, see the [benchmark](benchmark) directory.\n\n## 🛠️ Installation\nYou can install JTokkit by adding the following dependency to your Maven project:\n\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003ecom.knuddels\u003c/groupId\u003e\n    \u003cartifactId\u003ejtokkit\u003c/artifactId\u003e\n    \u003cversion\u003e1.1.0\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\nAlternatively, using Gradle:\n\n```groovy\ndependencies {\n    implementation 'com.knuddels:jtokkit:1.1.0'\n}\n```\n\n## 🔰 Getting Started\nTo use JTokkit, simply create a new `EncodingRegistry` and use `getEncoding` to\nretrieve the encoding you want to use.\n
You can then use the `encode` and\n`decode` methods to encode and decode text.\n\n```java\nEncodingRegistry registry = Encodings.newDefaultEncodingRegistry();\nEncoding enc = registry.getEncoding(EncodingType.CL100K_BASE);\nIntArrayList encoded = enc.encode(\"This is a sample sentence.\");\n// encoded = [2028, 374, 264, 6205, 11914, 13]\n        \nString decoded = enc.decode(encoded);\n// decoded = \"This is a sample sentence.\"\n\n// Or get the tokenizer based on the model type\nEncoding secondEnc = registry.getEncodingForModel(ModelType.TEXT_EMBEDDING_ADA_002);\n// enc == secondEnc\n```\n\nThe `EncodingRegistry` and `Encoding` classes are thread-safe and can be freely\nshared among components.\n\n## ➰ Extending JTokkit\n\nYou may want to extend JTokkit to support custom encodings. To do so, you have two\noptions:\n\n1. Implement the `Encoding` interface and register it with the `EncodingRegistry`\n```java\nEncodingRegistry registry = Encodings.newDefaultEncodingRegistry();\nEncoding customEncoding = new CustomEncoding();\nregistry.registerEncoding(customEncoding);\n```\n2. Add new parameters for use with the existing BPE algorithm\n```java\nEncodingRegistry registry = Encodings.newDefaultEncodingRegistry();\nGptBytePairEncodingParams params = new GptBytePairEncodingParams(\n        \"custom-name\",\n        Pattern.compile(\"some custom pattern\"),\n        encodingMap,\n        specialTokenEncodingMap\n);\nregistry.registerGptBytePairEncoding(params);\n```\n\nAfterwards you can use the custom encodings alongside the default ones and access\nthem by using `registry.getEncoding(\"custom-name\")`. See the JavaDoc for more\ndetails.\n\n## 📄 License\nJTokkit is licensed under the MIT License. 
See the\n[LICENSE](https://github.com/knuddelsgmbh/jtokkit/blob/main/LICENSE) file\nfor more information.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fknuddelsgmbh%2Fjtokkit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fknuddelsgmbh%2Fjtokkit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fknuddelsgmbh%2Fjtokkit/lists"}