https://github.com/knuddelsgmbh/jtokkit
JTokkit is a Java tokenizer library designed for use with OpenAI models.
java openai
Last synced: about 1 year ago
- Host: GitHub
- URL: https://github.com/knuddelsgmbh/jtokkit
- Owner: knuddelsgmbh
- License: mit
- Created: 2023-03-19T20:52:37.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2025-03-14T15:45:30.000Z (over 1 year ago)
- Last Synced: 2025-03-14T16:37:08.288Z (over 1 year ago)
- Topics: java, openai
- Language: Java
- Homepage: https://jtokkit.knuddels.de/
- Size: 4.36 MB
- Stars: 629
- Watchers: 6
- Forks: 44
- Open Issues: 13
Metadata Files:
- Readme: README.md
- License: LICENSE
- Codeowners: .github/CODEOWNERS
Awesome Lists containing this project
- awesome-java - JTokkit
- awesome-llm-cost - jtokkit - Java port of tiktoken. (Calculators and Estimators / Tokenizers)
README
# 🚀 JTokkit - Java Tokenizer Kit
[License: MIT](https://opensource.org/license/mit/)
[JavaDoc](https://javadoc.io/doc/com.knuddels/jtokkit)
Welcome to JTokkit, a Java tokenizer library designed for use with OpenAI models.
```java
EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
Encoding enc = registry.getEncoding(EncodingType.CL100K_BASE);
assertEquals("hello world", enc.decode(enc.encode("hello world")));
// Or get the tokenizer corresponding to a specific OpenAI model
enc = registry.getEncodingForModel(ModelType.TEXT_EMBEDDING_ADA_002);
```
## 💡 Quickstart
To get started quickly, see our [documentation](https://jtokkit.knuddels.de/).
## 📖 Introduction
JTokkit aims to be a fast and efficient tokenizer for natural language
processing tasks that use OpenAI models. It provides an easy-to-use interface
for tokenizing input text, for example to count the tokens a request to the
GPT-3.5 model will require. The library grew out of the need for capabilities
in the JVM ecosystem similar to those that
[tiktoken](https://github.com/openai/tiktoken) provides for Python.
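As a rough illustration of the token-counting use case, the sketch below checks whether a prompt fits within a token budget. The `TokenBudget` class, its method names, and the word-based stand-in counter are assumptions made for this example and are not part of JTokkit. With JTokkit on the classpath, a method reference such as `enc::countTokens` could be passed in place of the stand-in.

```java
import java.util.function.ToIntFunction;

// Hypothetical helper, not part of JTokkit: checks prompts against a token budget.
public class TokenBudget {
    // Any function that maps text to a token count,
    // e.g. enc::countTokens when a JTokkit Encoding is available.
    private final ToIntFunction<String> counter;
    private final int maxTokens;

    public TokenBudget(ToIntFunction<String> counter, int maxTokens) {
        if (maxTokens < 0) {
            throw new IllegalArgumentException("maxTokens must not be negative");
        }
        this.counter = counter;
        this.maxTokens = maxTokens;
    }

    // Tokens left after the prompt; negative when the prompt exceeds the budget.
    public int remaining(String prompt) {
        return maxTokens - counter.applyAsInt(prompt);
    }

    public boolean fits(String prompt) {
        return remaining(prompt) >= 0;
    }

    // Stand-in counter for this demo only: counts whitespace-separated words.
    // A real tokenizer will usually report different numbers.
    public static int countWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        TokenBudget budget = new TokenBudget(TokenBudget::countWords, 5);
        System.out.println(budget.fits("This is a sample sentence."));        // true
        System.out.println(budget.remaining("one two three four five six")); // -1
    }
}
```

Keeping the counter behind `ToIntFunction<String>` means the budget logic can be unit-tested without loading any encoding data.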
## 🤖 Features
✅ Implements encoding and decoding via `r50k_base`, `p50k_base`, `p50k_edit`,
`cl100k_base` and `o200k_base`
✅ Easy-to-use API
✅ Easy extensibility for custom encoding algorithms
✅ **Zero** Dependencies
✅ Supports Java 8 and above
✅ Fast and efficient performance
## 📊 Performance
JTokkit is 2 to 3 times faster than a comparable tokenizer.

For details on the benchmark, see the [benchmark](benchmark) directory.
## 🛠️ Installation
You can install JTokkit by adding the following dependency to your Maven project:
```xml
<dependency>
    <groupId>com.knuddels</groupId>
    <artifactId>jtokkit</artifactId>
    <version>1.1.0</version>
</dependency>
```
Or alternatively using Gradle:
```groovy
dependencies {
    implementation 'com.knuddels:jtokkit:1.1.0'
}
```
## 🔰 Getting Started
To use JTokkit, simply create a new `EncodingRegistry` and use `getEncoding` to
retrieve the encoding you want to use. You can then use the `encode` and
`decode` methods to encode and decode text.
```java
EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
Encoding enc = registry.getEncoding(EncodingType.CL100K_BASE);
IntArrayList encoded = enc.encode("This is a sample sentence.");
// encoded = [2028, 374, 264, 6205, 11914, 13]
String decoded = enc.decode(encoded);
// decoded = "This is a sample sentence."
// Or get the tokenizer based on the model type
Encoding secondEnc = registry.getEncodingForModel(ModelType.TEXT_EMBEDDING_ADA_002);
// enc == secondEnc
```
The `EncodingRegistry` and `Encoding` classes are thread-safe and can be freely
shared among components.
## ➰ Extending JTokkit
You may want to extend JTokkit to support custom encodings. To do so, you have two
options:
1. Implement the `Encoding` interface and register it with the `EncodingRegistry`
```java
EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
Encoding customEncoding = new CustomEncoding();
registry.registerEncoding(customEncoding);
```
2. Add new parameters for use with the existing BPE algorithm
```java
EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
GptBytePairEncodingParams params = new GptBytePairEncodingParams(
    "custom-name",
    Pattern.compile("some custom pattern"),
    encodingMap,
    specialTokenEncodingMap
);
registry.registerGptBytePairEncoding(params);
```
Afterwards you can use the custom encodings alongside the default ones and access
them by using `registry.getEncoding("custom-name")`. See the JavaDoc for more
details.
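
To give a feel for what a rank map such as `encodingMap` represents, here is a self-contained toy sketch of rank-driven pair merging, the core idea behind byte pair encoding. It is not JTokkit's implementation: the `ToyBpe` class, its string-keyed rank map, and the sample vocabulary are all assumptions chosen for readability. Real encodings operate on bytes and first split the text with the configured pattern, which this sketch omits.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of rank-driven pair merging; NOT JTokkit's implementation.
public class ToyBpe {
    // Token text -> rank. The rank doubles as the token id in this sketch.
    private final Map<String, Integer> ranks;
    // Inverse map, used only for decoding.
    private final Map<Integer, String> tokens = new HashMap<>();

    public ToyBpe(Map<String, Integer> ranks) {
        this.ranks = new HashMap<>(ranks);
        for (Map.Entry<String, Integer> e : ranks.entrySet()) {
            tokens.put(e.getValue(), e.getKey());
        }
    }

    // Hypothetical sample vocabulary, made up for this example.
    public static ToyBpe sample() {
        Map<String, Integer> ranks = new HashMap<>();
        ranks.put("a", 0);
        ranks.put("b", 1);
        ranks.put("c", 2);
        ranks.put("ab", 3);
        ranks.put("abc", 4);
        return new ToyBpe(ranks);
    }

    public List<Integer> encode(String text) {
        // Start with one part per character.
        List<String> parts = new ArrayList<>();
        for (char c : text.toCharArray()) {
            parts.add(String.valueOf(c));
        }
        // Repeatedly merge the adjacent pair with the lowest rank.
        while (parts.size() > 1) {
            int bestIndex = -1;
            int bestRank = Integer.MAX_VALUE;
            for (int i = 0; i < parts.size() - 1; i++) {
                Integer rank = ranks.get(parts.get(i) + parts.get(i + 1));
                if (rank != null && rank < bestRank) {
                    bestRank = rank;
                    bestIndex = i;
                }
            }
            if (bestIndex < 0) {
                break; // no adjacent pair is in the vocabulary
            }
            String merged = parts.get(bestIndex) + parts.get(bestIndex + 1);
            parts.remove(bestIndex + 1);
            parts.set(bestIndex, merged);
        }
        // Map each remaining part to its id.
        List<Integer> ids = new ArrayList<>();
        for (String part : parts) {
            Integer id = ranks.get(part);
            if (id == null) {
                throw new IllegalArgumentException("Not in vocabulary: " + part);
            }
            ids.add(id);
        }
        return ids;
    }

    public String decode(List<Integer> ids) {
        StringBuilder sb = new StringBuilder();
        for (int id : ids) {
            String token = tokens.get(id);
            if (token == null) {
                throw new IllegalArgumentException("Unknown token id: " + id);
            }
            sb.append(token);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ToyBpe bpe = ToyBpe.sample();
        List<Integer> ids = bpe.encode("abcab");
        System.out.println(ids);             // [4, 3]
        System.out.println(bpe.decode(ids)); // abcab
    }
}
```

Lower ranks are merged first, so the order of entries in the rank map determines how a given text is split into tokens.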
## 📄 License
JTokkit is licensed under the MIT License. See the
[LICENSE](https://github.com/knuddelsgmbh/jtokkit/blob/main/LICENSE) file
for more information.