https://github.com/aallam/ktoken
Kotlin multiplatform BPE tokenizer library for OpenAI models
- Host: GitHub
- URL: https://github.com/aallam/ktoken
- Owner: aallam
- License: mit
- Created: 2023-10-04T19:48:37.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-01-27T20:46:50.000Z (over 1 year ago)
- Last Synced: 2025-05-05T21:11:43.083Z (about 1 year ago)
- Topics: binary-p, bpe, byte-pair-encoding, gpt, kotlin, openai, tiktoken, tokenizer
- Language: Kotlin
- Homepage:
- Size: 10.7 MB
- Stars: 32
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.md
# Ktoken
**Ktoken** is a BPE tokenizer designed for seamless integration with OpenAI's models.
## 📦 Setup
Install **Ktoken** by adding the dependency to your `build.gradle` file:
```groovy
repositories {
    mavenCentral()
}

dependencies {
    implementation "com.aallam.ktoken:ktoken:0.4.0"
}
```
## ⚡️ Getting Started
```kotlin
// Create a tokenizer from an encoding:
val tokenizer = Tokenizer.of(encoding = Encoding.CL100K_BASE)
// Or for a specific model in the OpenAI API:
val gpt4Tokenizer = Tokenizer.of(model = "gpt-4")

// Text to token ids, and token ids back to text
val tokens = tokenizer.encode("hello world")
val text = tokenizer.decode(listOf(15339, 1917))
```
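As a quick sanity check, the two calls above can be combined into a round trip: encode a string, decode the resulting ids, and compare with the input. The following is a minimal sketch, not part of the library. It assumes the classes live in the `com.aallam.ktoken` package and that `kotlinx-coroutines-core` is on the classpath; `roundTrip` is a hypothetical helper name. `runBlocking` is used only so that the snippet compiles whether or not `Tokenizer.of` is declared as a suspending function.

```kotlin
import com.aallam.ktoken.Encoding   // assumed package name
import com.aallam.ktoken.Tokenizer  // assumed package name
import kotlinx.coroutines.runBlocking

// Hypothetical helper: encodes `text`, decodes the ids again, and returns both.
// runBlocking is an assumption made so the calls compile whether or not the
// library declares them as suspending functions.
fun roundTrip(text: String): Pair<List<Int>, String> = runBlocking {
    val tokenizer = Tokenizer.of(encoding = Encoding.CL100K_BASE)
    val tokens = tokenizer.encode(text)   // token ids
    tokens to tokenizer.decode(tokens)    // ids paired with the restored text
}

fun main() {
    val (tokens, restored) = roundTrip("hello world")
    println("${tokens.size} tokens: $tokens")
    check(restored == "hello world") { "round trip changed the text" }
}
```

The length of the encoded list is also the number to compare against a model's context limit when budgeting prompts.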
### ⚙️ Usage Modes
Ktoken operates in two modes: Local (default for JVM) and Remote (default for JS/Native).
#### 📍 Local Mode
Use `LocalPbeLoader` to load encodings from local files:
```kotlin
// Load the encoding files from the local file system
val tokenizer = Tokenizer.of(encoding = Encoding.CL100K_BASE, loader = LocalPbeLoader(FileSystem.SYSTEM))
// Or for a specific model in the OpenAI API:
val gpt4Tokenizer = Tokenizer.of(model = "gpt-4", loader = LocalPbeLoader(FileSystem.SYSTEM))
```
##### JVM Specifics:
The JVM artifacts bundle the encoding files. Use `FileSystem.RESOURCES` to load them:
```kotlin
val tokenizer = Tokenizer.of(encoding = Encoding.CL100K_BASE, loader = LocalPbeLoader(FileSystem.RESOURCES))
```
*Note: this is the default behavior for JVM.*
#### 🌐 Remote Mode
1. Add an engine: add one of [Ktor's engines](https://ktor.io/docs/http-client-engines.html) to your dependencies.
2. Use `RemoteBpeLoader` to load encodings from remote sources:
```kotlin
// Fetch the encoding files over HTTP using the configured Ktor engine
val tokenizer = Tokenizer.of(encoding = Encoding.CL100K_BASE, loader = RemoteBpeLoader())
// Or for a specific model in the OpenAI API:
val gpt4Tokenizer = Tokenizer.of(model = "gpt-4", loader = RemoteBpeLoader())
```
### 📋 BOM Usage
Alternatively, you can use [ktoken-bom](/ktoken-bom) by adding the following dependencies to your `build.gradle` file:
```groovy
dependencies {
    // Import the Ktoken BOM
    implementation platform('com.aallam.ktoken:ktoken-bom:0.4.0')

    // Define dependencies without versions
    implementation 'com.aallam.ktoken:ktoken'
    runtimeOnly 'io.ktor:ktor-client-okhttp'
}
```
### 🔀 Multiplatform Projects
For multiplatform projects, add the **ktoken** dependency to `commonMain`, and select an [engine](https://ktor.io/docs/http-client-engines.html) for each target.
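A minimal `build.gradle.kts` sketch of that layout follows. It assumes the Kotlin Multiplatform plugin is already applied. The chosen targets (`jvm`, `js`, `iosArm64`), the engine picked for each, and the `ktorVersion` value are illustrative assumptions rather than requirements of Ktoken; substitute the targets and Ktor version your project uses.

```kotlin
// build.gradle.kts (fragment): assumes the Kotlin Multiplatform plugin is applied.

// Assumption: example value only; match the Ktor version used in your project.
val ktorVersion = "2.3.12"

kotlin {
    // Illustrative targets; declare the ones your project needs.
    jvm()
    js { browser() }
    iosArm64()

    sourceSets {
        val commonMain by getting {
            dependencies {
                // Shared tokenizer API
                implementation("com.aallam.ktoken:ktoken:0.4.0")
            }
        }
        val jvmMain by getting {
            dependencies {
                // Ktor engine for the JVM target
                implementation("io.ktor:ktor-client-okhttp:$ktorVersion")
            }
        }
        val jsMain by getting {
            dependencies {
                // Ktor engine for the JS target
                implementation("io.ktor:ktor-client-js:$ktorVersion")
            }
        }
        val iosArm64Main by getting {
            dependencies {
                // Ktor engine for Apple targets
                implementation("io.ktor:ktor-client-darwin:$ktorVersion")
            }
        }
    }
}
```

An engine is only needed on targets that load encodings remotely; a JVM target that relies on the bundled encoding files may not need one.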
## 📄 License
Ktoken is open-source software distributed under the [MIT license](LICENSE.md).
**This project is not affiliated with or endorsed by OpenAI**.