https://github.com/aallam/ktoken

Kotlin multiplatform BPE tokenizer library for OpenAI models

Host: GitHub
URL: https://github.com/aallam/ktoken
Owner: aallam
License: mit
Created: 2023-10-04T19:48:37.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2025-01-27T20:46:50.000Z (over 1 year ago)
Last Synced: 2025-05-05T21:11:43.083Z (about 1 year ago)
Topics: binary-p, bpe, byte-pair-encoding, gpt, kotlin, openai, tiktoken, tokenizer
Language: Kotlin
Homepage:
Size: 10.7 MB
Stars: 32
Watchers: 2
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.md

# Ktoken

[![Maven Central](https://img.shields.io/maven-central/v/com.aallam.ktoken/ktoken?color=blue&label=Download)](https://central.sonatype.com/namespace/com.aallam.ktoken)

[![License](https://img.shields.io/github/license/aallam/ktoken?color=yellow)](LICENSE.md)

[![Documentation](https://img.shields.io/badge/docs-api-a97bff.svg?logo=kotlin)](https://mouaad.aallam.com/ktoken/ktoken)

**Ktoken** is a BPE tokenizer designed for seamless integration with OpenAI's models.

## 📦 Setup

Install **Ktoken** by adding the dependency to your `build.gradle` file:

```groovy
repositories {
    mavenCentral()
}

dependencies {
    implementation "com.aallam.ktoken:ktoken:0.4.0"
}
```

## ⚡️ Getting Started

```kotlin
// Create a tokenizer for a given encoding:
val tokenizer = Tokenizer.of(encoding = Encoding.CL100K_BASE)
// Or, for a specific model in the OpenAI API:
// val tokenizer = Tokenizer.of(model = "gpt-4")

// Encode text into token IDs, and decode token IDs back into text:
val tokens = tokenizer.encode("hello world")
val text = tokenizer.decode(listOf(15339, 1917))
```

### ⚙️ Usage Modes

Ktoken operates in two modes: Local (default for JVM) and Remote (default for JS/Native).

#### 📍 Local Mode

Use `LocalPbeLoader` to load encodings from local files:

```kotlin
val tokenizer = Tokenizer.of(encoding = Encoding.CL100K_BASE, loader = LocalPbeLoader(FileSystem.SYSTEM))
// Or, for a specific model in the OpenAI API:
// val tokenizer = Tokenizer.of(model = "gpt-4", loader = LocalPbeLoader(FileSystem.SYSTEM))
```

##### JVM Specifics:

The JVM artifacts bundle the encoding files. Use `FileSystem.RESOURCES` to load them:

```kotlin
val tokenizer = Tokenizer.of(encoding = Encoding.CL100K_BASE, loader = LocalPbeLoader(FileSystem.RESOURCES))
```

*Note: this is the default behavior for JVM.*

#### 🌐 Remote Mode

1. Add an engine: add one of [Ktor's engines](https://ktor.io/docs/http-client-engines.html) to your dependencies.

2. Use `RemoteBpeLoader` to load encodings from remote sources:

```kotlin
val tokenizer = Tokenizer.of(encoding = Encoding.CL100K_BASE, loader = RemoteBpeLoader())
// Or, for a specific model in the OpenAI API:
// val tokenizer = Tokenizer.of(model = "gpt-4", loader = RemoteBpeLoader())
```

### 📋 BOM Usage

Alternatively, you can use [ktoken-bom](/ktoken-bom) to manage versions by adding the following to your `build.gradle` file:

```groovy
dependencies {
    // Import the Ktoken BOM
    implementation platform('com.aallam.ktoken:ktoken-bom:0.4.0')

    // Define dependencies without versions
    implementation 'com.aallam.ktoken:ktoken'
    runtimeOnly 'io.ktor:ktor-client-okhttp'
}
```

### 🔀 Multiplatform Projects

For multiplatform projects, add the **ktoken** dependency to `commonMain`, and select an [engine](https://ktor.io/docs/http-client-engines.html) for each target.
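
As a rough illustration of the setup described above, here is a minimal `build.gradle.kts` sketch. It is not taken from the Ktoken documentation: the choice of targets (`jvm`, `js`), the engine picked for each target, and the `ktorVersion` placeholder are all assumptions made for the example. Native targets would follow the same pattern with an engine that supports them.

```kotlin
// build.gradle.kts (sketch; targets and engine choices are illustrative assumptions)
val ktorVersion = "<ktor-version>" // placeholder: substitute the Ktor release you use

kotlin {
    jvm()
    js(IR) { browser() }

    sourceSets {
        val commonMain by getting {
            dependencies {
                // Shared tokenizer dependency, visible to every target
                implementation("com.aallam.ktoken:ktoken:0.4.0")
            }
        }
        val jvmMain by getting {
            dependencies {
                // OkHttp engine for the JVM target
                implementation("io.ktor:ktor-client-okhttp:$ktorVersion")
            }
        }
        val jsMain by getting {
            dependencies {
                // JS engine for the JavaScript target
                implementation("io.ktor:ktor-client-js:$ktorVersion")
            }
        }
    }
}
```

Declaring the engine per target keeps `commonMain` free of platform-specific HTTP code, so only the source set for each platform pulls in its own engine.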

## 📄 License

Ktoken is open-source software distributed under the [MIT license](LICENSE.md).

**This project is not affiliated with nor endorsed by OpenAI**.
