https://github.com/hyunwoongko/gpt2-tokenizer-java
Java implementation of GPT2 tokenizer.
- Host: GitHub
- URL: https://github.com/hyunwoongko/gpt2-tokenizer-java
- Owner: hyunwoongko
- License: apache-2.0
- Created: 2022-05-14T09:41:26.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2023-02-05T19:00:26.000Z (over 3 years ago)
- Last Synced: 2025-04-04T07:11:29.488Z (about 1 year ago)
- Language: Java
- Size: 636 KB
- Stars: 67
- Watchers: 1
- Forks: 17
- Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-java - GPT2 Tokenizer Java
README
# GPT2 Tokenizer Java
Java implementation of GPT2 tokenizer
## Requirements
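Add the following dependencies to your Gradle build to use the library. The `spring-boot-starter-web` entry has no version, so it relies on the Spring Boot plugin or a BOM to supply one.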
```
implementation 'com.google.api-client:google-api-client:1.32.2'
implementation 'org.apache.commons:commons-lang3:3.12.0'
implementation 'org.springframework.boot:spring-boot-starter-web'
testImplementation 'org.junit.jupiter:junit-jupiter-api:5.3.1'
testRuntimeOnly 'org.junit.jupiter:junit-jupiter-engine:5.3.1'
```
## Add tokenizer files to resources directory
Add the `encoder.json` and `vocab.bpe` files to your project's resources directory.
Both files can be found [here](https://github.com/hyunwoongko/gpt2-tokenizer-java/tree/master/src/main/resources/tokenizers/gpt2).
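Before loading the tokenizer, it can help to confirm that both files are actually visible at runtime. The following is a minimal sketch, not part of this library: `TokenizerResourceCheck` and `missingFiles` are hypothetical names, and it assumes the files are read from the classpath under the directory you later pass to `fromPretrained` (for example `tokenizers/gpt2`, mirroring the layout linked above).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of gpt2-tokenizer-java.
final class TokenizerResourceCheck {
    // File names required by the README.
    private static final String[] REQUIRED = {"encoder.json", "vocab.bpe"};

    // Returns the required file names that are NOT found on the classpath
    // under resourceDir. An empty list means both files are present.
    static List<String> missingFiles(String resourceDir) {
        // ClassLoader resource paths are relative, so strip leading and trailing slashes.
        String dir = resourceDir.replaceAll("^/+|/+$", "");
        ClassLoader loader = TokenizerResourceCheck.class.getClassLoader();
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            String path = dir.isEmpty() ? name : dir + "/" + name;
            if (loader.getResource(path) == null) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumed directory; replace it with your own path inside resources.
        System.out.println("Missing: " + missingFiles("tokenizers/gpt2"));
    }
}
```

If the returned list is non-empty, the files are most likely in a different directory than the one passed in, or they were not copied into the build output.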
## Usage
The following examples show basic usage of the library.
For the corresponding test code, see [GPT2TokenizerTest.java](https://github.com/hyunwoongko/gpt2-tokenizer-java/blob/master/src/test/java/ai/tunib/tokenizer/GPT2TokenizerTest.java).
### Encoding text to tokens
```java
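import ai.tunib.tokenizer.GPT2Tokenizer;
import java.util.List;

// Load the tokenizer files from the given path inside the resources directory.
GPT2Tokenizer tokenizer = GPT2Tokenizer.fromPretrained("PATH/IN/RESOURCES");
// Encode the text into GPT-2 token IDs.
List<Integer> result = tokenizer.encode("Hello my name is Kevin.");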
```
```
[15496, 616, 1438, 318, 7939, 13]
```
### Decoding tokens to text
```java
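import ai.tunib.tokenizer.GPT2Tokenizer;
import java.util.List;

// Load the tokenizer files from the given path inside the resources directory.
GPT2Tokenizer tokenizer = GPT2Tokenizer.fromPretrained("PATH/IN/RESOURCES");
// Decode the GPT-2 token IDs back into text.
String result = tokenizer.decode(List.of(15496, 616, 1438, 318, 7939, 13));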
```
```
"Hello my name is Kevin."
```
## License
This project is licensed under the terms of the Apache License 2.0.
Copyright 2022 [Hyunwoong Ko](https://github.com/hyunwoongko). All Rights Reserved.