{"id":16947879,"url":"https://github.com/hyunwoongko/gpt2-tokenizer-java","last_synced_at":"2025-07-09T17:08:20.357Z","repository":{"id":112252411,"uuid":"492172520","full_name":"hyunwoongko/gpt2-tokenizer-java","owner":"hyunwoongko","description":"Java implementation of GPT2 tokenizer.","archived":false,"fork":false,"pushed_at":"2023-02-05T19:00:26.000Z","size":651,"stargazers_count":67,"open_issues_count":7,"forks_count":17,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-04T07:11:29.488Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hyunwoongko.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-05-14T09:41:26.000Z","updated_at":"2025-02-06T07:52:53.000Z","dependencies_parsed_at":"2023-05-12T00:45:17.824Z","dependency_job_id":null,"html_url":"https://github.com/hyunwoongko/gpt2-tokenizer-java","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/hyunwoongko/gpt2-tokenizer-java","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyunwoongko%2Fgpt2-tokenizer-java","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyunwoongko%2Fgpt2-tokenizer-java/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyunwoongko%2Fgpt2-tokenizer-java/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyunwoongko%2Fgpt2-tokenizer-java/manifests","owner_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hyunwoongko","download_url":"https://codeload.github.com/hyunwoongko/gpt2-tokenizer-java/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyunwoongko%2Fgpt2-tokenizer-java/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264502167,"owners_count":23618557,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-13T21:48:46.769Z","updated_at":"2025-07-09T17:08:20.291Z","avatar_url":"https://github.com/hyunwoongko.png","language":"Java","funding_links":[],"categories":["人工智能"],"sub_categories":["自然语言处理"],"readme":"# GPT2 Tokenizer Java\nJava implementation of GPT2 tokenizer\n\n## Requirements\nPlease install the following dependencies to use the library.\n\n```\nimplementation 'com.google.api-client:google-api-client:1.32.2'\nimplementation 'org.apache.commons:commons-lang3:3.12.0'\nimplementation 'org.springframework.boot:spring-boot-starter-web'\n\ntestImplementation 'org.junit.jupiter:junit-jupiter-api:5.3.1'\ntestRuntimeOnly 'org.junit.jupiter:junit-jupiter-engine:5.3.1'\n```\n\n## Add tokenizer files to resources directory\nPlease add the `encoder.json` and `vocab.bpe` files to your project's resources directory.\nThese files can be found [here](https://github.com/hyunwoongko/gpt2-tokenizer-java/tree/master/src/main/resources/tokenizers/gpt2).\n\n## Usage\nThe following are simple examples of how to use this library.\nFor the corresponding test code, refer to 
[here](https://github.com/hyunwoongko/gpt2-tokenizer-java/blob/master/src/test/java/ai/tunib/tokenizer/GPT2TokenizerTest.java).\n\n### Encoding text to tokens\n```java\nimport ai.tunib.tokenizer.GPT2Tokenizer;\nimport java.util.List;\n\nGPT2Tokenizer tokenizer = GPT2Tokenizer.fromPretrained(\"PATH/IN/RESOURCES\");\nList\u003cInteger\u003e result = tokenizer.encode(\"Hello my name is Kevin.\");\n```\n```\n[15496, 616, 1438, 318, 7939, 13]\n```\n\n### Decoding tokens to text\n```java\nimport ai.tunib.tokenizer.GPT2Tokenizer;\nimport java.util.List;\n\nGPT2Tokenizer tokenizer = GPT2Tokenizer.fromPretrained(\"PATH/IN/RESOURCES\");\nString result = tokenizer.decode(List.of(15496, 616, 1438, 318, 7939, 13));\n```\n```\n\"Hello my name is Kevin.\"\n```\n\n## License\n\nThis project is licensed under the terms of the Apache License 2.0.\n\nCopyright 2022 [Hyunwoong Ko](https://github.com/hyunwoongko). All Rights Reserved.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhyunwoongko%2Fgpt2-tokenizer-java","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhyunwoongko%2Fgpt2-tokenizer-java","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhyunwoongko%2Fgpt2-tokenizer-java/lists"}