{"id":14109429,"url":"https://github.com/Gioni06/GPT3Tokenizer","last_synced_at":"2025-08-01T08:31:59.164Z","repository":{"id":65334321,"uuid":"589894587","full_name":"Gioni06/GPT3Tokenizer","owner":"Gioni06","description":"PHP package for Byte Pair Encoding (BPE) used by GPT-3","archived":false,"fork":false,"pushed_at":"2024-01-16T23:21:28.000Z","size":597,"stargazers_count":85,"open_issues_count":1,"forks_count":19,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-07-02T16:44:04.249Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Gioni06.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-01-17T07:35:26.000Z","updated_at":"2025-06-18T22:11:36.000Z","dependencies_parsed_at":"2024-01-29T00:16:40.164Z","dependency_job_id":null,"html_url":"https://github.com/Gioni06/GPT3Tokenizer","commit_stats":{"total_commits":22,"total_committers":1,"mean_commits":22.0,"dds":0.0,"last_synced_commit":"ab9340ad822a1a018f3850e942e26c2cc4485d44"},"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/Gioni06/GPT3Tokenizer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gioni06%2FGPT3Tokenizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gioni06%2FGPT3Tokenizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gioni06%2FGPT3Tokenizer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gioni06%2FGPT3Tokenizer/manifests","owner_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Gioni06","download_url":"https://codeload.github.com/Gioni06/GPT3Tokenizer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gioni06%2FGPT3Tokenizer/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":268192551,"owners_count":24210541,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-01T02:00:08.611Z","response_time":67,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-14T10:02:18.190Z","updated_at":"2025-08-01T08:31:58.773Z","avatar_url":"https://github.com/Gioni06.png","language":"PHP","funding_links":[],"categories":["LLMs \u0026 AI APIs","PHP"],"sub_categories":["Tokenizers \u0026 Prompt Utilities"],"readme":"# GPT3Tokenizer for PHP\n\nThis is a PHP port of the GPT-3 tokenizer. 
It is based on the [original Python implementation](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2Tokenizer) and the [Node.js implementation](https://github.com/latitudegames/GPT-3-Encoder).\n\nGPT-2 and GPT-3 use a technique called byte pair encoding to convert text into a sequence of integers, which are then used as input for the model.\nWhen you interact with the OpenAI API, you may find it useful to calculate the number of tokens in a given text before sending it to the API.\n\nIf you want to learn more, read the [Summary of the tokenizers](https://huggingface.co/docs/transformers/tokenizer_summary) from Hugging Face.\n\n## tl;dr 🤖\n\nThere is a [Custom GPT for ChatGPT](https://chat.openai.com/g/g-e01hGLh44-gpt3tokenizer-guide) that can help you use this package in your software.\n\n## Support ⭐️\n\nIf you find my work useful, I would be thrilled if you could show your support by giving this project a star ⭐️.\nIt only takes a second and it would mean a lot to me. Your star will not only make me feel warm and fuzzy inside, but it will also help the project reach more people who can benefit from it.\n\n\n## Installation\nInstall the package from [Packagist](https://packagist.org/packages/gioni06/gpt3-tokenizer) using Composer:\n\n```bash\ncomposer require gioni06/gpt3-tokenizer\n```\n\n## Testing\nLoading the vocabulary files consumes a lot of memory. 
You might need to increase the PHPUnit memory limit; see https://stackoverflow.com/questions/46448294/phpunit-coverage-allowed-memory-size-of-536870912-bytes-exhausted\n\n```bash\n./vendor/bin/phpunit -d memory_limit=-1 tests/\n```\n\n## Use the configuration class\n\n```php\nuse Gioni06\\Gpt3Tokenizer\\Gpt3TokenizerConfig;\n\n// default vocab path\n// default merges path\n// caching enabled\n$defaultConfig = new Gpt3TokenizerConfig();\n\n$customConfig = new Gpt3TokenizerConfig();\n$customConfig\n    -\u003evocabPath('custom_vocab.json') // path to a custom vocabulary file\n    -\u003emergesPath('custom_merges.txt') // path to a custom merges file\n    -\u003euseCache(false); // disable caching\n```\n\n### A note on caching\nThe tokenizer tries to use `apcu` for caching; if that is not available, it falls back to a plain PHP `array`.\nYou will see slightly better performance for long texts when using the cache. The cache is enabled by default.\n\n## Encode a text\n\n```php\nuse Gioni06\\Gpt3Tokenizer\\Gpt3TokenizerConfig;\nuse Gioni06\\Gpt3Tokenizer\\Gpt3Tokenizer;\n\n$config = new Gpt3TokenizerConfig();\n$tokenizer = new Gpt3Tokenizer($config);\n$text = \"This is some text\";\n$tokens = $tokenizer-\u003eencode($text);\n// [1212,318,617,2420]\n```\n\n## Decode a text\n\n```php\nuse Gioni06\\Gpt3Tokenizer\\Gpt3TokenizerConfig;\nuse Gioni06\\Gpt3Tokenizer\\Gpt3Tokenizer;\n\n$config = new Gpt3TokenizerConfig();\n$tokenizer = new Gpt3Tokenizer($config);\n$tokens = [1212,318,617,2420];\n$text = $tokenizer-\u003edecode($tokens);\n// \"This is some text\"\n```\n\n## Count the number of tokens in a text\n\n```php\nuse Gioni06\\Gpt3Tokenizer\\Gpt3TokenizerConfig;\nuse Gioni06\\Gpt3Tokenizer\\Gpt3Tokenizer;\n\n$config = new Gpt3TokenizerConfig();\n$tokenizer = new Gpt3Tokenizer($config);\n$text = \"This is some text\";\n$numberOfTokens = $tokenizer-\u003ecount($text);\n// 4\n```\n\n## Encode a given text into chunks of tokens, with each chunk containing a specified maximum number of tokens.\n\nThis 
method is useful when handling large texts that need to be divided into smaller chunks for further processing.\n\n\n```php\nuse Gioni06\\Gpt3Tokenizer\\Gpt3TokenizerConfig;\nuse Gioni06\\Gpt3Tokenizer\\Gpt3Tokenizer;\n\n$config = new Gpt3TokenizerConfig();\n$tokenizer = new Gpt3Tokenizer($config);\n$text = \"1 2 hello，world 3 4\";\n$tokenizer-\u003eencodeInChunks($text, 5);\n// [[16, 362, 23748], [171, 120, 234, 6894, 513], [604]]\n```\n\n## Chunk a given text into text segments, with each segment containing a specified maximum number of tokens.\n\nThis method uses the `encodeInChunks` method to encode the text into Byte-Pair Encoded (BPE) tokens and then decodes each chunk of tokens back into text.\n\n```php\nuse Gioni06\\Gpt3Tokenizer\\Gpt3TokenizerConfig;\nuse Gioni06\\Gpt3Tokenizer\\Gpt3Tokenizer;\n\n$config = new Gpt3TokenizerConfig();\n$tokenizer = new Gpt3Tokenizer($config);\n$text = \"1 2 hello，world 3 4\";\n$tokenizer-\u003echunk($text, 5);\n// ['1 2 hello', '，world 3', ' 4']\n```\n\n## License\nThis project is licensed under the Apache License 2.0. See the [LICENSE](LICENSE) file for more information.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FGioni06%2FGPT3Tokenizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FGioni06%2FGPT3Tokenizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FGioni06%2FGPT3Tokenizer/lists"}