{"id":27646226,"url":"https://github.com/willkirkmanm/byte-pair-encoding","last_synced_at":"2025-04-24T01:17:40.345Z","repository":{"id":289151681,"uuid":"970208621","full_name":"WillKirkmanM/byte-pair-encoding","owner":"WillKirkmanM","description":"The Large Language Model Tokenizer Algorithm","archived":false,"fork":false,"pushed_at":"2025-04-21T19:28:43.000Z","size":2003,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-21T20:37:29.788Z","etag":null,"topics":["byte-pair-encoding","parsonlabs"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/WillKirkmanM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-21T16:48:13.000Z","updated_at":"2025-04-21T19:28:47.000Z","dependencies_parsed_at":"2025-04-21T20:49:13.656Z","dependency_job_id":null,"html_url":"https://github.com/WillKirkmanM/byte-pair-encoding","commit_stats":null,"previous_names":["willkirkmanm/byte-pair-encoding"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WillKirkmanM%2Fbyte-pair-encoding","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WillKirkmanM%2Fbyte-pair-encoding/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WillKirkmanM%2Fbyte-pair-encoding/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WillKirkmanM%2Fbyte-pair-encoding/manifests
","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/WillKirkmanM","download_url":"https://codeload.github.com/WillKirkmanM/byte-pair-encoding/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250540909,"owners_count":21447428,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["byte-pair-encoding","parsonlabs"],"created_at":"2025-04-24T01:17:39.824Z","updated_at":"2025-04-24T01:17:40.337Z","avatar_url":"https://github.com/WillKirkmanM.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\r\n  \u003cimg src=\"https://avatars.githubusercontent.com/u/138057124?s=200\u0026v=4\" width=\"150\" /\u003e\r\n\u003c/p\u003e\r\n\u003ch1 align=\"center\"\u003eByte Pair Encoding\u003c/h1\u003e\r\n\r\n\u003cp align=\"center\"\u003eThe Large Language Model Tokenizer Algorithm\u003c/p\u003e\r\n\r\n\r\nThis project provides a command-line tool implemented in C++ for training a Byte Pair Encoding (BPE) model on a text corpus and encoding text using the learned model.\r\n\r\n## How BPE Works\r\n\r\nByte Pair Encoding is a data compression technique that is commonly used in Natural Language Processing (NLP) for tokenisation. It helps manage large vocabularies and handle unknown words.\r\n\r\nHere's a simplified overview of the BPE **training** process:\r\n\r\n1.  
**Initialisation**:\r\n    *   Start with a vocabulary consisting of all individual characters (or bytes) present in the training corpus.\r\n    *   Represent the corpus as a sequence of these initial character/byte tokens.\r\n\r\n2.  **Iteration**:\r\n    *   Count the frequency of all adjacent pairs of tokens in the current sequence.\r\n    *   Identify the most frequent pair (e.g., 't' followed by 'h').\r\n    *   **Merge** this most frequent pair into a single new token (e.g., 'th').\r\n    *   Add this new token to the vocabulary.\r\n    *   Replace all occurrences of the original pair in the sequence with the new merged token.\r\n\r\n3.  **Repeat**:\r\n    *   Repeat the iteration step (counting, finding the most frequent pair, merging) for a predetermined number of merges or until the desired vocabulary size is reached.\r\n\r\nThe result of training is:\r\n*   A **vocabulary** containing the initial characters/bytes and the new merged tokens.\r\n*   An ordered list of **merge rules** indicating which pairs were merged to create which new tokens.\r\n\r\n**Encoding** new text involves:\r\n1.  Splitting the text into its initial character/byte sequence.\r\n2.  Applying the learned merge rules *in the same order they were learned during training* to the sequence until no more merges can be applied.\r\n3.  The final sequence of tokens (original characters/bytes and merged tokens) is the BPE-encoded representation.\r\n\r\n## Building the Tool\r\n\r\nThis project uses CMake to generate build files for various build systems like Make and Ninja.\r\n\r\n**Prerequisites:**\r\n*   A C++17 compliant compiler (like g++, Clang, or MSVC)\r\n*   CMake (version 3.10 or higher)\r\n*   A build tool (like `make` or `ninja`)\r\n\r\n**Steps:**\r\n\r\n1.  **Clone/Download:** Get the project files.\r\n2.  **Create Build Directory:**\r\n    ```bash\r\n    cd byte-pair-encoding\r\n    mkdir build\r\n    cd build\r\n    ```\r\n3.  
**Configure with CMake:**\r\n    *   **For Makefiles (Default on many Linux/macOS systems):**\r\n        ```bash\r\n        cmake ..\r\n        ```\r\n    *   **For Ninja (Often faster):**\r\n        ```bash\r\n        cmake -G Ninja ..\r\n        ```\r\n    *   **For Visual Studio (Windows):** Open the folder in Visual Studio, use the CMake GUI, or run from a Developer Command Prompt:\r\n        ```cmd\r\n        REM Example for VS 2019\r\n        cmake -G \"Visual Studio 16 2019\" -A x64 ..\r\n        ```\r\n4.  **Build:**\r\n    *   **If using Make:**\r\n        ```bash\r\n        make\r\n        ```\r\n    *   **If using Ninja:**\r\n        ```bash\r\n        ninja\r\n        ```\r\n    *   **If using Visual Studio:** Build the solution (`bpe_tool.sln`) within the IDE or use MSBuild:\r\n        ```cmd\r\n        msbuild bpe_tool.sln /property:Configuration=Release\r\n        ```\r\n\r\nThe executable `bpe_tool` (or `bpe_tool.exe` on Windows) will be created in the `build` directory. With the Visual Studio generator it is typically placed in a per-configuration subdirectory such as `build/Release`.\r\n\r\n## Running the Tool\r\n\r\nPlace your input text file (e.g., `shakespeare.txt`) in the main project directory. Run the tool from the `build` directory or copy the executable elsewhere.\r\n\r\n**1. Train a BPE Model:**\r\n\r\n```bash\r\n# Usage: ./bpe_tool train \u003cinput_file\u003e \u003cvocab_size\u003e \u003coutput_merges_file\u003e\r\n./bpe_tool train ../shakespeare.txt 1000 ../shakespeare.merges\r\n```\r\n*   `\u003cinput_file\u003e`: Path to the training text (e.g., `../shakespeare.txt`).\r\n*   `\u003cvocab_size\u003e`: The target total vocabulary size (initial 256 bytes + number of merges). Must be \u003e 256. Example: `1000`.\r\n*   `\u003coutput_merges_file\u003e`: Path where the learned merge rules will be saved (e.g., `../shakespeare.merges`).\r\n\r\n**2. 
Encode Text using a Trained Model:**\r\n\r\n```bash\r\n# Usage: ./bpe_tool encode \u003cinput_file\u003e \u003cmerges_file\u003e\r\n./bpe_tool encode ../my_text_to_encode.txt ../shakespeare.merges\r\n```\r\n*   `\u003cinput_file\u003e`: Path to the text you want to encode (e.g., `../my_text_to_encode.txt`). Create this file with some sample text.\r\n*   `\u003cmerges_file\u003e`: Path to the merge rules file created during training (e.g., `../shakespeare.merges`).\r\n\r\nThe tool will print the resulting sequence of token IDs to the console.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwillkirkmanm%2Fbyte-pair-encoding","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwillkirkmanm%2Fbyte-pair-encoding","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwillkirkmanm%2Fbyte-pair-encoding/lists"}