{"id":13560748,"url":"https://github.com/VKCOM/YouTokenToMe","last_synced_at":"2025-04-03T16:31:04.763Z","repository":{"id":43974553,"uuid":"190571154","full_name":"VKCOM/YouTokenToMe","owner":"VKCOM","description":"Unsupervised text tokenizer focused on computational efficiency","archived":true,"fork":false,"pushed_at":"2024-03-29T10:21:35.000Z","size":197,"stargazers_count":965,"open_issues_count":41,"forks_count":103,"subscribers_count":25,"default_branch":"master","last_synced_at":"2025-03-19T17:18:03.847Z","etag":null,"topics":["bpe","natural-language-processing","nlp","tokenization","word-segmentation"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/VKCOM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-06-06T11:38:28.000Z","updated_at":"2025-03-16T20:44:16.000Z","dependencies_parsed_at":"2024-06-18T13:38:47.209Z","dependency_job_id":"3976671d-c251-435d-88df-5ec4330edf81","html_url":"https://github.com/VKCOM/YouTokenToMe","commit_stats":{"total_commits":58,"total_committers":7,"mean_commits":8.285714285714286,"dds":0.6551724137931034,"last_synced_commit":"f4162d846057a3118222ca04a01b84297eb8a8db"},"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VKCOM%2FYouTokenToMe","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VKCOM%2FYouTokenToMe/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VKCOM%2FYouTokenToMe/rele
ases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VKCOM%2FYouTokenToMe/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/VKCOM","download_url":"https://codeload.github.com/VKCOM/YouTokenToMe/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247036948,"owners_count":20873057,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bpe","natural-language-processing","nlp","tokenization","word-segmentation"],"created_at":"2024-08-01T13:00:49.164Z","updated_at":"2025-04-03T16:31:04.444Z","avatar_url":"https://github.com/VKCOM.png","language":"C++","readme":"![PyPI](https://img.shields.io/pypi/v/youtokentome.svg)\n[![Downloads](https://pepy.tech/badge/youtokentome)](https://pepy.tech/project/youtokentome)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/python/black)\n![GitHub](https://img.shields.io/github/license/vkcom/youtokentome.svg)\n[![Build Status](https://travis-ci.org/VKCOM/YouTokenToMe.svg?branch=master)](https://travis-ci.org/VKCOM/YouTokenToMe)\n\n# YouTokenToMe \n\nYouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE) [[Sennrich et al.](https://www.aclweb.org/anthology/P16-1162)].\nOur implementation is much faster in training and tokenization than [Hugging Face](https://github.com/huggingface/tokenizers), [fastBPE](https://github.com/glample/fastBPE)\n and [SentencePiece](https://github.com/google/sentencepiece). 
In some test cases, it is 60 times faster.\n  Check out our [benchmark](benchmark.md) results.\n  \nKey advantages:\n\n* Multithreading for training and tokenization\n* The algorithm has `O(N)` complexity, where `N` is the length of training data\n* Highly efficient implementation in C++\n* Python wrapper and command-line interface\n\nExtra features:\n* BPE-dropout (as described in [Provilkov et al, 2019](https://arxiv.org/abs/1910.13267))\n\nAs in the algorithm from the original paper, ours does not consider tokens \nthat cross word boundaries. Just like in [SentencePiece](https://github.com/google/sentencepiece), all space symbols are replaced by the meta symbol \"▁\" (U+2581). This allows sequences of tokens to be converted back to text and word boundaries to be restored.\n\nFor example, the phrase ```Blazingly fast tokenization!``` can be tokenized into\n\n`['▁Bl', 'az', 'ingly', '▁fast', '▁token', 'ization', '!']`\n\n## Installation\n\n```bash\npip install youtokentome\n```\n## Python interface \n\n### Example\nLet's start with a self-contained example. 
\n\n```python\nimport random\n\nimport youtokentome as yttm\n\ntrain_data_path = \"train_data.txt\"\nmodel_path = \"example.model\"\n\n# Generating random file with training data\n# 10000 lines with 100 characters in each line\nn_lines = 10000\nn_characters = 100\nwith open(train_data_path, \"w\") as fout:\n    for _ in range(n_lines):\n        print(\"\".join([random.choice(\"abcd \") for _ in range(n_characters)]), file=fout)\n\n# Generating random text\ntest_text = \"\".join([random.choice(\"abcde \") for _ in range(100)])\n\n# Training model\nyttm.BPE.train(data=train_data_path, vocab_size=5000, model=model_path)\n\n# Loading model\nbpe = yttm.BPE(model=model_path)\n\n# Two types of tokenization\nprint(bpe.encode([test_text], output_type=yttm.OutputType.ID))\nprint(bpe.encode([test_text], output_type=yttm.OutputType.SUBWORD))\n```\n\n\u0026nbsp;\n### Training model\n```python\nyoutokentome.BPE.train(data, model, vocab_size, coverage, n_threads=-1, pad_id=0, unk_id=1, bos_id=2, eos_id=3)\n```\nTrains a BPE model and saves it to a file.\n\n**Args:**\n \n* `data`: string, path to file with training data\n* `model`: string, path to where the trained model will be saved\n* `vocab_size`: int, number of tokens in the final vocabulary\n* `coverage`: float, fraction of characters covered by the model. Must be in the range [0, 1]. A good value to use is about 0.9999.\n* `n_threads`: int, number of parallel threads used to run. If -1 is passed, all available threads will be used. Note that the number of threads is limited to 8 (see [benchmark](benchmark.md#number-of-threads)).\n* `pad_id`: int, reserved id for padding\n* `unk_id`: int, reserved id for unknown symbols\n* `bos_id`: int, reserved id for the 'begin of sentence' token\n* `eos_id`: int, reserved id for the 'end of sentence' token\n \n**Returns**: Class `youtokentome.BPE` with the loaded model.\n \n\n\u0026nbsp;\n\n### Model loading\n\n```python\nyoutokentome.BPE(model, n_threads=-1)\n```\n\nClass constructor. 
Loads the trained model.\n\n* `model`: string, path to the trained model\n* `n_threads`: int, number of parallel threads used to run. \n    If equal to -1, then the maximum number of threads available will be used.\n \n\u0026nbsp;\n  \n### Methods\nClass `youtokentome.BPE` has the following methods:\n#### encode \n```python\nencode(self, sentences, output_type=yttm.OutputType.ID, bos=False, eos=False, reverse=False, dropout_prob=0)\n```\n\n**Args:**\n  \n* `sentences`: list of strings, sentences for tokenization.\n* `output_type`: enum, sentence can be tokenized to ids or subwords. Use `OutputType.ID` for ids and `OutputType.SUBWORD` for subwords.\n* `bos`: bool, if True, the “beginning of sentence” token will be added\n* `eos`: bool, if True, the “end of sentence” token will be added\n* `reverse`: bool, if True, the output sequence of tokens will be reversed\n* `dropout_prob`: float, BPE-dropout probability (the probability of a merge being dropped). Must be in the range [0, 1].\n\n  \n**Returns:** A list of lists of integers if `output_type` is `youtokentome.OutputType.ID`, \nor a list of lists of strings if it is `youtokentome.OutputType.SUBWORD`.\n\n\u0026nbsp;\n#### vocab\n\n```python\nvocab(self)\n```\n\n**Returns:** A list of `vocab_size` strings. The i-th string in the list corresponds\n to the i-th subword.\n \n\u0026nbsp;\n#### vocab_size\n\n```python\nvocab_size(self)\n```\n\n**Returns:** int. Size of vocabulary.\n\n\u0026nbsp;\n#### subword_to_id\n\n```python\nsubword_to_id(self, subword)\n```\n**Args:**\n* `subword`: string. \n\n**Returns:** \nAn integer in the range [0, vocab_size-1]: the id of the subword, or `unk_id` \nif the subword is not in the vocabulary.\n\n\u0026nbsp;\n#### id_to_subword \n\n```python\nid_to_subword(self, id)\n```\n**Args:**\n* `id`: int, must be in the range [0, vocab_size-1]\n\n**Returns:** string. 
Subword from vocabulary by id.\n  \n\u0026nbsp;\n#### decode \n```python\ndecode(self, ids, ignore_ids=None)\n```  \nConvert each id to its subword and concatenate them, restoring spaces from the \"▁\" meta symbol.\n\n**Args:**\n\n  * `ids`: list of lists of integers. All integers must be in the range [0, vocab_size-1]\n  * `ignore_ids`: collection of integers. These ids will be ignored during decoding. All integers must be in the range [0, vocab_size-1] [default: None]\n\n  \n**Returns:** List of strings.  \n \n## Command line interface\n\n### Example \n\n```bash\n$ yttm bpe --data TRAINING_DATA_FILE --model OUTPUT_MODEL_FILE --vocab_size 2000\n$ yttm encode --model OUTPUT_MODEL_FILE --output_type subword \u003c TEST_DATA_FILE \u003e ENCODED_DATA \n```\n\n\n### Supported commands\n\n`YouTokenToMe` supports the following commands:\n\n```\n$ yttm --help\n\nUsage: yttm [OPTIONS] COMMAND [ARGS]...\n\nOptions:\n  --help  Show this message and exit.\n\nCommands:\n  bpe     Train BPE model.\n  decode  Decode ids to text.\n  encode  Encode text to ids or subwords.\n  vocab   Print list of learned subwords.\n```\n\nCommand `bpe` allows you to train a Byte Pair Encoding model on a text file.\n\n```\n$ yttm bpe --help\n\nUsage: yttm bpe [OPTIONS]\n\n  Train BPE model.\n\nOptions:\n  --data PATH           Training data file path.  [required]\n  --model PATH          Output model file path.  [required]\n  --vocab_size INTEGER  Number of tokens in the final vocabulary.  [required]\n  --coverage FLOAT      Fraction of characters covered by the model.  [default: 1.0]\n  --n_threads INTEGER   Number of threads.  [default: -1]\n  --pad_id INTEGER      Padding token id.  [default: 0]\n  --unk_id INTEGER      Unknown token id.  [default: 1]\n  --bos_id INTEGER      'Begin of sentence' token id.  [default: 2]\n  --eos_id INTEGER      'End of sentence' token id.  [default: 3]\n  --help                Show this message and exit.\n```\n\n\nCommand `encode` applies BPE encoding to a corpus of sentences. 
Use `stdin` for input and `stdout` for output.\n\nBy default, encoding works in parallel using `n_threads` threads. The number of threads is limited to\n8 (see [benchmark](benchmark.md#number-of-threads)).\n\nWith the `--stream` option, `--n_threads` will be ignored and all sentences will be processed one by one.\n Each sentence will be tokenized and written to `stdout` before the next sentence is read.\n\n\n```\n$ yttm encode --help\n\nUsage: yttm encode [OPTIONS]\n\n  Encode text to ids or subwords.\n\nOptions:\n  --model PATH         Path to file with learned model.  [required]\n  --output_type TEXT   'id' or 'subword'.  [required]\n  --n_threads INTEGER  Number of threads.  [default: -1]\n  --bos                Add token 'begin of sentence'.\n  --eos                Add token 'end of sentence'.\n  --reverse            Reverse output sequence of tokens.\n  --stream             Process each line before reading the next one.\n  --dropout_prob       BPE-dropout probability (the probability of a merge being dropped). [default: 0]\n  --help               Show this message and exit.\n```\n\nCommand `vocab` prints the vocabulary. This can be useful for understanding the model.\n\n```\n$ yttm vocab --help\n\nUsage: yttm vocab [OPTIONS]\n\n  Print list of learned subwords.\n\nOptions:\n  --model PATH  Path to file with learned model.  [required]\n  --verbose     Add merging rules.\n  --help        Show this message and exit.\n```\n\nCommand `decode` converts ids back to text. Use `stdin` for input and `stdout` for output.\n\n```\n$ yttm decode --help\n\nUsage: yttm decode [OPTIONS]\n\n  Decode ids to text.\n\nOptions:\n  --model PATH  Path to file with learned model.  [required]\n  --ignore_ids  List of indices to ignore for decoding. 
Example: --ignore_ids=1,2,3\n  --help        Show this message and exit.\n```\n","funding_links":[],"categories":["C++","Text Data and NLP","🔹 **BPE (Byte Pair Encoding) Implementations**"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FVKCOM%2FYouTokenToMe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FVKCOM%2FYouTokenToMe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FVKCOM%2FYouTokenToMe/lists"}