{"id":13879404,"url":"https://github.com/ankane/youtokentome-ruby","last_synced_at":"2025-07-16T15:32:13.392Z","repository":{"id":56899173,"uuid":"242626021","full_name":"ankane/youtokentome-ruby","owner":"ankane","description":"High performance unsupervised text tokenization for Ruby","archived":true,"fork":false,"pushed_at":"2023-12-27T17:43:13.000Z","size":32,"stargazers_count":21,"open_issues_count":0,"forks_count":1,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-09-22T19:38:08.768Z","etag":null,"topics":["bpe","byte-pair-encoding","npl","tokenization","unsupervised-learning","word-segmentation"],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ankane.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-02-24T02:04:47.000Z","updated_at":"2024-06-17T07:58:10.000Z","dependencies_parsed_at":"2023-01-31T07:45:47.631Z","dependency_job_id":null,"html_url":"https://github.com/ankane/youtokentome-ruby","commit_stats":null,"previous_names":["ankane/youtokentome"],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ankane%2Fyoutokentome-ruby","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ankane%2Fyoutokentome-ruby/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ankane%2Fyoutokentome-ruby/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ankane%2Fyoutokentome-ruby/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ankane","download_url":"
https://codeload.github.com/ankane/youtokentome-ruby/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":226143895,"owners_count":17580245,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bpe","byte-pair-encoding","npl","tokenization","unsupervised-learning","word-segmentation"],"created_at":"2024-08-06T08:02:19.888Z","updated_at":"2024-11-24T08:31:19.327Z","avatar_url":"https://github.com/ankane.png","language":"Ruby","funding_links":[],"categories":["Ruby"],"sub_categories":[],"readme":"# YouTokenToMe Ruby\n\n[YouTokenToMe](https://github.com/VKCOM/YouTokenToMe) - high performance unsupervised text tokenization - for Ruby\n\nLearn more about [how it works](https://medium.com/@vktech/youtokentome-a-tool-for-quick-text-tokenization-from-the-vk-team-aa6341215c5a)\n\n[![Build Status](https://github.com/ankane/youtokentome-ruby/workflows/build/badge.svg?branch=master)](https://github.com/ankane/youtokentome-ruby/actions)\n\n## Installation\n\nAdd this line to your application’s Gemfile:\n\n```ruby\ngem \"youtokentome\"\n```\n\n## Getting Started\n\nDump your text to a file\n\n```txt\nBlazingly fast tokenization!\n```\n\nTrain a model\n\n```ruby\nmodel = YouTokenToMe::BPE.train(data: \"train.txt\", model: \"model.txt\", vocab_size: 30000)\n```\n\nLoad a model\n\n```ruby\nmodel = YouTokenToMe::BPE.new(\"model.txt\")\n```\n\nGet vocab\n\n```ruby\nmodel.vocab\n```\n\nEncode\n\n```ruby\nmodel.encode(sentences)\n```\n\nDecode\n\n```ruby\nmodel.decode(ids)\n```\n\nConvert between ids and 
subwords\n\n```ruby\nmodel.subword_to_id(subword)\nmodel.id_to_subword(id)\n```\n\n## Options\n\nTrain\n\n```ruby\nYouTokenToMe::BPE.train(\n  data: \"train.txt\",   # path to file with training data\n  model: \"model.txt\",  # path where the trained model will be saved\n  vocab_size: 30000,   # number of tokens in the final vocabulary\n  coverage: 1.0,       # fraction of characters covered by the model\n  n_threads: -1,       # number of parallel threads (-1 uses all available)\n  pad_id: 0,           # reserved id for padding\n  unk_id: 1,           # reserved id for unknown symbols\n  bos_id: 2,           # reserved id for beginning of sentence token\n  eos_id: 3            # reserved id for end of sentence token\n)\n```\n\nEncode\n\n```ruby\nmodel.encode(\n  sentences,\n  output_type: :id,    # or :subword\n  bos: false,          # add \"beginning of sentence\" token\n  eos: false,          # add \"end of sentence\" token\n  reverse: false,      # reverse output sequence of tokens\n  dropout_prob: 0.0    # BPE-dropout probability\n)\n```\n\n## History\n\nView the [changelog](https://github.com/ankane/youtokentome-ruby/blob/master/CHANGELOG.md)\n\n## Contributing\n\nEveryone is encouraged to help improve this project. Here are a few ways you can help:\n\n- [Report bugs](https://github.com/ankane/youtokentome-ruby/issues)\n- Fix bugs and [submit pull requests](https://github.com/ankane/youtokentome-ruby/pulls)\n- Write, clarify, or fix documentation\n- Suggest or add new features\n\nTo get started with development:\n\n```sh\ngit clone https://github.com/ankane/youtokentome-ruby.git\ncd youtokentome-ruby\nbundle install\nbundle exec rake compile\nbundle exec rake test\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fankane%2Fyoutokentome-ruby","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fankane%2Fyoutokentome-ruby","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fankane%2Fyoutokentome-ruby/lists"}