{"id":13880539,"url":"https://github.com/ankane/tokenizers-ruby","last_synced_at":"2025-11-17T14:04:59.286Z","repository":{"id":38394295,"uuid":"471823838","full_name":"ankane/tokenizers-ruby","owner":"ankane","description":"Fast state-of-the-art tokenizers for Ruby","archived":false,"fork":false,"pushed_at":"2025-10-12T18:26:55.000Z","size":510,"stargazers_count":155,"open_issues_count":0,"forks_count":8,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-11-13T08:15:33.356Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ankane.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2022-03-19T22:13:06.000Z","updated_at":"2025-11-07T01:16:52.000Z","dependencies_parsed_at":"2022-08-25T05:11:35.275Z","dependency_job_id":"3ba44bef-01f3-425e-8a99-9bcf1897f77a","html_url":"https://github.com/ankane/tokenizers-ruby","commit_stats":{"total_commits":250,"total_committers":2,"mean_commits":125.0,"dds":0.06399999999999995,"last_synced_commit":"4905da26354610f6324ad3f9ff47cfdea1de746a"},"previous_names":[],"tags_count":25,"template":false,"template_full_name":null,"purl":"pkg:github/ankane/tokenizers-ruby","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ankane%2Ftokenizers-ruby","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ankane%2Ftokenizers-ruby/tags","releases_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ankane%2Ftokenizers-ruby/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ankane%2Ftokenizers-ruby/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ankane","download_url":"https://codeload.github.com/ankane/tokenizers-ruby/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ankane%2Ftokenizers-ruby/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":284406074,"owners_count":26999550,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-14T02:00:06.101Z","response_time":56,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-06T08:03:08.642Z","updated_at":"2025-11-17T14:04:59.262Z","avatar_url":"https://github.com/ankane.png","language":"Rust","funding_links":[],"categories":["Rust"],"sub_categories":[],"readme":"# Tokenizers Ruby\n\n:slightly_smiling_face: Fast state-of-the-art [tokenizers](https://github.com/huggingface/tokenizers) for Ruby\n\n[![Build Status](https://github.com/ankane/tokenizers-ruby/actions/workflows/build.yml/badge.svg)](https://github.com/ankane/tokenizers-ruby/actions)\n\n## Installation\n\nAdd this line to your application’s Gemfile:\n\n```ruby\ngem \"tokenizers\"\n```\n\n## Getting Started\n\nLoad a pretrained 
tokenizer\n\n```ruby\ntokenizer = Tokenizers.from_pretrained(\"bert-base-cased\")\n```\n\nEncode\n\n```ruby\nencoded = tokenizer.encode(\"I can feel the magic, can you?\")\nencoded.tokens\nencoded.ids\n```\n\nDecode\n\n```ruby\ntokenizer.decode(encoded.ids)\n```\n\n## Training\n\nCreate a tokenizer\n\n```ruby\ntokenizer = Tokenizers::Tokenizer.new(Tokenizers::Models::BPE.new(unk_token: \"[UNK]\"))\n```\n\nSet the pre-tokenizer\n\n```ruby\ntokenizer.pre_tokenizer = Tokenizers::PreTokenizers::Whitespace.new\n```\n\nTrain the tokenizer ([example data](https://huggingface.co/docs/tokenizers/quicktour#build-a-tokenizer-from-scratch))\n\n```ruby\ntrainer = Tokenizers::Trainers::BpeTrainer.new(special_tokens: [\"[UNK]\", \"[CLS]\", \"[SEP]\", \"[PAD]\", \"[MASK]\"])\ntokenizer.train([\"wiki.train.raw\", \"wiki.valid.raw\", \"wiki.test.raw\"], trainer)\n```\n\nEncode\n\n```ruby\noutput = tokenizer.encode(\"Hello, y'all! How are you 😁 ?\")\noutput.tokens\n```\n\nSave the tokenizer to a file\n\n```ruby\ntokenizer.save(\"tokenizer.json\")\n```\n\nLoad a tokenizer from a file\n\n```ruby\ntokenizer = Tokenizers.from_file(\"tokenizer.json\")\n```\n\nCheck out the [Quicktour](https://huggingface.co/docs/tokenizers/quicktour) and equivalent [Ruby code](https://github.com/ankane/tokenizers-ruby/blob/master/test/quicktour_test.rb#L8) for more info\n\n## API\n\nThis library follows the [Tokenizers Python API](https://huggingface.co/docs/tokenizers/index). You can follow Python tutorials and convert the code to Ruby in many cases. Feel free to open an issue if you run into problems.\n\n## History\n\nView the [changelog](https://github.com/ankane/tokenizers-ruby/blob/master/CHANGELOG.md)\n\n## Contributing\n\nEveryone is encouraged to help improve this project. 
Here are a few ways you can help:\n\n- [Report bugs](https://github.com/ankane/tokenizers-ruby/issues)\n- Fix bugs and [submit pull requests](https://github.com/ankane/tokenizers-ruby/pulls)\n- Write, clarify, or fix documentation\n- Suggest or add new features\n\nTo get started with development:\n\n```sh\ngit clone https://github.com/ankane/tokenizers-ruby.git\ncd tokenizers-ruby\nbundle install\nbundle exec rake compile\nbundle exec rake download:files\nbundle exec rake test\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fankane%2Ftokenizers-ruby","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fankane%2Ftokenizers-ruby","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fankane%2Ftokenizers-ruby/lists"}