Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ankane/youtokentome-ruby
High performance unsupervised text tokenization for Ruby
https://github.com/ankane/youtokentome-ruby
bpe byte-pair-encoding npl tokenization unsupervised-learning word-segmentation
Last synced: about 1 month ago
JSON representation
High performance unsupervised text tokenization for Ruby
- Host: GitHub
- URL: https://github.com/ankane/youtokentome-ruby
- Owner: ankane
- License: mit
- Created: 2020-02-24T02:04:47.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2023-12-27T17:43:13.000Z (7 months ago)
- Last Synced: 2024-05-21T10:05:53.720Z (about 2 months ago)
- Topics: bpe, byte-pair-encoding, npl, tokenization, unsupervised-learning, word-segmentation
- Language: Ruby
- Homepage:
- Size: 31.3 KB
- Stars: 21
- Watchers: 5
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt
Lists
- awesome-stars - ankane/youtokentome-ruby - High performance unsupervised text tokenization for Ruby (Ruby)
README
# YouTokenToMe Ruby
[YouTokenToMe](https://github.com/VKCOM/YouTokenToMe) - high performance unsupervised text tokenization - for Ruby
Learn more about [how it works](https://medium.com/@vktech/youtokentome-a-tool-for-quick-text-tokenization-from-the-vk-team-aa6341215c5a)
[![Build Status](https://github.com/ankane/youtokentome-ruby/workflows/build/badge.svg?branch=master)](https://github.com/ankane/youtokentome-ruby/actions)
## Installation
Add this line to your application’s Gemfile:
```ruby
gem "youtokentome"
```## Getting Started
Dump your text to a file
```txt
Blazingly fast tokenization!
```Train a model
```ruby
model = YouTokenToMe::BPE.train(data: "train.txt", model: "model.txt", vocab_size: 30000)
```Load a model
```ruby
model = YouTokenToMe::BPE.new("model.txt")
```Get vocab
```ruby
model.vocab
```Encode
```ruby
model.encode(sentences)
```Decode
```ruby
model.decode(ids)
```Convert between ids and subwords
```ruby
model.subword_to_id(subword)
model.id_to_subword(id)
```## Options
Train
```ruby
YouTokenToMe::BPE.train(
data: "train.txt", # path to file with training data
model: "model.txt", # path to where the trained model will be saved
vocab_size: 30000, # number of tokens in the final vocabulary
coverage: 1.0, # fraction of characters covered by the model
n_threads: -1, # number of parallel threads used to run
pad_id: 0, # reserved id for padding
unk_id: 1, # reserved id for unknown symbols
bos_id: 2, # reserved id for begin of sentence token
eos_id: 3 # reserved id for end of sentence token
)
```Encode
```ruby
model.encode(
sentences,
output_type: :id, # or :subword
bos: false, # add "beginning of sentence" token
eos: false, # add "end of sentence" token
reverse: false, # reverse output sequence of tokens
dropout_prob: 0.0 # BPE-dropout probability
)
```## History
View the [changelog](https://github.com/ankane/youtokentome-ruby/blob/master/CHANGELOG.md)
## Contributing
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- [Report bugs](https://github.com/ankane/youtokentome-ruby/issues)
- Fix bugs and [submit pull requests](https://github.com/ankane/youtokentome-ruby/pulls)
- Write, clarify, or fix documentation
- Suggest or add new featuresTo get started with development:
```sh
git clone https://github.com/ankane/youtokentome-ruby.git
cd youtokentome-ruby
bundle install
bundle exec rake compile
bundle exec rake test
```