Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ankane/fastText-ruby
Efficient text classification and representation learning for Ruby
https://github.com/ankane/fastText-ruby
Last synced: 3 months ago
JSON representation
Efficient text classification and representation learning for Ruby
- Host: GitHub
- URL: https://github.com/ankane/fastText-ruby
- Owner: ankane
- License: mit
- Created: 2019-10-26T12:53:12.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2024-05-23T14:47:27.000Z (5 months ago)
- Last Synced: 2024-07-21T13:22:24.495Z (4 months ago)
- Language: Ruby
- Homepage:
- Size: 68.4 KB
- Stars: 194
- Watchers: 8
- Forks: 10
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt
Awesome Lists containing this project
README
# fastText Ruby
[fastText](https://fasttext.cc) - efficient text classification and representation learning - for Ruby
[![Build Status](https://github.com/ankane/fastText-ruby/actions/workflows/build.yml/badge.svg)](https://github.com/ankane/fastText-ruby/actions)
## Installation
Add this line to your application’s Gemfile:
```ruby
gem "fasttext"
```## Getting Started
fastText has two primary use cases:
- [text classification](#text-classification)
- [word representations](#word-representations)## Text Classification
Prep your data
```ruby
# documents
x = [
"text from document one",
"text from document two",
"text from document three"
]# labels
y = ["ham", "ham", "spam"]
```> Use an array if a document has multiple labels
Train a model
```ruby
model = FastText::Classifier.new
model.fit(x, y)
```Get predictions
```ruby
model.predict(x)
```Save the model to a file
```ruby
model.save_model("model.bin")
```Load the model from a file
```ruby
model = FastText.load_model("model.bin")
```Evaluate the model
```ruby
model.test(x_test, y_test)
```Get words and labels
```ruby
model.words
model.labels
```> Use `include_freq: true` to get their frequency
Search for the best hyperparameters
```ruby
model.fit(x, y, autotune_set: [x_valid, y_valid])
```Compress the model - significantly reduces size but sacrifices a little performance
```ruby
model.quantize
model.save_model("model.ftz")
```## Word Representations
Prep your data
```ruby
x = [
"text from document one",
"text from document two",
"text from document three"
]
```Train a model
```ruby
model = FastText::Vectorizer.new
model.fit(x)
```Get nearest neighbors
```ruby
model.nearest_neighbors("asparagus")
```Get analogies
```ruby
model.analogies("berlin", "germany", "france")
```Get a word vector
```ruby
model.word_vector("carrot")
```Get a sentence vector
```ruby
model.sentence_vector("sentence text")
```Get words
```ruby
model.words
```Save the model to a file
```ruby
model.save_model("model.bin")
```Load the model from a file
```ruby
model = FastText.load_model("model.bin")
```Use continuous bag-of-words
```ruby
model = FastText::Vectorizer.new(model: "cbow")
```## Parameters
Text classification
```ruby
FastText::Classifier.new(
lr: 0.1, # learning rate
dim: 100, # size of word vectors
ws: 5, # size of the context window
epoch: 5, # number of epochs
min_count: 1, # minimal number of word occurences
min_count_label: 1, # minimal number of label occurences
minn: 0, # min length of char ngram
maxn: 0, # max length of char ngram
neg: 5, # number of negatives sampled
word_ngrams: 1, # max length of word ngram
loss: "softmax", # loss function {ns, hs, softmax, ova}
bucket: 2000000, # number of buckets
thread: 3, # number of threads
lr_update_rate: 100, # change the rate of updates for the learning rate
t: 0.0001, # sampling threshold
label_prefix: "__label__", # label prefix
verbose: 2, # verbose
pretrained_vectors: nil, # pretrained word vectors (.vec file)
autotune_metric: "f1", # autotune optimization metric
autotune_predictions: 1, # autotune predictions
autotune_duration: 300, # autotune search time in seconds
autotune_model_size: nil # autotune model size, like 2M
)
```Word representations
```ruby
FastText::Vectorizer.new(
model: "skipgram", # unsupervised fasttext model {cbow, skipgram}
lr: 0.05, # learning rate
dim: 100, # size of word vectors
ws: 5, # size of the context window
epoch: 5, # number of epochs
min_count: 5, # minimal number of word occurences
minn: 3, # min length of char ngram
maxn: 6, # max length of char ngram
neg: 5, # number of negatives sampled
word_ngrams: 1, # max length of word ngram
loss: "ns", # loss function {ns, hs, softmax, ova}
bucket: 2000000, # number of buckets
thread: 3, # number of threads
lr_update_rate: 100, # change the rate of updates for the learning rate
t: 0.0001, # sampling threshold
verbose: 2 # verbose
)
```## Input Files
Input can be read directly from files
```ruby
model.fit("train.txt", autotune_set: "valid.txt")
model.test("test.txt")
```Each line should be a document
```txt
text from document one
text from document two
text from document three
```For text classification, lines should start with a list of labels prefixed with `__label__`
```txt
__label__ham text from document one
__label__ham text from document two
__label__spam text from document three
```## Pretrained Models
There are a number of [pretrained models](https://fasttext.cc/docs/en/supervised-models.html) you can download
### Language Identification
Download one of the [pretrained models](https://fasttext.cc/docs/en/language-identification.html) and load it
```ruby
model = FastText.load_model("lid.176.ftz")
```Get language predictions
```ruby
model.predict("bon appétit")
```## History
View the [changelog](https://github.com/ankane/fastText-ruby/blob/master/CHANGELOG.md)
## Contributing
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- [Report bugs](https://github.com/ankane/fastText-ruby/issues)
- Fix bugs and [submit pull requests](https://github.com/ankane/fastText-ruby/pulls)
- Write, clarify, or fix documentation
- Suggest or add new featuresTo get started with development:
```sh
git clone --recursive https://github.com/ankane/fastText-ruby.git
cd fastText-ruby
bundle install
bundle exec rake compile
bundle exec rake test
```