https://github.com/madcato/word2vec-rb
Ruby interface gem to use word2vec arithmetics.
- Host: GitHub
- URL: https://github.com/madcato/word2vec-rb
- Owner: madcato
- Created: 2021-04-26T08:26:35.000Z (about 5 years ago)
- Default Branch: main
- Last Pushed: 2022-05-06T06:03:17.000Z (almost 4 years ago)
- Last Synced: 2025-04-30T10:49:58.316Z (about 1 year ago)
- Topics: machine-learning, ml, nlp, ruby, word2vec
- Language: C
- Homepage:
- Size: 46.9 KB
- Stars: 8
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG
# word2vec-rb
Gem using word2vec functionality from https://code.google.com/archive/p/word2vec/
This gem was developed using the `.c` files of Google's word2vec as a base, mostly by copy-and-paste.
## Installation
Add this line to your application's Gemfile:
```ruby
gem 'word2vec-rb'
```
And then execute:

    $ bundle install

Or install it yourself as:

    $ gem install word2vec-rb

## Usage
### Distance arithmetic: to find the nearest words, try:
```ruby
require 'word2vec'
model = Word2vec::Model.load("./data/minimal.bin")
words = model.distance("from")
words.each do |w|
  puts "#{w.first} #{w.last}"
end
```
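To make the idea of "nearest words" concrete, here is a minimal pure-Ruby sketch that ranks words by cosine similarity, the measure used by the original word2vec `distance` tool. It does not use the gem and is not the gem's implementation; the `cosine_similarity` and `nearest_words` helpers and the toy vectors are hypothetical and made up for illustration.

```ruby
# Toy embedding table; the words and values are invented for illustration.
toy_vectors = {
  "from"  => [0.9, 0.1, 0.0],
  "to"    => [0.8, 0.2, 0.1],
  "into"  => [0.7, 0.3, 0.0],
  "apple" => [0.0, 0.1, 0.9]
}

# Cosine similarity between two equal-length float arrays.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

# Returns [word, similarity] pairs, most similar first, excluding the query.
def nearest_words(vectors, query, limit = 3)
  target = vectors.fetch(query)
  vectors
    .reject { |word, _| word == query }
    .map { |word, vec| [word, cosine_similarity(target, vec)] }
    .sort_by { |_, score| -score }
    .first(limit)
end

nearest_words(toy_vectors, "from").each do |word, score|
  puts "#{word} #{score.round(4)}"
end
```

The result has the same shape as the pairs printed in the example above: a word followed by its score.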
### Analogy arithmetic: to find the analogy with three words, try:
```ruby
require 'word2vec'
model = Word2vec::Model.load("./data/minimal.bin")
words = model.analogy("spain", "madrid", "france")
# With a well-prepared (high-quality) vectors file, the first word would be "Paris"
words.each do |w|
  puts "#{w.first} #{w.last}"
end
```
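The analogy query can be read as vector arithmetic: the original word2vec `word-analogy` tool computes `vec(b) - vec(a) + vec(c)` on unit-length vectors and returns the words closest to the result. The sketch below reproduces that arithmetic in plain Ruby on invented four-dimensional vectors. The `unit_vector` and `analogy_sketch` helpers are hypothetical names, and this is an illustration under those assumptions, not the gem's code.

```ruby
# Scales a vector to unit length.
def unit_vector(vec)
  norm = Math.sqrt(vec.sum { |x| x * x })
  vec.map { |x| x / norm }
end

# Ranks candidate words by dot product with vec(b) - vec(a) + vec(c).
# The three query words are excluded from the candidates.
def analogy_sketch(vectors, a, b, c, limit = 1)
  unit = vectors.transform_values { |v| unit_vector(v) }
  target = unit.fetch(b).zip(unit.fetch(a), unit.fetch(c)).map do |vb, va, vc|
    vb - va + vc
  end
  unit
    .reject { |word, _| [a, b, c].include?(word) }
    .map { |word, vec| [word, vec.zip(target).sum { |x, y| x * y }] }
    .sort_by { |_, score| -score }
    .first(limit)
end

# Invented toy vectors: [country, capital, spain-ness, france-ness].
toy_vectors = {
  "spain"  => [1.0, 0.0, 1.0, 0.0],
  "madrid" => [0.0, 1.0, 1.0, 0.0],
  "france" => [1.0, 0.0, 0.0, 1.0],
  "paris"  => [0.0, 1.0, 0.0, 1.0],
  "berlin" => [0.0, 1.0, 0.0, 0.0]
}

best_word, = analogy_sketch(toy_vectors, "spain", "madrid", "france").first
puts best_word
```

With these toy values the top candidate is "paris", because `madrid - spain + france` lands exactly on the "paris" direction. Real vectors are noisier, so the ranking depends on the quality of the trained model.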
### Accuracy: test accuracy of the vectors:
Define a file with the analogies to test, using this format:

    : section heading
    Word1 Word2 Word3 Word4

Sample:

    : capital-common-countries
    Athens Greece Baghdad Iraq
    Athens Greece Bangkok Thailand

```ruby
require 'word2vec'
model = Word2vec::Model.load("./data/minimal.bin")
model.accuracy("./data/questions-words.txt")
# Prints the results to the terminal
```
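The questions file format is simple enough to check before running an accuracy test. The following is a small sketch of a parser for it, assuming sections start with `:` and every other non-empty line holds exactly four whitespace-separated words. `parse_analogy_questions` is a hypothetical helper, not part of the gem.

```ruby
# Parses analogy questions into { "section name" => [[w1, w2, w3, w4], ...] }.
def parse_analogy_questions(text)
  sections = Hash.new { |hash, key| hash[key] = [] }
  current = nil
  text.each_line do |line|
    line = line.strip
    next if line.empty?

    if line.start_with?(":")
      current = line.delete_prefix(":").strip
      sections[current] # registers the section even if it has no rows yet
    else
      words = line.split
      unless words.size == 4
        raise ArgumentError, "expected 4 words, got: #{line.inspect}"
      end
      sections[current] << words
    end
  end
  sections
end

sample = <<~QUESTIONS
  : capital-common-countries
  Athens Greece Baghdad Iraq
  Athens Greece Bangkok Thailand
QUESTIONS

parse_analogy_questions(sample).each do |section, rows|
  puts "#{section}: #{rows.size} questions"
end
```

Validating the file this way surfaces malformed lines early, before they reach the accuracy run.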
### Vocabulary: create a vocabulary file from a train file:
```ruby
require 'word2vec'
Word2vec::Model.build_vocab("./data/text7", "./data/vocab.txt")
```
The output file will list each word with its number of occurrences, one entry per line.
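As a rough illustration of consuming such a file from Ruby, the sketch below parses vocabulary text into a hash of counts. It assumes each line has the form `word count` separated by whitespace, as written by the original word2vec `SaveVocab` routine; this layout is an assumption about the gem's output, and `parse_vocab` and the sample data are hypothetical.

```ruby
# Parses "word count" lines into { word => count }.
# Lines without both fields are skipped.
def parse_vocab(text)
  text.each_line.with_object({}) do |line, counts|
    word, count = line.split
    next if word.nil? || count.nil?

    counts[word] = Integer(count)
  end
end

# Invented sample content, not real output from the gem.
sample = "the 12\nof 7\ncat 3\n"

parse_vocab(sample).each do |word, count|
  puts "#{word} appears #{count} times"
end
```

For a real file, the text could be read with `File.read("./data/vocab.txt")` and passed to the same helper.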
### Tokenizer: create a binary file by tokenizing an input file
This method requires a previously created vocabulary file.
```ruby
require 'word2vec'
Word2vec::Model.tokenize("./data/text7", "./data/vocab.txt", "./data/tokenized.bin")
```
The output file will contain a sequence of binary identifiers, one for each word of the input file.
Read the output file with:

    long long id;
    fread(&id, sizeof(id), 1, fi);

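The same file can also be read from Ruby. This is a minimal sketch, assuming the identifiers are stored back to back as native-endian 64-bit signed integers, which is what the C `long long` read above implies on common platforms. `read_token_ids` is a hypothetical helper, and the sketch writes its own temporary file so it runs without the gem.

```ruby
require "tempfile"

# Reads every 64-bit signed native-endian integer from a binary file.
def read_token_ids(path)
  File.binread(path).unpack("q*")
end

# Build a stand-in for a tokenized file with invented identifiers.
Tempfile.create("tokenized") do |file|
  file.binmode
  file.write([0, 5, 42, 5].pack("q*"))
  file.flush
  p read_token_ids(file.path) # prints [0, 5, 42, 5]
end
```

The `q` directive of `String#unpack` matches a signed 64-bit integer in native byte order, so a file written on one machine should be read on a machine with the same endianness.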
### Load the **word2vec** output bin file (*vectors.bin*), into ruby array
```ruby
require 'word2vec'
vector_array = Word2vec::Model.load_vectors("./data/minimal.bin")
```
The `vector_array` variable will contain an array of pairs, each holding a vocabulary word and the vector of float values for that word.
Set parameter `normalize: true` to normalize the vectors.
```ruby
require 'word2vec'
vector_array = Word2vec::Model.load_vectors("./data/minimal.bin", normalize: true)
```
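For intuition about what normalization does to such an array, here is a plain-Ruby sketch that scales each vector to unit Euclidean length, the convention used by the original word2vec tools. It assumes the pairs have the shape `[word, [floats]]` described above. Whether the gem normalizes in exactly this way is an assumption, and `normalize_pairs` is a hypothetical helper.

```ruby
# Returns a new array of [word, vector] pairs with unit-length vectors.
# Zero vectors are returned unchanged to avoid dividing by zero.
def normalize_pairs(pairs)
  pairs.map do |word, vector|
    norm = Math.sqrt(vector.sum { |x| x * x })
    [word, norm.zero? ? vector : vector.map { |x| x / norm }]
  end
end

# Invented sample data in the assumed [word, vector] shape.
sample = [["from", [3.0, 4.0]], ["to", [0.0, 2.0]]]

normalize_pairs(sample).each do |word, vector|
  puts "#{word} #{vector.inspect}"
end
```

Once vectors have unit length, the dot product of two vectors equals their cosine similarity, which is why normalization is useful before distance computations.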
## Development
After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
### Build gem

    $ rake build

### Launch tests

    $ rake spec

### Build extension

    $ rake compile

## Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/madcato/word2vec-rb