https://github.com/ankane/tomoto
High performance topic modeling for Ruby
latent-dirichlet-allocation lda topic-modeling
- Host: GitHub
- URL: https://github.com/ankane/tomoto
- Owner: ankane
- License: mit
- Created: 2020-10-09T22:00:01.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2024-12-30T02:40:36.000Z (4 months ago)
- Last Synced: 2025-01-25T11:05:11.675Z (3 months ago)
- Topics: latent-dirichlet-allocation, lda, topic-modeling
- Language: C++
- Homepage:
- Size: 135 KB
- Stars: 64
- Watchers: 4
- Forks: 3
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt
Awesome Lists containing this project
- awesome-topic-models - tomoto - Ruby extension for *tomoto*, a Gibbs-sampling-based topic modeling library written in C++ (Libraries & Toolkits)
README
# tomoto.rb
:tomato: [tomoto](https://github.com/bab2min/tomotopy) - high performance topic modeling - for Ruby
## Installation
Add this line to your application’s Gemfile:
```ruby
gem "tomoto"
```

## Getting Started
Train a model
```ruby
model = Tomoto::LDA.new(k: 2)
model.add_doc(["tokens", "from", "document", "one"])
model.add_doc(["tokens", "from", "document", "two"])
model.add_doc(["tokens", "from", "document", "three"])
model.train(100) # iterations
```

Get the summary
```ruby
model.summary
```

Get topic words
```ruby
model.topic_words
```

Save the model to a file
```ruby
model.save("model.bin")
```

Load the model from a file
```ruby
model = Tomoto::LDA.load("model.bin")
```

Get topic probabilities for a document
```ruby
doc = model.docs[0]
doc.topics
```

Get the number of words for each topic
```ruby
model.count_by_topics
```

Get the vocab
```ruby
model.vocabs
```

Get the log likelihood per word
```ruby
model.ll_per_word
```

Perform inference for unseen documents
```ruby
doc = model.make_doc(["unseen", "doc"])
topic_dist, ll = model.infer(doc)
```

## Models
Supports:
- Latent Dirichlet Allocation (`LDA`)
- Labeled LDA (`LLDA`)
- Partially Labeled LDA (`PLDA`)
- Supervised LDA (`SLDA`)
- Dirichlet Multinomial Regression (`DMR`)
- Generalized Dirichlet Multinomial Regression (`GDMR`)
- Hierarchical Dirichlet Process (`HDP`)
- Hierarchical LDA (`HLDA`)
- Multi Grain LDA (`MGLDA`)
- Pachinko Allocation (`PA`)
- Hierarchical PA (`HPA`)
- Correlated Topic Model (`CT`)
- Dynamic Topic Model (`DT`)

## API
This library follows the [tomotopy API](https://bab2min.github.io/tomotopy/v0.9.0/en/). There are a few changes to make it more Ruby-like:
- The `get_` prefix has been removed from methods (`topic_words` instead of `get_topic_words`)
- Methods that return booleans use `?` instead of `is_` (`live_topic?` instead of `is_live_topic`)

If a method or option you need isn’t supported, feel free to open an issue.
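
The two renaming rules above can be sketched as a small name-mapping helper, which may be handy when reading the tomotopy docs alongside this gem. `rubyish_name` is a hypothetical function written for illustration; it is not part of tomoto, and it does not imply that every tomotopy method has been ported.

```ruby
# Hypothetical helper (not part of tomoto): maps a tomotopy method name
# to the Ruby-style name described in the API notes.
def rubyish_name(python_name)
  case python_name
  when /\Aget_(.+)\z/
    # `get_` prefix is dropped: get_topic_words -> topic_words
    Regexp.last_match(1)
  when /\Ais_(.+)\z/
    # `is_` prefix becomes a trailing `?`: is_live_topic -> live_topic?
    "#{Regexp.last_match(1)}?"
  else
    # Names without either prefix are left unchanged (e.g. add_doc)
    python_name
  end
end

puts rubyish_name("get_topic_words") # prints "topic_words"
puts rubyish_name("is_live_topic")   # prints "live_topic?"
puts rubyish_name("add_doc")         # prints "add_doc"
```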
## Examples
- [LDA](examples/lda_basic.rb)
- [HDP](examples/hdp_basic.rb)

## Performance
tomoto uses AVX2, AVX, or SSE2 instructions to increase performance on machines that support them. Check which instruction set architecture it’s using with:
```ruby
Tomoto.isa
```

## Parallelism
Choose a [parallelism algorithm](https://bab2min.github.io/tomotopy/v0.9.0/en/#parallel-sampling-algorithms) with:
```ruby
model.train(parallel: :partition)
```

Supported values are `:default`, `:none`, `:copy_merge`, and `:partition`.
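
If the parallelism setting comes from configuration or user input, it may be worth validating it against the supported values before training. The sketch below is one way to do that in plain Ruby; `SUPPORTED_PARALLEL` and `train_options` are hypothetical names introduced here and are not part of tomoto.

```ruby
# Values listed above as supported for the `parallel:` option.
SUPPORTED_PARALLEL = %i[default none copy_merge partition].freeze

# Hypothetical helper (not part of tomoto): checks a user-supplied value
# and returns keyword arguments that can be splatted into `model.train`.
def train_options(parallel)
  value = parallel.to_sym
  unless SUPPORTED_PARALLEL.include?(value)
    raise ArgumentError, "unknown parallel value: #{parallel.inspect}"
  end
  { parallel: value }
end

opts = train_options("copy_merge")
p opts

# With the gem loaded and a model built, the options could then be
# passed along as keyword arguments:
#   model.train(**opts)
```

Raising early with the offending value in the message keeps a typo such as `:copymerge` from surfacing later as a less obvious error.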
## History
View the [changelog](https://github.com/ankane/tomoto-ruby/blob/master/CHANGELOG.md)
## Contributing
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- [Report bugs](https://github.com/ankane/tomoto-ruby/issues)
- Fix bugs and [submit pull requests](https://github.com/ankane/tomoto-ruby/pulls)
- Write, clarify, or fix documentation
- Suggest or add new features

To get started with development:
```sh
git clone --recursive https://github.com/ankane/tomoto-ruby.git
cd tomoto-ruby
bundle install
bundle exec rake compile
bundle exec rake test
```