Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/alexandru/stuff-classifier
simple text classifier(s) implemetation in ruby
https://github.com/alexandru/stuff-classifier
Last synced: 13 days ago
JSON representation
simple text classifier(s) implemetation in ruby
- Host: GitHub
- URL: https://github.com/alexandru/stuff-classifier
- Owner: alexandru
- License: mit
- Archived: true
- Created: 2012-01-19T11:19:45.000Z (almost 13 years ago)
- Default Branch: master
- Last Pushed: 2018-01-17T06:31:31.000Z (almost 7 years ago)
- Last Synced: 2024-10-30T03:37:52.594Z (13 days ago)
- Language: Ruby
- Homepage:
- Size: 71.3 KB
- Stars: 449
- Watchers: 24
- Forks: 91
- Open Issues: 8
-
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
Awesome Lists containing this project
- awesome-ruby - stuff-classifier - A library for classifying text into multiple categories. (Scientific)
README
# stuff-classifier
## No longer maintained
This repository is no longer maintained for some time. If you're interested in maintaining a fork, contact the author such that I can place a link here.
## Description
A library for classifying text into multiple categories.
Currently provided classifiers:
- a [naive bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier)
- a classifier based on [tf-idf weights](http://en.wikipedia.org/wiki/Tf%E2%80%93idf)Ran a benchmark of 1345 items that I have previously manually
classified with multiple categories. Here's the rate over which the 2
algorithms have correctly detected one of those categories:- Bayes: 79.26%
- Tf-Idf: 81.34%I prefer the Naive Bayes approach, because while having lower stats on
this benchmark, it seems to make better decisions than I did in many
cases. For example, an item with title *"Paintball Session, 100 Balls
and Equipment"* was classified as *"Activities"* by me, but the bayes
classifier identified it as *"Sports"*, at which point I had an
intellectual orgasm. Also, the Tf-Idf classifier seems to do better on
clear-cut cases, but doesn't seem to handle uncertainty so well. Of
course, these are just quick tests I made and I have no idea which is
really better.## Install
```bash
gem install stuff-classifier
```## Usage
You either instantiate one class or the other. Both have the same
signature:```ruby
require 'stuff-classifier'# for the naive bayes implementation
cls = StuffClassifier::Bayes.new("Cats or Dogs")# for the Tf-Idf based implementation
cls = StuffClassifier::TfIdf.new("Cats or Dogs")# these classifiers use word stemming by default, but if it has weird
# behavior, then you can disable it on init:
cls = StuffClassifier::TfIdf.new("Cats or Dogs", :stemming => false)# also by default, the parsing phase filters out stop words, to
# disable or to come up with your own list of stop words, on a
# classifier instance you can do this:
cls.ignore_words = [ 'the', 'my', 'i', 'dont' ]
```Training the classifier:
```ruby
cls.train(:dog, "Dogs are awesome, cats too. I love my dog")
cls.train(:cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog")
cls.train(:dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs")
cls.train(:cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all")
cls.train(:dog, "So which one should you choose? A dog, definitely.")
cls.train(:cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy")
cls.train(:dog, "A dog will eat anything, including birds or whatever meat")
cls.train(:cat, "My cat's favorite place to purr is on my keyboard")
cls.train(:dog, "My dog's favorite place to take a leak is the tree in front of our house")
```And finally, classifying stuff:
```ruby
cls.classify("This test is about cats.")
#=> :cat
cls.classify("I hate ...")
#=> :cat
cls.classify("The most annoying animal on earth.")
#=> :cat
cls.classify("The preferred company of software developers.")
#=> :cat
cls.classify("My precious, my favorite!")
#=> :cat
cls.classify("Get off my keyboard!")
#=> :cat
cls.classify("Kill that bird!")
#=> :catcls.classify("This test is about dogs.")
#=> :dog
cls.classify("Cats or Dogs?")
#=> :dog
cls.classify("What pet will I love more?")
#=> :dog
cls.classify("Willy, where the heck are you?")
#=> :dog
cls.classify("I like big buts and I cannot lie.")
#=> :dog
cls.classify("Why is the front door of our house open?")
#=> :dog
cls.classify("Who is eating my meat?")
#=> :dog
```## Persistency
The following layers for saving the training data between sessions are
implemented:- in memory (by default)
- on disk
- Redis
- (coming soon) in a RDBMSTo persist the data in Redis, you can do this:
```ruby
# defaults to redis running on localhost on default port
store = StuffClassifier::RedisStorage.new(@key)# pass in connection args
store = StuffClassifier::RedisStorage.new(@key, {host:'my.redis.server.com', port: 4829})
```To persist the data on disk, you can do this:
```ruby
store = StuffClassifier::FileStorage.new(@storage_path)# global setting
StuffClassifier::Base.storage = store# or alternative local setting on instantiation, by means of an
# optional param ...
cls = StuffClassifier::Bayes.new("Cats or Dogs", :storage => store)# after training is done, to persist the data ...
cls.save_state# or you could just do this:
StuffClassifier::Bayes.open("Cats or Dogs") do |cls|
# when done, save_state is called on END
end# to start fresh, deleting the saved training data for this classifier
StuffClassifier::Bayes.new("Cats or Dogs", :purge_state => true)
```The name you give your classifier is important, as based on it the
data will get loaded and saved. For instance, following 3 classifiers
will be stored in different buckets, being independent of each other.```ruby
cls1 = StuffClassifier::Bayes.new("Cats or Dogs")
cls2 = StuffClassifier::Bayes.new("True or False")
cls3 = StuffClassifier::Bayes.new("Spam or Ham")
```## License
MIT Licensed. See LICENSE.txt for details.