{"id":13484497,"url":"https://github.com/alexandru/stuff-classifier","last_synced_at":"2025-03-27T16:30:53.551Z","repository":{"id":2261179,"uuid":"3216950","full_name":"alexandru/stuff-classifier","owner":"alexandru","description":"simple text classifier(s) implemetation in ruby","archived":true,"fork":false,"pushed_at":"2018-01-17T06:31:31.000Z","size":73,"stargazers_count":449,"open_issues_count":8,"forks_count":91,"subscribers_count":24,"default_branch":"master","last_synced_at":"2024-10-30T03:37:52.594Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alexandru.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-01-19T11:19:45.000Z","updated_at":"2024-07-17T15:54:48.000Z","dependencies_parsed_at":"2022-08-28T01:40:50.686Z","dependency_job_id":null,"html_url":"https://github.com/alexandru/stuff-classifier","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexandru%2Fstuff-classifier","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexandru%2Fstuff-classifier/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexandru%2Fstuff-classifier/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexandru%2Fstuff-classifier/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alexandru","download_url":"https://codeload.github.com/alexandru/stuff-classifier/tar.gz/refs/heads/master","host":{"name":"GitHub","ur
l":"https://github.com","kind":"github","repositories_count":245882285,"owners_count":20687860,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T17:01:25.287Z","updated_at":"2025-03-27T16:30:53.175Z","avatar_url":"https://github.com/alexandru.png","language":"Ruby","readme":"# stuff-classifier\n\n## No longer maintained\n\nThis repository has not been maintained for some time. If you're interested in maintaining a fork, contact the author so that I can place a link here.\n\n## Description\n\nA library for classifying text into multiple categories.\n\nCurrently provided classifiers:\n\n- a [naive bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier)\n- a classifier based on [tf-idf weights](http://en.wikipedia.org/wiki/Tf%E2%80%93idf)\n\nI ran a benchmark of 1345 items that I had previously classified\nmanually with multiple categories. Here's the rate at which the 2\nalgorithms correctly detected one of those categories:\n\n- Bayes: 79.26%\n- Tf-Idf: 81.34%\n\nI prefer the Naive Bayes approach, because despite scoring lower on\nthis benchmark, it seems to make better decisions than I did in many\ncases. For example, an item with the title *\"Paintball Session, 100 Balls\nand Equipment\"* was classified as *\"Activities\"* by me, but the Bayes\nclassifier identified it as *\"Sports\"*, at which point I had an\nintellectual orgasm. Also, the Tf-Idf classifier seems to do better on\nclear-cut cases, but doesn't seem to handle uncertainty so well. 
Of\ncourse, these are just quick tests I made and I have no idea which is\nreally better.\n\n## Install\n\n```bash\ngem install stuff-classifier\n```\n\n## Usage\n\nYou either instantiate one class or the other. Both have the same\nsignature:\n\n```ruby\nrequire 'stuff-classifier'\n\n# for the naive bayes implementation\ncls = StuffClassifier::Bayes.new(\"Cats or Dogs\")\n\n# for the Tf-Idf based implementation\ncls = StuffClassifier::TfIdf.new(\"Cats or Dogs\")\n\n# these classifiers use word stemming by default, but if it has weird\n# behavior, then you can disable it on init:\ncls = StuffClassifier::TfIdf.new(\"Cats or Dogs\", :stemming =\u003e false)\n\n# also by default, the parsing phase filters out stop words, to\n# disable or to come up with your own list of stop words, on a\n# classifier instance you can do this:\ncls.ignore_words = [ 'the', 'my', 'i', 'dont' ]\n ```\n\nTraining the classifier:\n\n```ruby\ncls.train(:dog, \"Dogs are awesome, cats too. I love my dog\")\ncls.train(:cat, \"Cats are more preferred by software developers. I never could stand cats. I have a dog\")    \ncls.train(:dog, \"My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs\")\ncls.train(:cat, \"Cats are difficult animals, unlike dogs, really annoying, I hate them all\")\ncls.train(:dog, \"So which one should you choose? 
A dog, definitely.\")\ncls.train(:cat, \"The favorite food for cats is bird meat, although mice are good, but birds are a delicacy\")\ncls.train(:dog, \"A dog will eat anything, including birds or whatever meat\")\ncls.train(:cat, \"My cat's favorite place to purr is on my keyboard\")\ncls.train(:dog, \"My dog's favorite place to take a leak is the tree in front of our house\")\n```\n\nAnd finally, classifying stuff:\n\n```ruby\ncls.classify(\"This test is about cats.\")\n#=\u003e :cat\ncls.classify(\"I hate ...\")\n#=\u003e :cat\ncls.classify(\"The most annoying animal on earth.\")\n#=\u003e :cat\ncls.classify(\"The preferred company of software developers.\")\n#=\u003e :cat\ncls.classify(\"My precious, my favorite!\")\n#=\u003e :cat\ncls.classify(\"Get off my keyboard!\")\n#=\u003e :cat\ncls.classify(\"Kill that bird!\")\n#=\u003e :cat\n\ncls.classify(\"This test is about dogs.\")\n#=\u003e :dog\ncls.classify(\"Cats or Dogs?\") \n#=\u003e :dog\ncls.classify(\"What pet will I love more?\")    \n#=\u003e :dog\ncls.classify(\"Willy, where the heck are you?\")\n#=\u003e :dog\ncls.classify(\"I like big buts and I cannot lie.\") \n#=\u003e :dog\ncls.classify(\"Why is the front door of our house open?\")\n#=\u003e :dog\ncls.classify(\"Who is eating my meat?\")\n#=\u003e :dog\n```\n\n## Persistency\n\nThe following layers for saving the training data between sessions are\nimplemented:\n\n- in memory (by default)\n- on disk\n- Redis\n- (coming soon) in a RDBMS\n\nTo persist the data in Redis, you can do this:\n```ruby\n# defaults to redis running on localhost on default port\nstore = StuffClassifier::RedisStorage.new(@key)\n\n# pass in connection args\nstore = StuffClassifier::RedisStorage.new(@key, {host:'my.redis.server.com', port: 4829})\n```\n\nTo persist the data on disk, you can do this:\n\n```ruby\nstore = StuffClassifier::FileStorage.new(@storage_path)\n\n# global setting\nStuffClassifier::Base.storage = store\n\n# or alternative local setting on instantiation, by 
means of an\n# optional param ...\ncls = StuffClassifier::Bayes.new(\"Cats or Dogs\", :storage =\u003e store)\n\n# after training is done, to persist the data ...\ncls.save_state\n\n# or you could just do this:\nStuffClassifier::Bayes.open(\"Cats or Dogs\") do |cls|\n  # when done, save_state is called on END\nend\n\n# to start fresh, deleting the saved training data for this classifier\nStuffClassifier::Bayes.new(\"Cats or Dogs\", :purge_state =\u003e true)\n```\n\nThe name you give your classifier is important, because the data is\nloaded and saved under that name. For instance, the following 3 classifiers\nare stored in different buckets and are independent of each other.\n\n```ruby\ncls1 = StuffClassifier::Bayes.new(\"Cats or Dogs\")\ncls2 = StuffClassifier::Bayes.new(\"True or False\")\ncls3 = StuffClassifier::Bayes.new(\"Spam or Ham\")\n```\n\n## License\n\nMIT Licensed. See LICENSE.txt for details.\n\n\n","funding_links":[],"categories":["Ruby","Scientific"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falexandru%2Fstuff-classifier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falexandru%2Fstuff-classifier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falexandru%2Fstuff-classifier/lists"}