# FuzzyTools [![Build Status](https://secure.travis-ci.org/brianhempel/fuzzy_tools.png)](http://travis-ci.org/brianhempel/fuzzy_tools) [![Dependency Status](https://gemnasium.com/brianhempel/fuzzy_tools.png)](https://gemnasium.com/brianhempel/fuzzy_tools)

FuzzyTools is a toolset for fuzzy searches in Ruby. The default algorithm has been tuned for accuracy (and reasonable speed) on 23 different [test files](https://github.com/brianhempel/fuzzy_tools/tree/master/accuracy/test_data/query_tests) gathered from [many sources](https://github.com/brianhempel/fuzzy_tools/blob/master/accuracy/test_data/sources/SOURCES.txt).

Because it's mostly Ruby, FuzzyTools is best for searching smaller datasets, say less than 50 KB in size. Data cleaning and auto-complete over known options are potential uses.

Tested on Ruby 1.8.7, 1.9.2, 1.9.3, 2.0.0dev, JRuby (1.8 and 1.9 mode), and Rubinius (1.9 mode only).

## Usage

Install with [Bundler](http://gembundler.com/):

``` ruby
gem "fuzzy_tools"
```

Install without Bundler:

    gem install fuzzy_tools --no-ri --no-rdoc

Then, put it to work!

``` ruby
require 'fuzzy_tools'

books = [
  "Till We Have Faces",
  "Ecclesiastes",
  "The Prodigal God"
]

# Search for a single object

books.fuzzy_find("facade")                                   # => "Till We Have Faces"
books.fuzzy_index.find("facade")                             # => "Till We Have Faces"
FuzzyTools::TfIdfIndex.new(:source => books).find("facade")  # => "Till We Have Faces"

# Search for all matches, from best to worst

books.fuzzy_find_all("the")                             # => ["The Prodigal God", "Till We Have Faces"]
books.fuzzy_index.all("the")                            # => ["The Prodigal God", "Till We Have Faces"]
FuzzyTools::TfIdfIndex.new(:source => books).all("the") # => ["The Prodigal God", "Till We Have Faces"]

# You can also get scored results, if you need them

books.fuzzy_find_all_with_scores("the") # =>
# [
#   ["The Prodigal God",   0.443175985397319 ],
#   ["Till We Have Faces", 0.0102817553829306]
# ]
books.fuzzy_index.all_with_scores("the") # =>
# [
#   ["The Prodigal God",   0.443175985397319 ],
#   ["Till We Have Faces", 0.0102817553829306]
# ]
FuzzyTools::TfIdfIndex.new(:source => books).all_with_scores("the") # =>
# [
#   ["The Prodigal God",   0.443175985397319 ],
#   ["Till We Have Faces", 0.0102817553829306]
# ]
```

FuzzyTools is not limited to searching strings. In fact, strings work simply because FuzzyTools indexes on `to_s` by default. You can index on any method you like.

``` ruby
require 'fuzzy_tools'

Book = Struct.new(:title, :author)

books = [
  Book.new("Till We Have Faces", "C.S. Lewis" ),
  Book.new("Ecclesiastes",       "The Teacher"),
  Book.new("The Prodigal God",   "Tim Keller" )
]

books.fuzzy_find(:author => "timmy")
books.fuzzy_index(:attribute => :author).find("timmy")
FuzzyTools::TfIdfIndex.new(:source => books, :attribute => :author).find("timmy")
# => #<struct Book title="The Prodigal God", author="Tim Keller">

books.fuzzy_find_all(:author => "timmy")
books.fuzzy_index(:attribute => :author).all("timmy")
FuzzyTools::TfIdfIndex.new(:source => books, :attribute => :author).all("timmy")
# =>
# [
#   #<struct Book title="The Prodigal God", author="Tim Keller" >,
#   #<struct Book title="Ecclesiastes",     author="The Teacher">
# ]

books.fuzzy_find_all_with_scores(:author => "timmy")
books.fuzzy_index(:attribute => :author).all_with_scores("timmy")
FuzzyTools::TfIdfIndex.new(:source => books, :attribute => :author).all_with_scores("timmy")
# =>
# [
#   [#<struct Book title="The Prodigal God", author="Tim Keller" >, 0.29874954780727  ],
#   [#<struct Book title="Ecclesiastes",     author="The Teacher">, 0.0117801403002398]
# ]
```

If the objects to be searched are hashes, FuzzyTools indexes the specified hash value.

```ruby
books = [
  { :title => "Till We Have Faces", :author => "C.S. Lewis"  },
  { :title => "Ecclesiastes",       :author => "The Teacher" },
  { :title => "The Prodigal God",   :author => "Tim Keller"  }
]

books.fuzzy_find(:author => "timmy")
# => { :title => "The Prodigal God",   :author => "Tim Keller"  }
```

If you want to index on calculated data, such as a combination of fields, you can provide a proc.

``` ruby
books.fuzzy_find("timmy", :attribute => lambda { |book| book.title + " " + book.author })
books.fuzzy_index(:attribute => lambda { |book| book.title + " " + book.author }).find("timmy")
FuzzyTools::TfIdfIndex.new(:source => books, :attribute => lambda { |book| book.title + " " + book.author }).find("timmy")
```

## Can it go faster?

If you need to do multiple searches on the same collection, grab a fuzzy index with `my_collection.fuzzy_index` and do finds on that. The `fuzzy_find`, `fuzzy_find_all`, and `fuzzy_find_all_with_scores` methods on Enumerable reindex every time they are called.

Here's a performance comparison:

``` ruby
array_methods = Array.new.methods

Benchmark.bm(20) do |b|
  b.report("fuzzy_find") do
    1000.times { array_methods.fuzzy_find("juice") }
  end

  b.report("fuzzy_index.find") do
    index = array_methods.fuzzy_index
    1000.times { index.find("juice") }
  end
end
```

```
                          user     system      total        real
fuzzy_find           29.250000   0.040000  29.290000 ( 29.287992)
fuzzy_index.find      0.360000   0.000000   0.360000 (  0.360066)
```

If you need even more speed, you can [try a different tokenizer](#specifying-your-own-tokenizer).
Fewer tokens per document shortens the comparison time between documents, lessens the garbage collector load, and reduces the number of candidate documents for a given query.

If it's still too slow, [open an issue](https://github.com/brianhempel/fuzzy_tools/issues) and perhaps we can figure out what can be done.

## How does it work?

FuzzyTools downcases and then tokenizes each value using a [hybrid combination](https://github.com/brianhempel/fuzzy_tools/blob/master/lib/fuzzy_tools/tokenizers.rb#L20-27) of words, [character bigrams](http://en.wikipedia.org/wiki/N-gram), [Soundex](http://en.wikipedia.org/wiki/Soundex), and words without vowels.

``` ruby
FuzzyTools::Tokenizers::HYBRID.call("Till We Have Faces")
# => ["T400", "W000", "H100", "F220", "_t", "ti", "il", "ll", "l ", " w",
#     "we", "e ", " h", "ha", "av", "ve", "e ", " f", "fa", "ac", "ce",
#     "es", "s_", "tll", "w", "hv", "fcs", "till", "we", "have", "faces"]
```

Gross, eh?
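
For intuition, two of those ingredients, character bigrams and devoweled words, can be sketched in a few lines of plain Ruby. This is a simplified illustration, not FuzzyTools' actual code; the real tokenizer also mixes in Soundex codes and the plain words, and it bigrams the whole downcased string, spaces included:

``` ruby
# Simplified sketch of two ingredients of the hybrid tokenizer.
# Illustration only -- not FuzzyTools' implementation.

# Character bigrams, padded with "_" so the first and last characters
# produce distinctive edge tokens ("_f", "s_").
def bigrams(str)
  padded = "_#{str.downcase}_"
  (0..padded.length - 2).map { |i| padded[i, 2] }
end

# Each word with its vowels stripped out ("faces" -> "fcs").
def devoweled_words(str)
  str.downcase.split.map { |word| word.gsub(/[aeiou]/, '') }
end

bigrams("faces")                      # => ["_f", "fa", "ac", "ce", "es", "s_"]
devoweled_words("Till We Have Faces") # => ["tll", "w", "hv", "fcs"]
```
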
Ugly as they look, these are the tokens that worked best on the [test data sets](https://github.com/brianhempel/fuzzy_tools/tree/master/accuracy/test_data/query_tests).

The tokens are weighted using [Term Frequency * Inverse Document Frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf*idf), which basically assigns higher weights to the tokens that occur in fewer documents.

```ruby
# hacky introspection here--don't do this!
index = books.fuzzy_index(:attribute => :author)
index.instance_variable_get(:@document_tokens)["The Teacher"].weights.sort_by { |k,v| [-v,k] }
# =>
# [
#   ["he",      0.3910],
#   ["th",      0.3910],
#   [" t",      0.2467],
#   ["T000",    0.2467],
#   ["T260",    0.2467],
#   ["ac",      0.2467],
#   ["ch",      0.2467],
#   ["e ",      0.2467],
#   ["ea",      0.2467],
#   ["tchr",    0.2467],
#   ["te",      0.2467],
#   ["teacher", 0.2467],
#   ["the",     0.2467],
#   ["_t",      0.0910],
#   ["er",      0.0910],
#   ["r_",      0.0910]
# ]
```

When you do a query, the query string is tokenized and weighted, then compared against some of the documents using [Cosine Similarity](http://www.gettingcirrius.com/2010/12/calculating-similarity-part-1-cosine.html). Cosine similarity is not that terrible of a concept, assuming you like terms like "N-dimensional space". Basically, each unique token becomes an axis in N-dimensional space. If we had 4 different tokens in all, we'd use 4-D space. A document's token weights define a vector in this space. The _cosine_ of the _angle_ between two documents' vectors becomes the similarity between the documents.

Trust me, it works.

## Specifying your own tokenizer

If the default tokenizer isn't working for your data or you need more speed, you can try swapping out the tokenizers.
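
Swapping the tokenizer only changes the first stage of the pipeline; the weighting and comparison stay the same. As a rough mental model, here is a self-contained sketch of TF-IDF weighting plus cosine similarity over plain word tokens. It is illustrative only; FuzzyTools uses its hybrid tokenizer and its own internal weighting:

``` ruby
# Minimal TF-IDF + cosine similarity sketch, with word tokens for
# readability. Illustration only -- not FuzzyTools' internals.

def tokenize(str)
  str.downcase.split
end

# TF-IDF weight vector (a token => weight hash) for one document,
# computed against the whole corpus.
def tfidf(doc, docs)
  tokens = tokenize(doc)
  tokens.uniq.each_with_object({}) do |t, weights|
    tf = tokens.count(t).to_f / tokens.size
    df = docs.count { |d| tokenize(d).include?(t) }
    next if df.zero? # a token unseen in the corpus carries no signal here
    weights[t] = tf * Math.log(docs.size.to_f / df)
  end
end

# Cosine similarity between two sparse weight vectors (hashes).
def cosine(a, b)
  dot   = a.keys.sum { |k| a[k] * (b[k] || 0.0) }
  mag_a = Math.sqrt(a.values.sum { |v| v * v })
  mag_b = Math.sqrt(b.values.sum { |v| v * v })
  return 0.0 if mag_a.zero? || mag_b.zero?
  dot / (mag_a * mag_b)
end

docs  = ["till we have faces", "the prodigal god", "ecclesiastes"]
query_weights = tfidf("the god", docs)
scores = docs.map { |d| [d, cosine(tfidf(d, docs), query_weights)] }
best = scores.max_by { |_, score| score }.first
# best == "the prodigal god"
```

Whatever tokenizer you plug in simply changes what `tokenize` returns; the weighting and cosine machinery is indifferent to where the tokens came from.
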
You can use one of the various tokenizers defined in [`FuzzyTools::Tokenizers`](https://github.com/brianhempel/fuzzy_tools/blob/master/lib/fuzzy_tools/tokenizers.rb), or you can write your own.

``` ruby
# a predefined tokenizer
books.fuzzy_find("facade", :tokenizer => FuzzyTools::Tokenizers::CHARACTERS)
books.fuzzy_index(:tokenizer => FuzzyTools::Tokenizers::CHARACTERS).find("facade")
FuzzyTools::TfIdfIndex.new(:source => books, :tokenizer => FuzzyTools::Tokenizers::CHARACTERS).find("facade")

# roll your own
punctuation_normalizer = lambda { |str| str.downcase.split.map { |word| word.gsub(/\W/, '') } }
books.fuzzy_find("facade", :tokenizer => punctuation_normalizer)
books.fuzzy_index(:tokenizer => punctuation_normalizer).find("facade")
FuzzyTools::TfIdfIndex.new(:source => books, :tokenizer => punctuation_normalizer).find("facade")
```

## I've heard of Soft TF-IDF. It's supposed to be better than TF-IDF.

Despite the impressive graphs, the "Soft TF-IDF" described in [W. W. Cohen, P. Ravikumar, and S. E. Fienberg, "A Comparison of String Distance Metrics for Name-Matching Tasks", IIWeb, pages 73-78, 2003](http://www.cs.cmu.edu/~pradeepr/papers/ijcai03.pdf) didn't give me good results. In the paper, they tokenized by word; a standard TF-IDF tokenized by character 4-grams or 5-grams may have been more effective.

In my tests, the word-tokenized Soft TF-IDF was significantly slower and considerably less accurate than a standard TF-IDF with n-gram tokenization.

## Help make it better!

Need something added? Please [open an issue](https://github.com/brianhempel/fuzzy_tools/issues)!
Or, even better, code it yourself and send a pull request:

    # fork it on github, then clone:
    git clone git@github.com:your_username/fuzzy_tools.git
    bundle install
    rspec
    # hack away
    git push
    # then make a pull request

## Acknowledgements

The [SecondString](http://secondstring.sourceforge.net/) source code was a valuable reference.

## License

Authored by Brian Hempel. Public domain, no restrictions.