{"id":18377051,"url":"https://github.com/bbc/similarity","last_synced_at":"2025-04-05T06:06:38.736Z","repository":{"id":63010133,"uuid":"2053330","full_name":"bbc/Similarity","owner":"bbc","description":"Calculate similarity between documents using TF-IDF weights","archived":false,"fork":false,"pushed_at":"2024-11-27T15:14:26.000Z","size":70,"stargazers_count":115,"open_issues_count":5,"forks_count":26,"subscribers_count":31,"default_branch":"master","last_synced_at":"2025-03-29T05:06:12.936Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bbc.png","metadata":{"files":{"readme":"README.org","changelog":null,"contributing":null,"funding":null,"license":"COPYING","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS","dei":null,"publiccode":null,"codemeta":null}},"created_at":"2011-07-15T13:53:20.000Z","updated_at":"2023-07-06T16:34:02.000Z","dependencies_parsed_at":"2025-01-09T15:45:46.084Z","dependency_job_id":"b360d9e7-413c-439c-af8a-b52a860521a6","html_url":"https://github.com/bbc/Similarity","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bbc%2FSimilarity","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bbc%2FSimilarity/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bbc%2FSimilarity/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bbc%2FSimilarity/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bbc","download_url":"https://codeload.github.com/bbc/Similarity/tar.gz/refs/heads/mast
er","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247294536,"owners_count":20915340,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T00:26:02.992Z","updated_at":"2025-04-05T06:06:38.713Z","avatar_url":"https://github.com/bbc.png","language":"Ruby","readme":"* Similarity\n\n** Overview\n\nA Ruby library for calculating the similarity between pieces of text\nusing a [[http://en.wikipedia.org/wiki/Tf–idf][Term Frequency-Inverse Document Frequency]] method.\n\nA [[http://en.wikipedia.org/wiki/Bag_of_words_model][bag of words]] model is used. Terms in the source documents are\ndowncased and punctuation is removed, but stemming is not currently\nimplemented.\n\nThis library was written to facilitate the creation of diagrams talked\nabout by Jonathan Stray in his\n[[http://jonathanstray.com/a-full-text-visualization-of-the-iraq-war-logs][full-text\nvisualization of the Iraq War Logs]] post. An example of how to\ngenerate a [[http://gephi.org/][Gephi]] compatible file including labelling of nodes with key\nwords is included in the =examples= directory.\n\nThe library depends on the [[http://www.gnu.org/software/gsl/][GNU Scientific Library]], and the [[http://rb-gsl.rubyforge.org/][gsl ruby\ngem]] but does not use sparse matrix representations to speed up the\ncalculations, since there is no support for them in the GSL. 
I am\ncurrently looking into fixing this, and would appreciate any help!\n\n** Dependencies\n\nSimilarity depends on the [[http://www.gnu.org/software/gsl/][GNU Scientific Library]] and the [[http://rb-gsl.rubyforge.org/][gsl ruby\ngem]]. On OS X with [[https://github.com/mxcl/homebrew][Homebrew]], the GSL can be\ninstalled with\n\n: brew install gsl\n\nThe =gsl= gem should then install normally. For other platforms,\nplease add the instructions to the wiki and I'll add them to this\nreadme.\n\n** Usage\n\nFirst we load some documents into the corpus\n\n: require 'similarity'\n:\n: corpus = Corpus.new\n:\n: doc1 = Document.new(:content =\u003e \"A document with a lot of additional words some of which are about chunky bacon\")\n: doc2 = Document.new(:content =\u003e \"Another longer document with many words and again about chunky bacon\")\n: doc3 = Document.new(:content =\u003e \"Some text that has nothing to do with pork products\")\n:\n: [doc1, doc2, doc3].each { |doc| corpus \u003c\u003c doc }\n\nThen to compare documents we can use the =similar_documents= method\n\n: corpus.similar_documents(doc1).each do |doc, similarity|\n:  puts \"Similarity between doc #{doc1.id} and doc #{doc.id} is #{similarity}\"\n: end\n:\n: #=\u003e\n:  Similarity between doc 70137042580340 and doc 70137042580340 is 0.9999999999999997\n:  Similarity between doc 70137042580340 and doc 70137042580240 is 0.06068602112714361\n:  Similarity between doc 70137042580340 and doc 70137042580160 is 0.04882114791611661\n\nThe cross-similarity matrix (useful for creating graphs) is also available\n\n: similarity_matrix = corpus.similarity_matrix\n\nFor more examples, see the =examples= directory.\n\n** Todo\n- Performance improvements\n  - Switch to storing document vector spaces in sparse form, using linalg or csparse?\n- (Optional) stemming of source terms\n\n** Contributing\n- Fork the project\n- Send a pull request\n- Don't touch the .gemspec, I'll do that when I release a new version\n\n** 
Author\n\n[[http://chrislowis.co.uk][Chris Lowis]] - BBC R\u0026D\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbbc%2Fsimilarity","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbbc%2Fsimilarity","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbbc%2Fsimilarity/lists"}