{"id":25056992,"url":"https://github.com/andrewmcloud/consimilo","last_synced_at":"2025-04-14T11:56:45.360Z","repository":{"id":62432130,"uuid":"115877264","full_name":"andrewmcloud/consimilo","owner":"andrewmcloud","description":"A Clojure library for querying large data-sets on similarity","archived":false,"fork":false,"pushed_at":"2019-02-17T18:46:06.000Z","size":549,"stargazers_count":63,"open_issues_count":1,"forks_count":4,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-03-28T01:05:58.300Z","etag":null,"topics":["clojure","collaborative-filtering","cosine-distance","data-sketches","data-sketching","document-similarity","hamming-distance","jaccard-similarity","lsh","lsh-forest","minhash","minhash-lsh-algorithm","plagiarism-detection","recommender-system","similarity","similarity-search"],"latest_commit_sha":null,"homepage":"","language":"Clojure","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/andrewmcloud.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-12-31T17:39:02.000Z","updated_at":"2024-05-31T07:51:33.000Z","dependencies_parsed_at":"2022-11-01T21:01:07.125Z","dependency_job_id":null,"html_url":"https://github.com/andrewmcloud/consimilo","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andrewmcloud%2Fconsimilo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andrewmcloud%2Fconsimilo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andrewmcloud%2Fconsimilo/releases","manifests_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andrewmcloud%2Fconsimilo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/andrewmcloud","download_url":"https://codeload.github.com/andrewmcloud/consimilo/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248372752,"owners_count":21093138,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clojure","collaborative-filtering","cosine-distance","data-sketches","data-sketching","document-similarity","hamming-distance","jaccard-similarity","lsh","lsh-forest","minhash","minhash-lsh-algorithm","plagiarism-detection","recommender-system","similarity","similarity-search"],"created_at":"2025-02-06T13:46:39.713Z","updated_at":"2025-04-14T11:56:45.257Z","avatar_url":"https://github.com/andrewmcloud.png","language":"Clojure","readme":"# consimilo\n\n[![Build Status](https://travis-ci.org/andrewmcloud/consimilo.svg?branch=master)](https://travis-ci.org/andrewmcloud/consimilo)\n[![Clojars Project](https://img.shields.io/clojars/v/consimilo.svg)](https://clojars.org/consimilo)\n\n## A Clojure library for querying large data-sets on similarity\n\nconsimilo is a library that utilizes locality sensitive hashing (implemented as lsh-forest) and minhashing, to support \n*top-k* similar item queries. Finding similar items across expansive data-sets is a common problem that presents itself \nin many real world applications (e.g. 
finding articles from the same source, plagiarism detection, collaborative \nfiltering, context filtering, document similarity, etc...). Searching a corpus for *top-k* similar items quickly grows \nto an unwieldy complexity at relatively small corpus sizes *(n choose 2)*. LSH reduces the search space by \"hashing\" \nitems in such a way that collisions occur as a result of similarity. Once the items are hashed and indexed the \nlsh-forest supports a *top-k* most similar items query of ~*O(log n)*. There is an accuracy trade-off that comes with \nthe enormous increase in query speed. More information can be found in chapter 3 of \n[Mining Massive Datasets](http://infolab.stanford.edu/~ullman/mmds/ch3.pdf).\n\n## Getting Started\n\nAdd consimilo as a dependency in your project.clj:\n\n```clojure\n[consimilo \"0.1.1\"]\n```\n\nThe main methods you are likely to need are all located in [`core.clj`](./src/consimilo/core.clj). \nImport it with something like:\n\n```clojure\n(ns my-ns (:require [consimilo.core :as consimilo]))\n```\n\n## Building a forest\n\nFirst you need to load the candidates vector into a forest. This vector can represent any arbitrary information \n(e.g. tokens in a document, ngrams, metadata about users, content interactions, context surrounding \ninteractions). The candidates vector must be a collection of maps, each representing an item. The map will have an \n`:id` key which is used to reference the minhash vector in the forest and a `:features` key which is a vector \ncontaining the individual features. `[{:id id1 :features [feature1 feature2 ... featuren]} ... ]`\n\n### Adding feature vectors to a forest\n\nOnce your candidates vector is in the correct form, you can add the items to the forest:\n\n```clojure\n(def my-forest (consimilo/add-all-to-forest candidates-vector))           ;;creates new forest, my-forest\n```\n\nYou can continue to add to this forest by passing it as the first argument to `add-all-to-forest`. 
The forest data \nstructure is stored in an atom, so the existing forest is modified in place. \n\nNote: every call to `add-all-to-forest` runs an expensive sort to enable *O(log n)* queries. It is \nbetter to add all items to the forest at once or, in the case of a live system, to add new items to the forest in batches \noffline and replace the production forest.\n\n```clojure\n(consimilo/add-all-to-forest my-forest new-candidates-vector)             ;;updates my-forest in place\n```\n\n### Adding strings and files to a forest (helper functions)\n\nconsimilo provides helper functions for constructing feature vectors from strings and files. By default, a new forest \nis created and stopwords are removed. You may add to an existing forest and/or keep stopwords via the optional \nparameters `:forest` and `:remove-stopwords?`, which default to `:forest (new-forest)` and `:remove-stopwords? true`.\n\nAdd a collection of strings to a **new** forest and **remove** stopwords:\n\n```clojure\n(def my-forest (consimilo/add-strings-to-forest\n                 [{:id id1 :features \"my sample string 1\"}\n                  {:id id2 :features \"my sample string 2\"}]))\n```\n\nAdd a collection of strings to an **existing** forest and **do not remove** stopwords:\n\n```clojure\n(consimilo/add-strings-to-forest [{:id id1 :features \"my sample string 1\"}\n                                  {:id id2 :features \"my sample string 2\"}]\n                                 :forest my-forest\n                                 :remove-stopwords? false)         ;;updates my-forest in place\n```\n\nAdd a collection of files to a **new** forest and **remove** stopwords:\n\n```clojure\n(def my-forest (consimilo/add-files-to-forest\n                 [FileObj-1 FileObj-2 FileObj-3 FileObj-n]))              ;;creates new forest, my-forest\n```\n\nNote: when calling `add-files-to-forest`, `:id` is auto-generated from the file name and `:features` are generated from \nthe tokenized, extracted text. 
The same optional parameters available for `add-strings-to-forest` are also available for \n`add-files-to-forest`.\n\n## Querying a forest\n\nOnce you have your forest built, you can query for the `k` most similar items to feature-vector `v` by running:\n\n```clojure\n(def results (consimilo/query-forest my-forest k v))\n\n(println (:top-k results)) ;;list of keys ordered by similarity\n(println (:query-hash results)) ;;minhash of the query, used to calculate similarity\n```\n\n### Querying a forest with strings and files (helper functions)\n\nconsimilo provides helper functions for querying the forest with strings and files. The helper functions `query-string` \nand `query-file` have an optional parameter `:remove-stopwords?`, which defaults to `true` (stopwords are removed). Queries \nagainst strings and files should be made using the same tokenization scheme used to input items in the forest \n(stopwords present or removed).\n\nQuerying a forest with a string:\n\n```clojure\n(def results (consimilo/query-string my-forest k \"my query string\"))\n\n(println (:top-k results)) ;;list of keys ordered by similarity\n(println (:query-hash results)) ;;minhash of the query, used to calculate similarity\n```\n\nQuerying a forest with a file:\n\n```clojure\n(def results (consimilo/query-file my-forest k FileObj))\n\n(println (:top-k results)) ;;list of keys ordered by similarity\n(println (:query-hash results)) ;;minhash of the query, used to calculate similarity\n```\n\n## Calculating similarity\n\nconsimilo provides functions for calculating approximate distance / similarity between the query and *top-k* results. \nThe function `similarity-k` accepts optional parameters to specify which distance / similarity function should be used. 
\nFor Jaccard similarity, use `:sim-fn :jaccard`; for Hamming distance, use `:sim-fn :hamming`; \nand for cosine distance, use `:sim-fn :cosine`. `similarity-k` returns a hash-map whose `keys` are the *top-k* ids \nand whose `vals` are the similarity / distance. As with the other query functions, queries against strings and files \nshould be made using the same tokenization scheme used to input the items in the forest (stopwords present or removed).\n\n### Querying a forest with strings, files, or feature-vectors and calculating similarity\n\nconsimilo will dispatch to the correct query function based on query type (string, file, collection of features). There are three similarity functions available: `:cosine`, `:jaccard`, and `:hamming`.\n\n```clojure\n(def similar-items (consimilo/similarity-k\n                     my-forest\n                     k\n                     query\n                     :sim-fn :cosine))\n\n(println similar-items) ;;{id1 (cosine-distance (query id1)) ... idk (cosine-distance (query idk))}\n```\n\n## Saving and loading forests\n\nconsimilo uses [Nippy](https://github.com/ptaoussanis/nippy) to provide simple, robust serialization / deserialization \nof your forests.\n\nSerialize and save a forest to a file:\n```clojure\n(consimilo/freeze-forest my-forest \"/tmp/my-saved-forest\")\n```\n\nLoad a forest from a file:\n```clojure\n(def my-forest (consimilo/thaw-forest \"/tmp/my-saved-forest\"))\n```\n\n## Configuration\n\nconsimilo uses [config](https://github.com/yogthos/config) to manage configuration. 
consimilo has three configurable \noptions: \n   * Number of trees in the forest (default 8): `:trees`\n   * Number of permutation functions used to build the minhash (default 128): `:perms`\n   * Random number seed used to generate minhash functions (default 1): `:seed`\n\nThe defaults should work well in most cases; however, they may be overridden by placing a `config.edn` file in the \nresources directory of your project. See [`config.edn`](./resources/config.edn). \n\n## Contributions / Issues\n\nPlease use the project's GitHub issues page for questions, ideas, etc. Pull requests are welcome.\n\n## License\n\nCopyright 2018 Andrew McLoud\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at\n\n[http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandrewmcloud%2Fconsimilo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fandrewmcloud%2Fconsimilo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandrewmcloud%2Fconsimilo/lists"}