{"id":24062324,"url":"https://github.com/shaltielshmid/minhashsharp","last_synced_at":"2025-10-08T19:09:33.938Z","repository":{"id":204193924,"uuid":"711294634","full_name":"shaltielshmid/MinHashSharp","owner":"shaltielshmid","description":"A Robust Library in C# for Similarity Estimation","archived":false,"fork":false,"pushed_at":"2023-11-30T13:34:50.000Z","size":40,"stargazers_count":7,"open_issues_count":1,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-23T14:13:15.602Z","etag":null,"topics":["deduplication","deduplication-filter","lsh","lsh-algorithm","lsh-implementation","minhash","statistics"],"latest_commit_sha":null,"homepage":"","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shaltielshmid.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-10-28T19:50:43.000Z","updated_at":"2025-04-08T05:14:25.000Z","dependencies_parsed_at":"2023-11-25T20:37:15.768Z","dependency_job_id":null,"html_url":"https://github.com/shaltielshmid/MinHashSharp","commit_stats":null,"previous_names":["shaltielshmid/minhashsharp"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shaltielshmid%2FMinHashSharp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shaltielshmid%2FMinHashSharp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shaltielshmid%2FMinHashSharp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shaltielshmid%2FMinHashSharp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/owners/shaltielshmid","download_url":"https://codeload.github.com/shaltielshmid/MinHashSharp/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250447987,"owners_count":21432165,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deduplication","deduplication-filter","lsh","lsh-algorithm","lsh-implementation","minhash","statistics"],"created_at":"2025-01-09T08:39:51.280Z","updated_at":"2025-10-08T19:09:28.907Z","avatar_url":"https://github.com/shaltielshmid.png","language":"C#","readme":"# MinHashSharp - A Robust Library for Similarity Estimation\n\n[![NuGet](https://img.shields.io/nuget/v/MinHashSharp.svg)](https://www.nuget.org/packages/MinHashSharp/)\n\n`MinHashSharp` offers a simple lightweight data structure designed to index and estimate Jaccard similarity between sets. Leveraging its robust structure, it has been successfully tested on datasets as large as 60GB, encompassing tens of millions of documents, while ensuring smooth and efficient operations.\n\n## Installation\n\nTo incorporate MinHashSharp into your project, choose one of the following methods:\n\n### .NET CLI\n```bash\ndotnet add package MinHashSharp\n```\n\n### NuGet Package Manager\n```powershell\nInstall-Package MinHashSharp\n```\n\nFor detailed package information, visit [MinHashSharp on NuGet](https://www.nuget.org/packages/MinHashSharp/).\n\n## Key Features\n\nThe library currently offers two classes:\n\n`MinHash`: A probabilistic data structure for computing Jaccard similarity between sets. 
\n\n`MinHashLSH`: A class for fast querying over large datasets using an approximate Jaccard similarity threshold.\n\n## Sample usage\n\n```cs\n// Three sentences to compare; s1 and s3 differ by a single word.\nstring s1 = \"The quick brown fox jumps over the lazy dog and proceeded to run towards the other room\";\nstring s2 = \"The slow purple elephant runs towards the happy fox and proceeded to run towards the other room\";\nstring s3 = \"The quick brown fox jumps over the angry dog and proceeded to run towards the other room\";\n\n// Build a MinHash signature from the whitespace-separated tokens of each sentence.\nvar m1 = new MinHash(numPerm: 128).Update(s1.Split());\nvar m2 = new MinHash(numPerm: 128).Update(s2.Split());\nvar m3 = new MinHash(numPerm: 128).Update(s3.Split());\n\n// Estimated Jaccard similarity between s1 and s2.\nConsole.WriteLine(m1.Jaccard(m2)); // 0.51\n\nvar lsh = new MinHashLSH(threshold: 0.8, numPerm: 128);\n\n// Index s1 and s2 under their keys.\nlsh.Insert(\"s1\", m1);\nlsh.Insert(\"s2\", m2);\n\n// Query for indexed keys that are approximately similar to s3.\nConsole.WriteLine(string.Join(\", \", lsh.Query(m3))); // s1\n```\n\n## Multi-threading\n\nThe library is entirely thread-safe except for the `MinHashLSH.Insert` function (and the custom injected hash function, if relevant). Therefore, you can create `MinHash` objects on multiple threads and query the same `MinHashLSH` object freely. If you are indexing sets on multiple threads, make sure to gain exclusive access to the `MinHashLSH` object around every `Insert` call:\n\n```cs\nlock (lsh) {\n    lsh.Insert(\"s3\", m3);\n}\n```\n\n## Custom hash function\n\nBy default, the library uses the [FarmHash function](https://opensource.googleblog.com/2014/03/introducing-farmhash.html) introduced by Google for efficiency. 
For more accurate hashes, you can inject a custom hash function into the `MinHash` object.\n\nFor example, to use the default C# string hash function:\n\n```cs\nstatic uint StringHash(string s) =\u003e (uint)s.GetHashCode();\n\nvar m = new MinHash(numPerm: 128, hashFunc: StringHash).Update(s1.Split());\n```\n\nNote that `string.GetHashCode()` is randomized per process on .NET Core and later, so signatures built with it are only comparable within the same run.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshaltielshmid%2Fminhashsharp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshaltielshmid%2Fminhashsharp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshaltielshmid%2Fminhashsharp/lists"}