{"id":21994446,"url":"https://github.com/codelibs/minhash","last_synced_at":"2025-04-30T16:48:07.701Z","repository":{"id":21470798,"uuid":"24789322","full_name":"codelibs/minhash","owner":"codelibs","description":"This provides tools for b-bit MinHash algorism.","archived":false,"fork":false,"pushed_at":"2025-04-13T08:42:53.000Z","size":48,"stargazers_count":35,"open_issues_count":3,"forks_count":10,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-04-13T09:39:09.475Z","etag":null,"topics":["java","minhash"],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/codelibs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2014-10-04T12:54:51.000Z","updated_at":"2025-04-13T08:42:56.000Z","dependencies_parsed_at":"2025-04-13T09:37:12.026Z","dependency_job_id":null,"html_url":"https://github.com/codelibs/minhash","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codelibs%2Fminhash","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codelibs%2Fminhash/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codelibs%2Fminhash/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codelibs%2Fminhash/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/codelibs","download_url":"https://codeload.github.com/codelibs/minhash/tar.gz/
refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251747994,"owners_count":21637408,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["java","minhash"],"created_at":"2024-11-29T21:09:10.665Z","updated_at":"2025-04-30T16:48:07.680Z","avatar_url":"https://github.com/codelibs.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"MinHash Library\n[![Java CI with Maven](https://github.com/codelibs/minhash/actions/workflows/maven.yml/badge.svg)](https://github.com/codelibs/minhash/actions/workflows/maven.yml)\n=======================\n\n## Overview\n\nThis library provides tools for the b-bit MinHash algorithm.\n\n### Issues/Questions\n\nPlease file an [issue](https://github.com/codelibs/minhash/issues \"issue\").\n\n## Installation\n\n### Maven\n\nAdd the following dependency to pom.xml:\n\n```xml\n\u003cdependency\u003e\n  \u003cgroupId\u003eorg.codelibs\u003c/groupId\u003e\n  \u003cartifactId\u003eminhash\u003c/artifactId\u003e\n  \u003cversion\u003e0.4.0\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n## References\n\n### Calculate MinHash\n\nThe MinHash class provides tools to calculate a MinHash value.\n\n```java\nimport org.apache.lucene.analysis.Analyzer;\nimport org.apache.lucene.analysis.Tokenizer;\nimport org.apache.lucene.analysis.core.WhitespaceTokenizer;\nimport org.codelibs.minhash.MinHash;\n\n// A Lucene tokenizer splits the text into tokens.\nTokenizer tokenizer = new WhitespaceTokenizer();\n// The number of bits for each hash value.\nint hashBit = 1;\n// A base seed for hash functions.\nint seed = 0;\n// The number of hash functions.\nint num = 128;\n// Analyzer for 1-bit 128 hash with default Tokenizer 
(WhitespaceTokenizer).\nAnalyzer analyzer = MinHash.createAnalyzer(hashBit, seed, num);\n// Analyzer for 1-bit 128 hash with a custom Tokenizer.\nAnalyzer analyzer2 = MinHash.createAnalyzer(tokenizer, hashBit, seed, num);\n\nString text = \"Fess is very powerful and easily deployable Enterprise Search Server.\";\n\n// Calculate a MinHash value. Its size is hashBit * num bits.\nbyte[] minhash = MinHash.calculate(analyzer, text);\n```\n\n### Compare Texts\n\nThe `compare` method returns the similarity between two MinHash values.\nThe value ranges from 0 to 1.\nA value below 0.5 indicates that the texts are different.\n\n```java\nString text1 = \"Fess is very powerful and easily deployable Search Server.\";\nbyte[] minhash1 = MinHash.calculate(analyzer, text1);\nassertEquals(0.953125f, MinHash.compare(minhash, minhash1));\n\n// Compare a different text.\nString text2 = \"Solr is the popular, blazing fast open source enterprise search platform\";\nbyte[] minhash2 = MinHash.calculate(analyzer, text2);\nassertEquals(0.453125f, MinHash.compare(minhash, minhash2));\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodelibs%2Fminhash","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcodelibs%2Fminhash","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodelibs%2Fminhash/lists"}