{"id":21994412,"url":"https://github.com/codelibs/opensearch-minhash","last_synced_at":"2025-06-13T22:08:18.983Z","repository":{"id":44418154,"uuid":"423078725","full_name":"codelibs/opensearch-minhash","owner":"codelibs","description":null,"archived":false,"fork":false,"pushed_at":"2025-05-24T12:43:01.000Z","size":77,"stargazers_count":1,"open_issues_count":2,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-01T13:20:59.978Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/codelibs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2021-10-31T07:09:36.000Z","updated_at":"2025-05-24T12:43:03.000Z","dependencies_parsed_at":"2023-09-26T17:07:24.390Z","dependency_job_id":"1ec08858-ec0a-4f7f-ba4c-348f153f23b5","html_url":"https://github.com/codelibs/opensearch-minhash","commit_stats":null,"previous_names":[],"tags_count":26,"template":false,"template_full_name":null,"purl":"pkg:github/codelibs/opensearch-minhash","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codelibs%2Fopensearch-minhash","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codelibs%2Fopensearch-minhash/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codelibs%2Fopensearch-minhash/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codelibs%2Fopensearch-minhash/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/owners/codelibs","download_url":"https://codeload.github.com/codelibs/opensearch-minhash/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codelibs%2Fopensearch-minhash/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259727156,"owners_count":22902183,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-29T21:08:50.850Z","updated_at":"2025-06-13T22:08:18.962Z","avatar_url":"https://github.com/codelibs.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"OpenSearch MinHash Plugin\n[![Java CI with Maven](https://github.com/codelibs/opensearch-minhash/actions/workflows/maven.yml/badge.svg)](https://github.com/codelibs/opensearch-minhash/actions/workflows/maven.yml)\n=======================\n\n## Overview\n\nMinHash Plugin provides b-bit MinHash algorithm for OpenSearch.\nUsing a field type and a token filter provided by this plugin, you can add a minhash value to your document.\n\n## Version\n\n[Versions in Maven Repository](https://repo1.maven.org/maven2/org/codelibs/opensearch/opensearch-minhash/)\n\n### Issues/Questions\n\nPlease file an [issue](https://github.com/codelibs/opensearch-minhash/issues \"issue\").\n\n## Installation\n\n    $ $OPENSEARCH_HOME/bin/opensearch-plugin install org.codelibs.opensearch:opensearch-minhash:1.1.0\n\n## Getting Started\n\n### Add MinHash Analyzer\n\nFirst, you need to add a minhash analyzer when creating your index:\n\n    $ curl -XPUT 'localhost:9200/my_index' -d '{\n      \"index\":{\n 
       \"analysis\":{\n          \"analyzer\":{\n            \"minhash_analyzer\":{\n              \"type\":\"custom\",\n              \"tokenizer\":\"standard\",\n              \"filter\":[\"minhash\"]\n            }\n          }\n        }\n      }\n    }'\n\nYou are free to change the tokenizer/char\\_filter/filter settings, but the minhash filter must be the last filter.\n\n### Add MinHash field\n\nPut a minhash field into an index mapping:\n\n    $ curl -XPUT \"localhost:9200/my_index/_mapping\" -d '{\n      \"properties\":{\n        \"message\":{\n          \"type\":\"text\",\n          \"copy_to\":\"minhash_value\"\n        },\n        \"minhash_value\":{\n          \"type\":\"minhash\",\n          \"store\":true,\n          \"minhash_analyzer\":\"minhash_analyzer\"\n        }\n      }\n    }'\n\nThe minhash field type is a binary type.\nThe above example calculates a minhash value for the message field and stores it in the minhash\\_value field.\n\n## Get MinHash Value\n\nAdd the following document:\n\n    $ curl -XPUT \"localhost:9200/my_index/_doc/1\" -d '{\n      \"message\":\"Fess is Java based full text search server provided as OSS product.\"\n    }'\n\nThe minhash value is calculated automatically when the document is added.\nYou can check it as follows:\n\n    $ curl -XGET \"localhost:9200/my_index/_doc/1?pretty\u0026stored_fields=minhash_value,_source\"\n\nThe response is:\n\n    {\n      \"_index\" : \"my_index\",\n      \"_type\" : \"_doc\",\n      \"_id\" : \"1\",\n      \"_version\" : 1,\n      \"found\" : true,\n      \"_source\":{\n          \"message\":\"Fess is Java based full text search server provided as OSS product.\"\n        },\n      \"fields\" : {\n        \"minhash_value\" : [ \"KV5rsUfZpcZdVojpG8mHLA==\" ]\n      }\n    }\n\n## References\n\n### Change the number of bits and hashes\n\nTo change the number of bits and hashes, set them in the token filter settings:\n\n    $ curl -XPUT 'localhost:9200/my_index' -d '{\n      
\"index\":{\n        \"analysis\":{\n          \"analyzer\":{\n            \"minhash_analyzer\":{\n              \"type\":\"custom\",\n              \"tokenizer\":\"standard\",\n              \"filter\":[\"my_minhash\"]\n            }\n          },\n          \"filter\":{\n            \"my_minhash\":{\n              \"type\":\"minhash\",\n              \"seed\":100,\n              \"bit\":2,\n              \"size\":32\n            }\n          }\n        }\n      }\n    }'\n\nThe above sets the number of bits to 2, the number of hashes to 32, and the hash seed to 100.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodelibs%2Fopensearch-minhash","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcodelibs%2Fopensearch-minhash","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodelibs%2Fopensearch-minhash/lists"}