https://github.com/codelibs/minhash
This provides tools for the b-bit MinHash algorithm.
- Host: GitHub
- URL: https://github.com/codelibs/minhash
- Owner: codelibs
- License: apache-2.0
- Created: 2014-10-04T12:54:51.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2025-04-13T08:42:53.000Z (3 months ago)
- Last Synced: 2025-04-13T09:39:09.475Z (3 months ago)
- Topics: java, minhash
- Language: Java
- Size: 46.9 KB
- Stars: 35
- Watchers: 8
- Forks: 10
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
README
MinHash Library
=======================

## Overview
This library provides tools for the b-bit MinHash algorithm.
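
For readers new to the technique, the following is a minimal, self-contained sketch of what a b-bit MinHash signature is. It is an illustration only and not this library's implementation: the class `BBitMinHashSketch`, its `hash` function, and the whitespace splitting are all assumptions made for the example. For each of `num` seeded hash functions it takes the minimum hash over the text's tokens and keeps only the lowest `bits` bits.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative b-bit MinHash over whitespace tokens (hypothetical, not the library's code). */
final class BBitMinHashSketch {

    /** Hypothetical seeded 32-bit hash: FNV-1a style loop followed by a bit-mixing step. */
    static int hash(String token, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        // Final mixing so that the low bits depend on all input bytes.
        h ^= h >>> 16;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        h *= 0xC2B2AE35;
        h ^= h >>> 16;
        return h;
    }

    /**
     * Returns num values; value i is the lowest `bits` bits (1..31) of the
     * minimum hash of all tokens under the hash function seeded with seed + i.
     */
    static int[] signature(String text, int bits, int seed, int num) {
        String[] tokens = text.trim().split("\\s+");
        int mask = (1 << bits) - 1;
        int[] sig = new int[num];
        for (int i = 0; i < num; i++) {
            int min = Integer.MAX_VALUE;
            for (String token : tokens) {
                min = Math.min(min, hash(token, seed + i));
            }
            sig[i] = min & mask;
        }
        return sig;
    }
}
```

Because each entry depends only on the set of tokens, texts that share most of their tokens tend to agree on most entries, which is what makes the signature usable for similarity estimation.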
### Issues/Questions
Please file an [issue](https://github.com/codelibs/minhash/issues "issue").
## Installation
### Maven
Put the following dependency into pom.xml:
```xml
<dependency>
    <groupId>org.codelibs</groupId>
    <artifactId>minhash</artifactId>
    <version>0.4.0</version>
</dependency>
```

## References
### Calculate MinHash
The `MinHash` class provides methods to calculate MinHash values.
```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.codelibs.minhash.MinHash;

// Lucene's tokenizer parses a text.
Tokenizer tokenizer = new WhitespaceTokenizer();
// The number of bits for each hash value.
int hashBit = 1;
// A base seed for hash functions.
int seed = 0;
// The number of hash functions.
int num = 128;
// Analyzer for 1-bit 128 hash with default Tokenizer (WhitespaceTokenizer).
Analyzer analyzer = MinHash.createAnalyzer(hashBit, seed, num);
// Analyzer for 1-bit 128 hash with custom Tokenizer.
Analyzer analyzer2 = MinHash.createAnalyzer(tokenizer, hashBit, seed, num);

String text = "Fess is very powerful and easily deployable Enterprise Search Server.";
// Calculate a minhash value. The size is hashBit*num bits.
byte[] minhash = MinHash.calculate(analyzer, text);
```

### Compare Texts
The `compare` method returns the similarity between two texts as a value from 0 to 1.
A value below 0.5 means that the texts are different.

```java
String text1 = "Fess is very powerful and easily deployable Search Server.";
byte[] minhash1 = MinHash.calculate(analyzer, text1);
assertEquals(0.953125f, MinHash.compare(minhash, minhash1));

// Compare a different text.
String text2 = "Solr is the popular, blazing fast open source enterprise search platform";
byte[] minhash2 = MinHash.calculate(analyzer, text2);
assertEquals(0.453125f, MinHash.compare(minhash, minhash2));
```
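
The values above are consistent with a similarity measured as the fraction of matching bits between two signatures (0.953125 = 122/128 and 0.453125 = 58/128 for a 128-bit signature). The README does not state how `compare` is implemented, so the following is only a sketch under that assumption; `BitSimilaritySketch` and its `similarity` method are hypothetical names, not part of the library.

```java
/** Illustrative bitwise similarity between two equal-length signatures (hypothetical helper). */
final class BitSimilaritySketch {

    /** Returns the fraction of bit positions at which the two byte arrays agree. */
    static float similarity(byte[] a, byte[] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException("signatures must be non-empty and of equal length");
        }
        int totalBits = a.length * 8;
        int differing = 0;
        for (int i = 0; i < a.length; i++) {
            // XOR leaves a 1 wherever the bits differ; count those bits per byte.
            differing += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
        }
        return (float) (totalBits - differing) / totalBits;
    }
}
```

Under this reading, two 16-byte signatures that differ in 6 bits score 122/128 = 0.953125. With 1-bit hashes, unrelated texts would be expected to agree on roughly half of their bits by chance, which fits the 0.5 threshold mentioned above.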