https://github.com/sing1ee/simhash-java
A simple implementation of the simhash algorithm in Java.
- Host: GitHub
- URL: https://github.com/sing1ee/simhash-java
- Owner: sing1ee
- License: mit
- Created: 2013-04-08T15:38:14.000Z (almost 13 years ago)
- Default Branch: master
- Last Pushed: 2020-10-10T06:40:48.000Z (over 5 years ago)
- Last Synced: 2025-03-31T10:38:19.502Z (10 months ago)
- Topics: java, simhash, simhash-java
- Language: Java
- Size: 1.52 MB
- Stars: 155
- Watchers: 11
- Forks: 80
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
simhash-java
============
A simple implementation of the simhash algorithm in Java.
### Features:
1. Compute the simhash of a string.
2. Compute the similarity between all input strings by building a smart index, so that large data sets can be handled.
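To make the first feature concrete, here is a minimal standalone sketch of a 64-bit simhash and the Hamming distance between two fingerprints. It is not the repository's code: the class name `SimHashSketch`, the FNV-1a token hash, unit token weights, and whitespace tokenization are all assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; class and method names are not taken from the repository. */
final class SimHashSketch {

    /** 64-bit FNV-1a hash of the token's UTF-8 bytes (an assumed choice of token hash). */
    static long fnv1a64(String token) {
        long hash = 0xcbf29ce484222325L;          // FNV offset basis
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;               // FNV prime
        }
        return hash;
    }

    /** Each token votes +1 or -1 on every bit; a bit is set when its vote sum is positive. */
    static long simhash(List<String> tokens) {
        int[] votes = new int[64];
        for (String token : tokens) {
            long h = fnv1a64(token);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= 1L << bit;
            }
        }
        return fingerprint;
    }

    /** Number of bit positions in which the two fingerprints differ. */
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        // Whitespace splitting stands in for a real analyzer here.
        long a = simhash(Arrays.asList("the quick brown fox jumps".split(" ")));
        long b = simhash(Arrays.asList("the quick brown fox leaps".split(" ")));
        System.out.println("distance = " + hammingDistance(a, b));
    }
}
```

A small Hamming distance suggests the two inputs share most of their tokens; the threshold that counts as "similar" depends on the data and the analyzer.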
### How to use:
- Run Main with an input file and an output file.
- Input file format (see src/test_in): one document per line, encoded as UTF-8.
- Output file format (see src/test_out):
  - start // start flag
  - first line // doc
  - second line // doc1\tdist, where dist is the Hamming distance between doc and doc1
  - end // end flag
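The output layout described above can be read back with a few lines of Java. The following is a sketch under stated assumptions: the flags are taken to be the literal lines `start` and `end`, a record may contain zero or more `doc1\tdist` lines, and the names `OutputFormatSketch`, `Group`, and `Neighbor` are hypothetical. Check src/test_out for the exact layout before relying on it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical reader for the start / doc / doc1\tdist / end layout. */
final class OutputFormatSketch {

    static final class Neighbor {
        final String doc;
        final int distance;
        Neighbor(String doc, int distance) { this.doc = doc; this.distance = distance; }
    }

    static final class Group {
        final String doc;
        final List<Neighbor> neighbors = new ArrayList<>();
        Group(String doc) { this.doc = doc; }
    }

    static List<Group> parse(List<String> lines) {
        List<Group> groups = new ArrayList<>();
        Group current = null;
        boolean expectDoc = false;
        for (String line : lines) {
            if (line.equals("start")) {              // assumed literal start flag
                current = null;
                expectDoc = true;
            } else if (line.equals("end")) {         // assumed literal end flag
                if (current != null) groups.add(current);
                current = null;
                expectDoc = false;
            } else if (expectDoc) {                  // first line after the flag is the doc
                current = new Group(line);
                expectDoc = false;
            } else if (current != null) {
                // Split on the last tab so a tab inside the document text is tolerated.
                int tab = line.lastIndexOf('\t');
                if (tab < 0) continue;               // skip lines without a distance field
                String doc1 = line.substring(0, tab);
                int dist = Integer.parseInt(line.substring(tab + 1).trim());
                current.neighbors.add(new Neighbor(doc1, dist));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Group> groups = parse(Arrays.asList("start", "a b c", "a b d\t3", "end"));
        for (Group g : groups) {
            System.out.println(g.doc + " has " + g.neighbors.size() + " neighbor(s)");
        }
    }
}
```

A document whose text is exactly `start` or `end` would be misread by this sketch, which is one reason to confirm the real flag format first.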
### Future:
1. Build the project into a runnable jar.
2. Improve the performance on big data.
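For context on the performance goal, one well-known way to avoid comparing every pair of fingerprints is a block index based on the pigeonhole principle: if a 64-bit fingerprint is split into 4 blocks of 16 bits, any two fingerprints within Hamming distance 3 must agree exactly on at least one block. The sketch below illustrates that idea only. Whether the repository's index works this way is not stated here, and the class name `BlockIndexSketch` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical block index over 64-bit fingerprints; not the repository's index. */
final class BlockIndexSketch {
    private static final int BLOCKS = 4;        // 4 blocks support distances up to 3
    private static final int BLOCK_BITS = 16;   // 4 * 16 = 64 bits

    // One table per block position: block value -> fingerprints having that value.
    private final List<Map<Integer, List<Long>>> tables = new ArrayList<>();

    BlockIndexSketch() {
        for (int i = 0; i < BLOCKS; i++) {
            tables.add(new HashMap<>());
        }
    }

    private static int block(long fingerprint, int i) {
        return (int) ((fingerprint >>> (i * BLOCK_BITS)) & 0xFFFFL);
    }

    void add(long fingerprint) {
        for (int i = 0; i < BLOCKS; i++) {
            tables.get(i)
                  .computeIfAbsent(block(fingerprint, i), k -> new ArrayList<>())
                  .add(fingerprint);
        }
    }

    /** Returns indexed fingerprints within maxDistance bits of the query. */
    List<Long> query(long fingerprint, int maxDistance) {
        if (maxDistance >= BLOCKS) {
            throw new IllegalArgumentException("maxDistance must be below " + BLOCKS);
        }
        // Candidates share at least one block with the query.
        Set<Long> candidates = new LinkedHashSet<>();
        for (int i = 0; i < BLOCKS; i++) {
            List<Long> bucket = tables.get(i).get(block(fingerprint, i));
            if (bucket != null) {
                candidates.addAll(bucket);
            }
        }
        // Verify each candidate with an exact Hamming distance check.
        List<Long> hits = new ArrayList<>();
        for (long candidate : candidates) {
            if (Long.bitCount(candidate ^ fingerprint) <= maxDistance) {
                hits.add(candidate);
            }
        }
        return hits;
    }
}
```

The trade-off is memory for speed: each fingerprint is stored once per block, and in exchange a query inspects only the buckets that share a block with it rather than the whole collection.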
### Note:
1. Before running Main.java, you should choose a better analyzer than BinaryWordSeg!
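The note does not say which analyzer to use, and the repository's analyzer interface is not shown here. As one possibility for text with delimited words, the JDK's `java.text.BreakIterator` can split a line into word tokens. The class `WordTokenizerSketch` below is a hypothetical standalone helper, not a drop-in replacement for BinaryWordSeg; languages written without word delimiters typically need a dedicated segmenter instead.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical tokenizer built on the JDK's word-boundary iterator. */
final class WordTokenizerSketch {

    /** Returns lower-cased word tokens, dropping punctuation and whitespace pieces. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String piece = text.substring(start, end);
            // Keep only pieces that begin with a letter or digit.
            if (Character.isLetterOrDigit(piece.codePointAt(0))) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world!"));
    }
}
```

The quality of the tokens directly affects the fingerprints, so it is worth checking the tokenizer's output on a few real input lines before running the full comparison.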