An open API service indexing awesome lists of open source software.

https://github.com/sing1ee/simhash-java

A simple implementation of simhash algorithm by java.
https://github.com/sing1ee/simhash-java

java simhash simhash-java

Last synced: 8 months ago
JSON representation

A simple implementation of simhash algorithm by java.

Awesome Lists containing this project

README

          

simhash-java
============

A simple implementation of simhash algorithm by java.

### Features:

1. compute the simhash of a string

2. compute the similarity between all the strings by building smart index, so we can deal with big data.

### How to use:
- run Main with inputfile and outputfile.

- The format of inputfile(see src/test_in): one doc eachline with the utf8 charset.

- The format of outputfile(see src/test_out):

- start //start flag

- first line // doc

- sencode lien // doc1\tdist where dist is the hamming distance between doc and doc1

- end //end flag

### Future:
1. Build the project to a runnable jar.

2. Improve the performace under big data.

### Note:
1. Before run Main.java, you should choose a better analyzer instead of BinaryWordSeg!