SimHash similarities algorithm implementation for PHP
https://github.com/tgalopin/simhashphp

SimHashPHP
==========

> This is the second version of SimHashPHP. If you are using version 1 and don't want to
> update your code, please refer to the `1.0-security` branch (https://github.com/tgalopin/SimHashPhp/tree/1.0-security).
> The 1.0 branch will be maintained until the release of v3, but only v2 will receive the latest features.

What is SimHashPHP?
-------------------

SimHashPHP is a PHP library that ports the SimHash algorithm to PHP.
This algorithm, created by Moses Charikar, provides an efficient way to compute a similarity index between two texts.
It is used internally by Google to detect duplicate content.
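
To make the idea concrete, here is a minimal plain-PHP sketch of how a SimHash fingerprint can be computed. It is an illustration under stated assumptions, not the library's implementation: the function name `sketchSimHash`, the ASCII word tokenizer, the use of `md5` as the per-token hash, and word frequency as the token weight are all choices made for this example.

``` php
<?php

/**
 * Illustrative SimHash sketch (hypothetical helper, not part of SimHashPHP).
 * Returns the fingerprint as a string of '0' and '1' characters.
 * $bits must not exceed 128, the size of an md5 digest.
 */
function sketchSimHash(string $text, int $bits = 64): string
{
    // Split into lowercase ASCII words; each distinct word is weighted by its frequency.
    $tokens = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $weights = array_count_values($tokens);

    // One signed counter per fingerprint bit.
    $votes = array_fill(0, $bits, 0);

    foreach ($weights as $token => $weight) {
        // Expand the md5 hex digest of the token into a 128-character bit string.
        $hashBits = '';
        foreach (str_split(md5((string) $token)) as $hex) {
            $hashBits .= str_pad(decbin(hexdec($hex)), 4, '0', STR_PAD_LEFT);
        }

        // Each token votes +weight where its hash bit is 1, -weight where it is 0.
        for ($i = 0; $i < $bits; $i++) {
            $votes[$i] += ($hashBits[$i] === '1') ? $weight : -$weight;
        }
    }

    // A fingerprint bit is set when its counter ends up positive.
    $fingerprint = '';
    foreach ($votes as $vote) {
        $fingerprint .= ($vote > 0) ? '1' : '0';
    }

    return $fingerprint;
}

// A 64-character string of 0s and 1s.
$fingerprint = sketchSimHash('The quick brown fox jumps over the lazy dog');
```

Texts that share most of their words cast mostly the same votes, so their fingerprints tend to differ in only a few bit positions. Comparing two texts then reduces to comparing two short bit strings.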

See ["SimHash or the way to compare quickly two datasets"](https://titouangalopin.com/2014/06/29/simhash/) for more information.

[![Build Status](https://secure.travis-ci.org/tgalopin/SimHashPhp.png?branch=master)](http://travis-ci.org/tgalopin/SimHashPhp)

How to use it?
--------------

Install it with [Composer](https://getcomposer.org):

``` sh
composer require tga/simhash-php
```

Once installed, include `vendor/autoload.php` to load the library.

The concept of SimHash is described in [this article](https://titouangalopin.com/2014/06/29/simhash/). Here is an example:

``` php
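<?php
require 'vendor/autoload.php';
// Assumed setup: the extractor and comparator class names are not confirmed by this README.
$simhash = new \Tga\SimHash\SimHash();
$extractor = new \Tga\SimHash\Extractor\SimpleTextExtractor();
$comparator = new \Tga\SimHash\Comparator\GaussianComparator(3);

// $text1 and $text2 are the two strings to compare (defined elsewhere).
$fp1 = $simhash->hash($extractor->extract($text1), \Tga\SimHash\SimHash::SIMHASH_64);
$fp2 = $simhash->hash($extractor->extract($text2), \Tga\SimHash\SimHash::SIMHASH_64);

var_dump($fp1->getBinary());
var_dump($fp2->getBinary());

// Similarity index between 0 and 1, e.g. 0.80073740291681
var_dump($comparator->compare($fp1, $fp2));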
```
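
For intuition about what comparing two fingerprints involves, the sketch below counts the bit positions in which two fingerprints differ (the Hamming distance) and maps that count linearly to an index between 0 and 1. It assumes fingerprints are available as strings of `0` and `1` characters, which is what `getBinary()` is assumed to return here. Both helper names are hypothetical, and the linear mapping is a simplification chosen for this example, not the library's comparator.

``` php
<?php

// Hypothetical helper, not part of SimHashPHP: counts differing bit positions.
function hammingDistance(string $a, string $b): int
{
    if (strlen($a) !== strlen($b)) {
        throw new InvalidArgumentException('Fingerprints must have the same length.');
    }

    $distance = 0;
    for ($i = 0, $n = strlen($a); $i < $n; $i++) {
        if ($a[$i] !== $b[$i]) {
            $distance++;
        }
    }

    return $distance;
}

// Hypothetical helper: maps the distance to an index in [0, 1], where 1 means identical.
function bitSimilarity(string $a, string $b): float
{
    $length = strlen($a);

    return $length === 0 ? 1.0 : 1.0 - hammingDistance($a, $b) / $length;
}

$distance = hammingDistance('10110010', '10011010'); // 2
$similarity = bitSimilarity('10110010', '10011010'); // 0.75
```

A linear mapping treats every differing bit equally. A comparator may instead weight small distances more generously, so the values produced by this sketch should not be expected to match the index shown in the example above.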

License
-------

This library is released under the MIT license (see LICENSE.md).

About
-----

SimHashPHP is mainly developed by Titouan Galopin.

Reporting an issue or a feature request
---------------------------------------

Issues and feature requests are tracked in the [GitHub issue tracker](https://github.com/tgalopin/SimHashPhp/issues).