Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/go-dedup/html-distance

go library for computing the proximity of the HTML pages
https://github.com/go-dedup/html-distance

Last synced: about 1 month ago
JSON representation

go library for computing the proximity of the HTML pages

Awesome Lists containing this project

README

        

# html-distance

html-distance is a go library for computing the proximity of the HTML pages. The implementation similiarity fingerprint is Charikar's simhash.

We used BK Tree (Burkhard and Keller) for verifying if a fingerprint is closed to a set of fingerprint within a defined proximity distance.

Distance is the hamming distance of the fingerprints. Since fingerprint is of size 64 (inherited from hash/fnv), Similiarity is defined as 1 - d / 64.

In normal scenario, similarity > 95% (i.e. d>3) could be considered as duplicated html pages.

## Get the source

```
go get github.com/yahoo/gryffin/html-distance/...
```

## Install

```
go install github.com/yahoo/gryffin/html-distance/cmd/html-distance
```

## Command Line Interface

```
Usage of html-distance:

html-distance url1 url2
```

Example 1
```
$ html-distance https://www.flickr.com/photos/120759744@N07/20389369791/ https://www.flickr.com/photos/120759744@N07/20374523532/in/photostream/

Fetching https://www.flickr.com/photos/120759744@N07/20389369791/, Got 200
Fetching https://www.flickr.com/photos/120759744@N07/20374523532/in/photostream/, Got 200
Feature distance is 0. HTML Similarity is 100.00%
```

Example 2
```
$ html-distance https://www.yahoo.com/politics/kasichs-reception-on-gay-marriage-important-126109300441.html https://www.yahoo.com/tech/s/verizon-drop-phone-contracts-end-discounted-phones-201530971--finance.html

Fetching https://www.yahoo.com/politics/kasichs-reception-on-gay-marriage-important-126109300441.html, Got 200
Fetching https://www.yahoo.com/tech/s/verizon-drop-phone-contracts-end-discounted-phones-201530971--finance.html, Got 200
Feature distance is 2. HTML Similarity is 96.88%
```

Example 3
```
$ html-distance https://www.flickr.com/photos/120759744@N07/20389369791/ https://www.yahoo.com/tech/s/verizon-drop-phone-contracts-end-discounted-phones-201530971--finance.html

Fetching https://www.flickr.com/photos/120759744@N07/20389369791/, Got 200
Fetching https://www.yahoo.com/tech/s/verizon-drop-phone-contracts-end-discounted-phones-201530971--finance.html, Got 200
Feature distance is 9. HTML Similarity is 85.94%
```