Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/go-dedup/html-distance
go library for computing the proximity of the HTML pages
https://github.com/go-dedup/html-distance
Last synced: about 1 month ago
JSON representation
go library for computing the proximity of the HTML pages
- Host: GitHub
- URL: https://github.com/go-dedup/html-distance
- Owner: go-dedup
- License: bsd-3-clause
- Created: 2017-06-30T15:10:56.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-06-30T15:33:46.000Z (over 7 years ago)
- Last Synced: 2024-06-20T13:29:16.531Z (6 months ago)
- Language: Go
- Size: 5.86 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# html-distance
html-distance is a go library for computing the proximity of the HTML pages. The implementation similiarity fingerprint is Charikar's simhash.
We used BK Tree (Burkhard and Keller) for verifying if a fingerprint is closed to a set of fingerprint within a defined proximity distance.
Distance is the hamming distance of the fingerprints. Since fingerprint is of size 64 (inherited from hash/fnv), Similiarity is defined as 1 - d / 64.
In normal scenario, similarity > 95% (i.e. d>3) could be considered as duplicated html pages.
## Get the source
```
go get github.com/yahoo/gryffin/html-distance/...
```## Install
```
go install github.com/yahoo/gryffin/html-distance/cmd/html-distance
```## Command Line Interface
```
Usage of html-distance:html-distance url1 url2
```Example 1
```
$ html-distance https://www.flickr.com/photos/120759744@N07/20389369791/ https://www.flickr.com/photos/120759744@N07/20374523532/in/photostream/Fetching https://www.flickr.com/photos/120759744@N07/20389369791/, Got 200
Fetching https://www.flickr.com/photos/120759744@N07/20374523532/in/photostream/, Got 200
Feature distance is 0. HTML Similarity is 100.00%
```Example 2
```
$ html-distance https://www.yahoo.com/politics/kasichs-reception-on-gay-marriage-important-126109300441.html https://www.yahoo.com/tech/s/verizon-drop-phone-contracts-end-discounted-phones-201530971--finance.htmlFetching https://www.yahoo.com/politics/kasichs-reception-on-gay-marriage-important-126109300441.html, Got 200
Fetching https://www.yahoo.com/tech/s/verizon-drop-phone-contracts-end-discounted-phones-201530971--finance.html, Got 200
Feature distance is 2. HTML Similarity is 96.88%
```Example 3
```
$ html-distance https://www.flickr.com/photos/120759744@N07/20389369791/ https://www.yahoo.com/tech/s/verizon-drop-phone-contracts-end-discounted-phones-201530971--finance.htmlFetching https://www.flickr.com/photos/120759744@N07/20389369791/, Got 200
Fetching https://www.yahoo.com/tech/s/verizon-drop-phone-contracts-end-discounted-phones-201530971--finance.html, Got 200
Feature distance is 9. HTML Similarity is 85.94%
```