https://github.com/matiskay/html-similarity
Compare html similarity using structural and style metrics
- Host: GitHub
- URL: https://github.com/matiskay/html-similarity
- Owner: matiskay
- License: bsd-3-clause
- Created: 2017-10-26T13:19:43.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2023-05-11T13:04:52.000Z (almost 2 years ago)
- Last Synced: 2024-11-08T13:54:05.670Z (3 months ago)
- Topics: html, jaccard-similarity, python36, similarity
- Language: Python
- Homepage:
- Size: 64.5 KB
- Stars: 210
- Watchers: 5
- Forks: 23
- Open Issues: 6
Metadata Files:
- Readme: README.rst
- Contributing: CONTRIBUTING.md
- License: LICENSE
README
===============
HTML Similarity
===============

.. image:: https://travis-ci.org/matiskay/html-similarity.svg?branch=master
   :target: https://travis-ci.org/matiskay/html-similarity

.. image:: https://codebeat.co/badges/304915eb-48a3-46a8-9ce9-2790c82dc2b8
   :target: https://codebeat.co/projects/github-com-matiskay-html-similarity-master

This package provides a set of functions to measure the similarity between web pages.

Install
=======

The quick way::

    pip install html-similarity

How it works
============

Structural Similarity
---------------------

Compares the sequences of HTML tags of the two documents to compute the similarity.
Similarity based on tree edit distance is not implemented because it is slower than sequence comparison.

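The idea of comparing tag sequences can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: the names ``TagCollector``, ``tag_sequence`` and ``structural_similarity_sketch`` are hypothetical, and using ``html.parser`` with ``difflib.SequenceMatcher`` is an assumption made for the sketch.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records opening tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names found in an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def structural_similarity_sketch(html_a, html_b):
    """Compare two tag sequences; the result lies between 0 and 1."""
    matcher = SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b))
    return matcher.ratio()


if __name__ == "__main__":
    page_a = "<div><p>a</p></div>"
    page_b = "<div><p>b</p><p>c</p></div>"
    print(structural_similarity_sketch(page_a, page_b))
```

Only tag names are compared here, so two pages with different text but the same markup skeleton score 1.0.
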
Style Similarity
----------------

Extracts the CSS classes of each HTML document and computes the Jaccard similarity of the two sets of classes.

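As a rough sketch of this metric, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The code below is illustrative only: ``ClassCollector``, ``css_classes``, ``jaccard`` and ``style_similarity_sketch`` are hypothetical names, and returning 1.0 when neither document has any class is an assumption of the sketch, not documented behaviour of the package.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Hypothetical helper: gathers every CSS class name used in a document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


def css_classes(html):
    """Return the set of CSS class names found in an HTML string."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


def jaccard(set_a, set_b):
    """|A & B| / |A | B|; two empty sets are treated as identical (assumption)."""
    union = set_a | set_b
    if not union:
        return 1.0
    return len(set_a & set_b) / len(union)


def style_similarity_sketch(html_a, html_b):
    """Jaccard similarity of the CSS class sets of two documents."""
    return jaccard(css_classes(html_a), css_classes(html_b))


if __name__ == "__main__":
    print(style_similarity_sketch('<p class="a b">x</p>', '<p class="b c">y</p>'))
```
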
Joint Similarity (Structural Similarity and Style Similarity)
-------------------------------------------------------------

The joint similarity metric is calculated as::

    k * structural_similarity(document_1, document_2) + (1 - k) * style_similarity(document_1, document_2)

All the similarity metrics take values between 0 and 1.

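The weighted sum can be written as a small helper that takes two already computed scores. This is a sketch of the formula only: the function name ``joint_similarity_sketch`` is hypothetical, and rejecting ``k`` outside the range 0 to 1 is an assumption added so that the result stays within the same range.

```python
def joint_similarity_sketch(structural, style, k):
    """Weighted sum of two scores; k is the weight of the structural score."""
    if not 0.0 <= k <= 1.0:
        # Assumption of this sketch: k outside [0, 1] is rejected.
        raise ValueError("k must be between 0 and 1")
    return k * structural + (1 - k) * style


if __name__ == "__main__":
    # k = 1 keeps only the structural score, k = 0 only the style score.
    print(joint_similarity_sketch(0.8, 0.4, 1.0))
    print(joint_similarity_sketch(0.8, 0.4, 0.0))
    print(joint_similarity_sketch(0.8, 0.4, 0.5))
```

Because both inputs lie between 0 and 1 and the weights sum to 1, the result also lies between 0 and 1.
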
Recommendations for joint similarity
------------------------------------

Using ``k=0.3`` gives better results, because style similarity carries more information about how similar two pages are than structural similarity does.

Examples
========

Here is an example::

    In [1]: html_1 = '''
    First Document
    '''

    In [2]: html_2 = '''
    Second document Document
    '''

    In [3]: from html_similarity import style_similarity, structural_similarity, similarity

    In [4]: style_similarity(html_1, html_2)
    Out[4]: 1.0

    In [7]: structural_similarity(html_1, html_2)
    Out[7]: 0.9090909090909091

    In [8]: similarity(html_1, html_2)
    Out[8]: 0.9545454545454546

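The numbers in this session can be checked against the joint similarity formula. The sketch below plugs the two reported scores into the formula by hand; the helper name ``joint`` is hypothetical. The reported ``similarity`` value matches ``k = 0.5``, which suggests (but does not confirm) that the two metrics are weighted equally when ``k`` is not given.

```python
# Scores copied from the session above.
structural = 0.9090909090909091  # Out[7]
style = 1.0                      # Out[4]


def joint(k):
    """Joint similarity formula applied to the two scores above."""
    return k * structural + (1 - k) * style


equal_weight = joint(0.5)  # reproduces Out[8] up to floating-point rounding
style_heavy = joint(0.3)   # the recommended weighting gives more weight to style

print(equal_weight)
print(style_heavy)
```

With ``k = 0.3`` the joint score moves closer to the style score, since style then contributes 70% of the total.
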
References
==========

- The idea of sequence comparison was taken from Page Compare.
- The other ideas were taken from T. Gowda and C. A. Mattmann, "Clustering Web Pages Based on Structure and Style Similarity," 2016 IEEE 17th International Conference on Information Reuse and Integration (IRI), Pittsburgh, PA, 2016, pp. 175-180.
- Use case: Clustering web pages based on structure and style similarity.