{"id":13842958,"url":"https://github.com/SPuerBRead/HTMLSimilarity","last_synced_at":"2025-07-11T17:32:33.036Z","repository":{"id":45204097,"uuid":"195185011","full_name":"SPuerBRead/HTMLSimilarity","owner":"SPuerBRead","description":"Web page similarity: determine page similarity based on HTML page structure; usable for similarity computation, unauthorized-access (privilege escalation) detection, and similar tasks","archived":false,"fork":false,"pushed_at":"2019-07-27T07:17:28.000Z","size":7,"stargazers_count":271,"open_issues_count":0,"forks_count":28,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-08-05T17:34:47.334Z","etag":null,"topics":["diff","html-diff","html-similarity"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SPuerBRead.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-07-04T06:49:29.000Z","updated_at":"2024-07-04T03:10:13.000Z","dependencies_parsed_at":"2022-08-30T23:51:46.433Z","dependency_job_id":null,"html_url":"https://github.com/SPuerBRead/HTMLSimilarity","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SPuerBRead%2FHTMLSimilarity","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SPuerBRead%2FHTMLSimilarity/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SPuerBRead%2FHTMLSimilarity/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SPuerBRead%2FHTMLSimilarity/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SPuerBRead","download_url":"https://codeload.github.com/SPuerBRe
ad/HTMLSimilarity/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225745385,"owners_count":17517630,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["diff","html-diff","html-similarity"],"created_at":"2024-08-04T17:01:52.325Z","updated_at":"2024-11-21T14:30:31.976Z","avatar_url":"https://github.com/SPuerBRead.png","language":"Python","readme":"# HTMLSimilarity\nDetermine page similarity based on HTML page structure.\n\n[![PyV](https://img.shields.io/badge/python-3.7-brightgreen.svg)]()\n\nUsage\n-----------\n\n```\nfrom htmlsimilarity import get_html_similarity\n\nis_similarity, value = get_html_similarity(html_doc1, html_doc2)\n```\n\nNotes\n-----------\n\n##### Input parameters:\n* HTML document 1\n* HTML document 2\n* Number of dimensions after dimensionality reduction (default: 5000)\n\n##### Return values:\n* Whether the two pages are similar\n* Similarity value (similar when value\u003c0.2, not similar when value\u003e0.2)\n\n\nMethod\n-----------\n\nA template feature vector is derived from each page's DOM tree, and the structural similarity of the pages is computed from these template feature vectors.\n\nFor details, see: [李景阳, 张波. 网页结构相似性确定方法及装置 (Method and apparatus for determining web page structure similarity).](http://cprs.patentstar.com.cn/Search/Detail?ANE=9HCC7BGA7AHACGEA7GAA8BHA5ADA9FGF8CBA9EDA9BDC9FCG)\n\nThe approach follows the patent cited above; this project is a simple implementation of its similarity-determination part.\n\nUse cases\n-----------\n\nWhen checking for unauthorized access (privilege escalation), responses have to be compared. When the backend returns rendered HTML, it is not possible to tell directly whether unauthorized access has occurred. Conventional text-similarity comparison such as difflib, which relies on tokenization or longest common substring, is not well suited to this task, so similarity is judged from page structure instead to determine whether unauthorized access has occurred.\n","funding_links":[],"categories":["Python","Python (1887)"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSPuerBRead%2FHTMLSimilarity","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FSPuerBRead%2FHTMLSimilarity","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSPuerBRead%2FHTMLSimilarity/lists"}