{"id":18284607,"url":"https://github.com/lixi5338619/lxparse","last_synced_at":"2025-10-27T18:38:54.886Z","repository":{"id":57465235,"uuid":"527154637","full_name":"lixi5338619/lxparse","owner":"lixi5338619","description":"用于解析列表页链接和提取详细页内容的库","archived":false,"fork":false,"pushed_at":"2023-10-26T09:27:01.000Z","size":146,"stargazers_count":17,"open_issues_count":1,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-21T00:32:52.799Z","etag":null,"topics":["crawler","htmlparse","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lixi5338619.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-08-21T09:02:46.000Z","updated_at":"2024-07-15T14:13:03.000Z","dependencies_parsed_at":"2024-11-05T13:14:20.550Z","dependency_job_id":"b9ad67f8-4b11-459f-b735-f3092226593d","html_url":"https://github.com/lixi5338619/lxparse","commit_stats":{"total_commits":9,"total_committers":2,"mean_commits":4.5,"dds":"0.33333333333333337","last_synced_commit":"587258892ead223b21169af49682bab3a72ffcfe"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lixi5338619%2Flxparse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lixi5338619%2Flxparse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lixi5338619%2Flxparse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories
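/lixi5338619%2Flxparse/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lixi5338619","download_url":"https://codeload.github.com/lixi5338619/lxparse/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247305875,"owners_count":20917198,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","htmlparse","python"],"created_at":"2024-11-05T13:14:09.698Z","updated_at":"2025-10-27T18:38:49.838Z","avatar_url":"https://github.com/lixi5338619.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# lxparse\n\nA library for parsing list-page links and extracting detail-page content\n\n---\n\n## Background\n\nWe had 2,000 government and enterprise websites as information sources and needed to set up dynamic monitoring for them in a short time.\n\nWriting rules by hand means checking each site's type, analyzing its data interfaces, and then configuring parsing rules, which quickly becomes exhausting. So I wrote a method that extracts list-page links automatically.\n\nUnfortunately, websites in China are built in countless different ways (a thousand Hamlets would be an understatement), so a truly universal parsing method hardly exists. The goal here is only to make list-page link extraction as convenient as possible.\n\nIn lxparse, list-page parsing builds on readability's main-content extraction method, and detail-page parsing borrows some regular-expression matching methods from gen.\n\n---\n\n## Implementation logic\n\n#### List pages\n\n1. Extract the main body of the list page and remove irrelevant tags from the HTML, using the concentration of a tags as the main evaluation criterion.\n\n2. Use XPath rules to select the a tags in the main body, mainly those under h, ul/li, and tr/td, and return an array of links.\n\n3. Compute the similarity of all URLs in the array with the cosine formula, keep the URLs with higher similarity, and return an array of links.\n\n4. Filter the array once more, keeping only the links that match the rules.\n\n#### Detail pages\n\n- Title, author, source: matched against common rules, then the candidates are filtered and scored to pick the best result\n- Date: matched against common rules and the body text, then processed and validated before being returned in a time format\n- Body: readability's main-content extraction method, returning the body content with tags and formatting\n\n---\n\n## Usage\nInstall: pip install lxparse\n\nExample call:\n```python\nfrom lxparse import LxParse\nlx = LxParse()\n\nlist_html = \"\"\nlx.parse_list(list_html)\n# Specify a parsing rule\nlx.parse_list(list_html,xpath_list='/div[@id=\"lx\"]/a')\n\ndetail_html = \"\"\nlx.parse_detail(detail_html)\n# Specify parsing rules; any rule left undeclared falls back to the default\nxpath_item = {\n    'xpath_title':'',\n    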
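'xpath_source':'',\n    'xpath_date':'',\n    'xpath_author':'',\n    'xpath_content':'',\n}\nlx.parse_detail(detail_html,item=xpath_item)\n```\n\nparse_detail returns:\n![Alt](./image/detail.png)\n\n\n---\n\n## Test code\n- The demo file contains parsing examples for both list pages and detail pages\n- With the HTML saved locally, tests show that pages from Toutiao, Sina News, Baidu News, NetEase News, Tencent News, and others parse correctly.\n\n---\n\n## Notes\n- If lxparse fails to parse a page correctly, you can specify the parsing rules manually.\n- Test cases are limited; if you run into problems, please open an issue so we can improve it together.\n- Or follow the WeChat official account Pythonlx to get the group-chat QR code and join the discussion\n\n\n![Alt](./image/wx.png)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flixi5338619%2Flxparse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flixi5338619%2Flxparse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flixi5338619%2Flxparse/lists"}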