{"id":18402192,"url":"https://github.com/gerapy/gerapyautoextractor","last_synced_at":"2025-04-12T21:21:47.133Z","repository":{"id":45380139,"uuid":"275440480","full_name":"Gerapy/GerapyAutoExtractor","owner":"Gerapy","description":"Auto Extractor Module ","archived":false,"fork":false,"pushed_at":"2024-08-19T12:53:15.000Z","size":724,"stargazers_count":328,"open_issues_count":17,"forks_count":83,"subscribers_count":14,"default_branch":"master","last_synced_at":"2025-04-04T01:07:26.764Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://pypi.org/project/gerapy-auto-extractor/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Gerapy.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-06-27T19:32:02.000Z","updated_at":"2025-03-26T03:01:01.000Z","dependencies_parsed_at":"2024-11-29T13:15:35.394Z","dependency_job_id":null,"html_url":"https://github.com/Gerapy/GerapyAutoExtractor","commit_stats":{"total_commits":65,"total_committers":5,"mean_commits":13.0,"dds":0.09230769230769231,"last_synced_commit":"d65ddd09d4714db4615ea0d522829b20ca7805e5"},"previous_names":[],"tags_count":13,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gerapy%2FGerapyAutoExtractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gerapy%2FGerapyAutoExtractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gerapy%2FGerapyAutoExtractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gerapy%2FGerapyAutoExtractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Gerapy","download_url":"https://codeload.github.com/Gerapy/GerapyAutoExtractor/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248632535,"owners_count":21136707,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T02:41:33.879Z","updated_at":"2025-04-12T21:21:47.113Z","avatar_url":"https://github.com/Gerapy.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Gerapy Auto Extractor\n\n![Python package](https://github.com/Gerapy/GerapyAutoExtractor/workflows/Python%20package/badge.svg)\n![PyPI - Python Version](https://img.shields.io/pypi/pyversions/gerapy-auto-extractor)\n![PyPI](https://img.shields.io/pypi/v/gerapy-auto-extractor)\n![PyPI - Downloads](https://img.shields.io/pypi/dm/gerapy-auto-extractor)\n![License](https://img.shields.io/badge/license-Apache%202-blue)\n\nThis is the Auto Extractor Module for [Gerapy](https://github.com/Gerapy/Gerapy), You can also use it separately.\n\nYou can use this package to distinguish between list page and detail page, and we can use it to extract\n`url` from list page and also extract `title`, `datetime`, `content` from detail page without any XPath or Selector.\n\nIt works better for Chinese News Website than other scenarios.\n\nIntroduction: [Introduction](https://www.v2ex.com/t/687948)\n\n## Installation\n\nYou can use this command to install this package:\n\n```\npip3 install gerapy-auto-extractor\n```\n\n## Usage\n\nBelow are the methods this package implemented:\n\n### Extraction of List Page\n\nFor list page, you can use `extract_list` method to extract the main list urls and their titles.\n\n### Extraction of Detail Page\n\nFor detail page, you can use `extract_title` method to extract title, use `extract_content` method to extract content,\nuse `extract_datetime` method to extract datetime.\n\nAlso you can use `extract_detail` method to extract all above attrs, results are joined as a json.\n\n### Classification of List/Detail Page\n\nYou can use `is_list` or `is_detail` method to distinguish if this page is list page or detail page, the type of returned result is `bool`.\nAlso you can use `probability_of_list` or `probability_of_detail` method to get the probability of the classification of this page, the type of returned result is `float`.\n\nUsage example:\n\n```python\nfrom gerapy_auto_extractor import extract_list, extract_detail, is_detail, is_list, probability_of_detail, probability_of_list\nfrom gerapy_auto_extractor.helpers import content, jsonify\n\nhtml = content('samples/list/sample.html')\nprint(jsonify(extract_list(html)))\n\nhtml = content('samples/detail/sample.html')\nprint(jsonify(extract_detail(html)))\n\nhtml = content('samples/detail/sample.html')\nprint(probability_of_detail(html), probability_of_list(html))\nprint(is_detail(html), is_list(html))\n\nhtml = content('samples/list/sample.html')\nprint(probability_of_detail(html), probability_of_list(html))\nprint(is_detail(html), is_list(html), )\n```\n\nHTML files can be found in [samples](./samples).\n\nBelow are outputs:\n\n```\n[\n  {\n    \"title\": \"山东通报\\\"苟晶事件\\\"：15人被处理部分事实有反转\",\n    \"url\": \"http://news.163.com/20/0703/13/FGK7NCOR0001899O.html\"\n  },\n  {\n    \"title\": \"胡锡进：香港这仗就是要让华盛顿明白，它管多了\",\n    \"url\": \"https://news.163.com/20/0702/19/FGI8IUEP0001899O.html\"\n  },\n  {\n    \"title\": \"山东一校长为儿子伪造档案11岁开始领国家工资\",\n    \"url\": \"https://news.163.com/20/0702/21/FGIENBGS0001899O.html\"\n  },\n  {\n    \"title\": \"大理西洱河又现\\\"鱼腾\\\"奇景市民沿岸围观有人徒手抓\",\n    \"url\": \"https://news.163.com/20/0704/03/FGLOFC3P0001875P.html\"\n  },\n  {\n    \"title\": \"陈国基被任命为香港特别行政区国安委秘书长\",\n    \"url\": \"https://news.163.com/20/0702/12/FGHFAVS200018AOQ.html\"\n  },\n  {\n    \"title\": \"孙力军等6名中管干部被查上半年反腐数据说明啥？\",\n    \"url\": \"https://news.163.com/20/0703/00/FGIPQ11D0001899O.html\"\n  },\n  {\n    \"title\": \"香港特区政府严厉谴责暴徒恶行全力支持警队严正执法\",\n    \"url\": \"https://news.163.com/20/0702/09/FGH801750001899O.html\"\n  }\n]\n{\n  \"title\": \"美国新冠肺炎确诊病例超278万例 死亡129227例\",\n  \"datetime\": \"2020-07-04 01:55:04+08:00\",\n  \"content\": \"（原标题：美国新冠肺炎确诊病例超过278万例）\\n根据约翰斯·霍普金斯大学的最新数据统计，截至美东时间7月3日16时33分，美国新冠肺炎确诊病例超过278万例，为2780916例，死亡病例为129227例。新增确诊病例数较当日9时33分公布的数据增长了40563例。\\n目前，美国至少有19个州已经下令要求民众在公共场合佩戴口罩。佛罗里达州坦帕市市长简·卡斯特在7月3日接受电视采访时表示，“在美国的任何地方都没有反对戴口罩的好理由”。卡斯特已经下令要求该市民众必须在公众场合佩戴口罩，并认为没有理由反对在全州范围颁发“口罩强制令”，但她表示最好的方式“是各市与县政府自行发出命令。”除该市外，佛罗里达州的迈阿密，杰克逊维尔和棕榈滩县也开始要求民众在公共场所戴口罩。\\n截至目前，佛罗里达有接近17万人确诊新冠肺炎，日增新确诊病例数于7月2日突破1万例，今日统计再新增9488例。\\n【世卫：即使20%的人拥有抗体 新冠病毒还能有效传播】\\n当地时间7月3日，世卫组织召开新冠肺炎发布会，世卫组织卫生紧急项目负责人迈克尔·瑞安表示，人群中出现任何程度的抗体都会提供一定的屏障，因为一旦有人得到保护，病毒就会更难传播，但要想达到防火墙一般的效果，就需要比例较高的人群呈抗体阳性。即使20%的人拥有抗体，病毒还是能够有效传播，同时还要考虑抗体所能提供的保护时长。\\n详情\u003e\u003e\\n【美国西雅图一大学宿舍区超100名学生感染新冠肺炎】\\n当地时间7月3日，据当地媒体报道，西雅图华盛顿大学的宿舍区中暴发新冠肺炎疫情，其中至少105名学生被确诊为新冠肺炎患者。校方表示，目前至少有800名学生进行了新冠病毒检测，其中至少62名确诊学生同属一个社团，目前学校已被通知停止学生一切聚会活动。\\n【白宫“不顾疫情”大搞独立日庆典 预计7500人参加】\\n华盛顿特区的活动只是特朗普为独立日举办的盛大庆典的第二出。当地时间3日，特朗普将前往位于南达科他州的拉什莫尔山国家纪念公园，在著名的“总统山”下发表演讲，届时还将举行烟花表演。据法新社报道，这场活动预计吸引7500人参加，然而戴口罩、保持社交距离等防疫措施依然是靠民众自觉。\\n详情\u003e\u003e\\n【美国至少37州疫情反弹 至少19州发布\\\"口罩强制令\\\"】\\n截至目前，全美至少37个州出现疫情反弹，其中加利福尼亚、亚利桑那、德克萨斯，以及佛罗里达州本周确诊病例数均高于此前日增记录，另有蒙大拿、爱达荷、内华达、佛罗里达、佐治亚、田纳西、路易斯安那、阿拉斯加，以及特拉华州新增病例数超过50%。\\n详情\u003e\u003e\\n【非洲地区新冠肺炎确诊病例快速增长至43.3万】\\n截至非洲东部时间7月3日，非洲疾控中心数据显示：非洲地区54个国家报告了新冠肺炎确诊病例433500例，死亡10658例，208400人康复。\\n尽管非洲确诊病例数仍在持续增加，但尼日利亚、塞拉利昂等国已决定恢复通航。\\n尼日利亚表示，各大机场将于近期陆续恢复国内航班的运营，其中首都阿布贾和经济中心拉各斯的机场将于8日率先开放，其他城市机场将于11日起陆续开放。国际航班的恢复日期将在适当时候宣布。\\n【巴西卫生部：新冠肺炎感染人数或已超过1050万】\\n巴西卫生部针对巴西全国的一项调查显示，巴西新冠肺炎实际感染人数可能已超过1050万，是目前巴西公布的已确诊新冠肺炎病毒感染人数的7倍以上。\\n这项调查是由巴西卫生部与佩洛塔斯联邦大学联合进行的。这一调查结果2日发表在巴西联邦政府的报告中。这项调查分三个阶段，通过对巴西133定点人口分布最多的城市进行抽样调查，估算出具有新冠病毒抗体的人群比例，并分析巴西全国感染人群的演变。\\n巴西联邦政府希望通过这个研究，帮助地方政府制定相应的经济活动开放或限制措施。\"\n}\n\n0.9990605314033392 0.0009394685966607814\nTrue False\n0.033477426883441685 0.9665225731165583\nFalse True\n```\n\nJust for Beta.\n\nNeeds more effort to improve.\n\n## Reference\n\n### Paper\n\n* [面向不规则列表的网页数据抽取技术的研究](http://www.cnki.com.cn/Article/CJFDTotal-JSYJ201509023.htm)\n* [基于文本及符号密度的网页正文提取方法](https://kns.cnki.net/KCMS/detail/detail.aspx?dbcode=CJFQ\u0026dbname=CJFDLAST2019\u0026filename=GWDZ201908029\u0026v=MDY4MTRxVHJXTTFGckNVUkxPZmJ1Wm5GQ2poVXJyQklqclBkTEc0SDlqTXA0OUhiWVI4ZVgxTHV4WVM3RGgxVDM=)\n* [基于块密度加权标签路径特征的Web新闻在线抽取](https://kns.cnki.net/kcms/detail/detail.aspx?filename=PZKX201708010\u0026dbcode=CJFQ\u0026dbname=CJFD2017\u0026v=)\n* [基于DOM树和视觉特征的网页信息自动抽取](http://www.cnki.com.cn/Article/CJFDTOTAL-JSJC201310069.htm)\n\n### Project\n\n* [GeneralNewsExtractor](https://github.com/kingname/GeneralNewsExtractor)\n* [Readability](https://github.com/buriy/python-readability)\n\n## Citing \n\nIf you use Gerapy Auto Extractor in your research or project, please add a reference using the following BibTeX entry.\n\n```\n@misc{cui2020gerapy,\n  author =       {Qingcai Cui},\n  title =        {Gerapy Auto Extractor},\n  howpublished = {\\url{https://github.com/Gerapy/GerapyAutoExtractor}},\n  year =         {2020}\n}\n```\n\n## Changelog\n\nSee [Changelog](./CHANGELOG.md)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgerapy%2Fgerapyautoextractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgerapy%2Fgerapyautoextractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgerapy%2Fgerapyautoextractor/lists"}