{"id":21767814,"url":"https://github.com/mouday/pageparser","last_synced_at":"2025-04-13T15:26:38.835Z","repository":{"id":57450254,"uuid":"152832588","full_name":"mouday/PageParser","owner":"mouday","description":"网页解析器，用于网络爬虫解析页面, 不懂网页解析也能写爬虫","archived":false,"fork":false,"pushed_at":"2024-02-18T10:09:45.000Z","size":133,"stargazers_count":50,"open_issues_count":0,"forks_count":17,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-11T10:17:24.693Z","etag":null,"topics":["crawler","parser","python","spider"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mouday.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-10-13T03:48:17.000Z","updated_at":"2025-03-30T12:02:49.000Z","dependencies_parsed_at":"2022-09-10T00:30:59.025Z","dependency_job_id":null,"html_url":"https://github.com/mouday/PageParser","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mouday%2FPageParser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mouday%2FPageParser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mouday%2FPageParser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mouday%2FPageParser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mouday","download_url":"https://codeload.github.com/mouday/PageParser/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":2487343
91,"owners_count":21153202,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","parser","python","spider"],"created_at":"2024-11-26T13:30:11.767Z","updated_at":"2025-04-13T15:26:38.805Z","avatar_url":"https://github.com/mouday.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PageParser\n\n[![Build Status](https://travis-ci.org/mouday/PageParser.svg?branch=master)](https://travis-ci.org/mouday/PageParser)\n![GitHub](https://img.shields.io/github/license/mashape/apistatus.svg)\n![PyPI - Python Version](https://img.shields.io/pypi/pyversions/page-parser.svg)\n![PyPI - Downloads](https://img.shields.io/pypi/dm/page-parser.svg)\n![PyPI](https://img.shields.io/pypi/v/page-parser.svg)\n![GitHub last commit](https://img.shields.io/github/last-commit/mouday/PageParser.svg)\n![PyPI - Format](https://img.shields.io/pypi/format/page-parser.svg)\n\n## Project Overview\n\nProject name: Write a Crawler in Six Lines of Code\n\nEnglish name: PageParser\n\nSummary: a web page parsing package for crawlers that maximizes code reuse\n\nGoal: write a crawler without knowing how to parse web pages\n\n\u003e Note: this project is for learning and exchange only and must not be used in commercial projects\n\n## Installation\n```\npip install page-parser\n```\n\nMinimal example:\n\n```python\nimport page_parser\n\n# 1. Specify the web page\nurl = \"https://www.baidu.com/\"\n\n# 2. Parse the web page\nitems = page_parser.parse(url)\n\n# 3. Print the data\nfor item in items: print(item)\n# {'title': '百度一下，你就知道'}\n```\n\n## Supported Pages\n\n| No. |Site | Page |URL |\n| - |- | - | - |\n| 1 |Baidu | Home page | https://www.baidu.com/ |\n| 2 |Douban | Movies now showing | https://movie.douban.com/ |\n| 3 |Lagou | Job listings page | https://www.lagou.com/zhaopin/ |\n| 4 |Qichacha | Financing events page | https://www.qichacha.com/elib_financing |\n| 5 |Xici Daili | Home page | http://www.xicidaili.com/ |\n| 6 |Xici Daili | 
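Domestic high-anonymity proxies | http://www.xicidaili.com/nn/ |\n| 7 |Xici Daili | Domestic ordinary proxies | http://www.xicidaili.com/nt/ |\n| 8 |Xici Daili | Domestic HTTPS proxies | http://www.xicidaili.com/wn/ |\n| 9 |Xici Daili | Domestic HTTP proxies | http://www.xicidaili.com/wt/ |\n| 10 |Sogou Search | WeChat official account search page | https://weixin.sogou.com/weixin?type=1\u0026query=百度 |\n| 11 | Jandan | Home page list | http://jandan.net/|\n|12| Jobbole | Python column | http://python.jobbole.com/|\n\n\n## Web Crawler Workflow\n\n```\nPage downloader -\u003e Page parser -\u003e Data storage\n\n```\n\n`Page downloader`: mainly about getting past anti-crawling measures; approaches vary, and this is where the difficulty of crawling lies\n\n`Page parser`: a page's layout usually stays fixed for a period of time, and everyone who downloads it has to extract its content, which is repeated work\n\n`Data storage`: whichever file or database the data goes to depends mainly on business requirements\n\nThis project factors out the page parsing step so that a crawler can concentrate on downloading pages rather than on repetitive page parsing\n\n## Project Notes\n\nThis project can be used together with Python's requests and scrapy\n\nTo use it from other programming languages, wrap it with a web framework such as flask and expose it as a web API\n\nInitiator: mouday\n\nStarted: 2018-10-13\n\nMore maintainers are needed\n\n## Contributing Code\n\nPut all contributed code in the folder: page_parser\n\nUnless there is a good reason not to, contributed code should follow the format of the example below so that it is easy for users to call\n\nbaidu_parser.py\n\n## Guidelines\n\n### Principles\n\n1. Create one parser class per website\n\n2. Put the parse methods inside the parser class and make them static methods so they are easy to call\n\n3. Page parsing goes out of date, so each parser must `state its date`\n\n\n### Naming Rules\nFor example:\n```\nFile name: baidu_parser\nClass name: BaiduParser\nMethod name: parse_index\n```\n\n### Other\n\n1. Necessary code comments\n\n2. Necessary test code\n\n3. Any other necessary code\n\n\n## Join Us\n### Basic Requirements\n1. Basic Python syntax, object-oriented programming, and generators (yield)\n2. Libraries to know: requests, parsel, scrapy (familiarity is enough)\n3. parsel (XPath-based) is the standard parsing library here: simple, efficient, and seamlessly compatible with scrapy\n4. It is fine if you are not yet familiar with these; read the reference articles below, and anyone willing to learn can pick them up quickly\n\nReference articles:\n\n1. [Python Programming: Classes and Object-Oriented Programming](https://blog.csdn.net/mouday/article/details/79002712)\n\n2. [Python Programming: A Simple Explanation of the Difference Between yield and yield from in Generators](https://blog.csdn.net/mouday/article/details/80760973)\n\n3. [Python Crawlers: Basic Usage of the requests Library](https://blog.csdn.net/mouday/article/details/80087627)\n\n4. [Python Web Crawlers: The scrapy Framework](https://blog.csdn.net/mouday/article/details/79736108)\n\n5. [Python Crawlers: Examples of Common XPath Methods](https://blog.csdn.net/mouday/article/details/80364436)\n\n6. 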
[Python Crawlers: XPath and CSS Selector Syntax in the scrapy Framework](https://blog.csdn.net/mouday/article/details/80455560)\n\n### Contact\n\nPageParser QQ group number: 932301512\n\n![](images/page-parser-min.jpeg)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmouday%2Fpageparser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmouday%2Fpageparser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmouday%2Fpageparser/lists"}