{"id":16296238,"url":"https://github.com/jannchie/simpyder","last_synced_at":"2025-06-16T05:41:31.096Z","repository":{"id":51366824,"uuid":"234370090","full_name":"Jannchie/simpyder","owner":"Jannchie","description":"超高速异步协程Python爬虫","archived":false,"fork":false,"pushed_at":"2023-02-15T15:54:32.000Z","size":67,"stargazers_count":77,"open_issues_count":0,"forks_count":24,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-10-11T20:22:30.614Z","etag":null,"topics":["crawler","python","spider"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/simpyder/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Jannchie.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":"jannchie","patreon":null,"open_collective":null,"ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"otechie":null,"custom":["https://azz.net/jannchie"]}},"created_at":"2020-01-16T17:07:50.000Z","updated_at":"2024-05-16T06:20:39.000Z","dependencies_parsed_at":"2024-10-10T20:21:55.317Z","dependency_job_id":"25550551-7194-4c25-a853-4f8c6ae4ae10","html_url":"https://github.com/Jannchie/simpyder","commit_stats":{"total_commits":70,"total_committers":3,"mean_commits":"23.333333333333332","dds":0.02857142857142858,"last_synced_commit":"304060737953de5bbae6a1383bc19074e8402aa7"},"previous_names":[],"tags_count":31,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Jannchie%2Fsimpyder","tags_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/repositories/Jannchie%2Fsimpyder/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Jannchie%2Fsimpyder/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Jannchie%2Fsimpyder/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Jannchie","download_url":"https://codeload.github.com/Jannchie/simpyder/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":221664305,"owners_count":16860032,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","python","spider"],"created_at":"2024-10-10T20:21:48.500Z","updated_at":"2024-10-27T10:41:26.564Z","avatar_url":"https://github.com/Jannchie.png","language":"Python","funding_links":["https://github.com/sponsors/jannchie","https://azz.net/jannchie"],"categories":[],"sub_categories":[],"readme":"# Simpyder - Simple Python Spider\n\nSimpyder - a lightweight **coroutine-based** Python crawler\n\n## Features\n\n- Lightweight: easy to install, few dependencies, simple to use.\n- Coroutines: single-threaded, with concurrency implemented through coroutines.\n- Customizable: simple configuration that adapts to a variety of crawling scenarios.\n  \n## Quick Start\n\n### Installation\n\n```bash\n# Install with pip3\npip3 install simpyder --user\n```\n\n```bash\n# Upgrade the package\npip3 install simpyder --upgrade\n```\n\n### Writing the Code\n\nYou only need to define three functions, one for each of three modules:\n\n#### Link Generation\n\nFirst, define an [asynchronous generator](https://docs.python.org/zh-cn/3/c-api/gen.html) that yields the links to crawl.\n\n``` python\nasync def gen_url():\n    for each_id in range(100):\n        yield \"https://www.biliob.com/api/video/{}\".format(each_id)\n```\n\n#### 
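Link Parsing\n\nNext, define a function that parses each link. Its first argument is a Response object, the result of requesting a URL produced by the generator above.\n\nThe function must return an object as the parsing result.\n\nNote that, unlike an ordinary function, this is a coroutine function, so it must be declared with `async`.\n\n``` python\nasync def parse(response):\n    return response.xpath('//meta[@name=\"title\"]/@content')[0]\n```\n\n#### Data Export\n\nThe results returned by the parse function are all exported through this function. The example below simply prints each result to the console.\n\nSaving involves I/O and may be slow, so this function must also be asynchronous and is declared with `async`.\n\n``` python\nasync def save(item):\n    print(item)\n```\n\n### Then assemble these modules into a Spider\n\nFirst, import the spider class:\n\n``` python\nfrom simpyder.spiders import AsynSpider\n```\n\nYou can assemble the spider like this:\n\n``` python\nspider = AsynSpider()\nspider.gen_url = gen_url\nspider.parse = parse\nspider.save = save\n```\n\n### Then start the crawl\n\n``` python\nspider.run()\n```\n\n### You can also configure the spider through the constructor\n\n``` python\n\nspider = AsynSpider(name=\"TEST\")\n```\n\n## Example Program\n\n``` python\nfrom simpyder.spiders import AsynSpider\n\n# Create an asynchronous spider\nspider = AsynSpider()\n\n# Define the async generator that yields links; this one requests the Baidu home page 800 times\nasync def g():\n  count = 0\n  while count \u003c 800:\n    count += 1\n    yield \"https://www.baidu.com\"\n\n# Bind the generator\nspider.gen_url = g\n\n# Define the async parse function; it does no real work and returns a piece of text\nasync def p(res):\n  return \"parsed item\"\n\n# Bind the parser\nspider.parse = p\n\n# Define the async save function; it does no real work but returns 2, meaning 2 objects were produced\nasync def s(item):\n  return 2\n\n# Bind the saver\nspider.save = s\n\n# Run\nspider.run()\n\n```\n\n## Theoretical Rate\n\nRunning the example above in a single process, with a concurrency of 64 and handlers that do nothing but count, produces a log like the following. In the log, `已经爬取N个链接` is the number of links crawled so far and `共产生N个对象` is the number of objects produced; the final lines report the total elapsed time (`累计消耗时间`), total links crawled (`累计爬取链接`), and total objects produced (`累计生成对象`).\n\n``` log\n[2020-09-02 23:42:48,097][CRITICAL] @ Simpyder: user_agent: Simpyder ver.0.1.9\n[2020-09-02 23:42:48,169][CRITICAL] @ Simpyder: concurrency: 64\n[2020-09-02 23:42:48,244][CRITICAL] @ Simpyder: interval: 0\n[2020-09-02 23:42:48,313][INFO] @ Simpyder: 已经爬取0个链接(0/min)，共产生0个对象(0/min) \n[2020-09-02 23:42:48,319][INFO] @ Simpyder: Start Crawler: 0\n[2020-09-02 23:42:53,325][INFO] @ Simpyder: 已经爬取361个链接(4332/min)，共产生658个对象(7896/min) \n[2020-09-02 23:42:58,304][INFO] @ Simpyder: 已经爬取792个链接(5280/min)，共产生1540个对象(10266/min) \n[2020-09-02 23:43:03,304][INFO] @ Simpyder: 已经爬取1024个链接(4388/min)，共产生2048个对象(8777/min) \n[2020-09-02 23:43:05,007][CRITICAL] @ Simpyder: 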
Simpyder任务执行完毕\n[2020-09-02 23:43:05,008][CRITICAL] @ Simpyder: 累计消耗时间：0:00:16.695013\n[2020-09-02 23:43:05,008][CRITICAL] @ Simpyder: 累计爬取链接：1024\n[2020-09-02 23:43:05,009][CRITICAL] @ Simpyder: 累计生成对象：2048\n```\n\n---\n\n- This project is maintained by [@Jannchie](https://github.com/Jannchie).\n- You can reach the maintainer by email at [jannchie@gmail.com](jannchie@gmail.com).","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjannchie%2Fsimpyder","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjannchie%2Fsimpyder","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjannchie%2Fsimpyder/lists"}