{"id":18284594,"url":"https://github.com/lixi5338619/asyncpy","last_synced_at":"2025-05-07T10:36:28.845Z","repository":{"id":37378910,"uuid":"266686480","full_name":"lixi5338619/asyncpy","owner":"lixi5338619","description":"使用asyncio和aiohttp开发的轻量级异步协程web爬虫框架","archived":false,"fork":false,"pushed_at":"2022-10-23T05:11:12.000Z","size":65,"stargazers_count":108,"open_issues_count":5,"forks_count":28,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-04-15T07:43:28.563Z","etag":null,"topics":["aiohttp","asyncio","asyncpy","crawler","python","scrapy"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lixi5338619.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-05-25T04:55:57.000Z","updated_at":"2025-03-12T07:31:49.000Z","dependencies_parsed_at":"2022-08-28T01:11:17.171Z","dependency_job_id":null,"html_url":"https://github.com/lixi5338619/asyncpy","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lixi5338619%2Fasyncpy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lixi5338619%2Fasyncpy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lixi5338619%2Fasyncpy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lixi5338619%2Fasyncpy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lixi5338619","download_url":"https://codeload.github.com/lixi5338619/asyncpy/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https:/
/github.com","kind":"github","repositories_count":252860615,"owners_count":21815540,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aiohttp","asyncio","asyncpy","crawler","python","scrapy"],"created_at":"2024-11-05T13:14:08.474Z","updated_at":"2025-05-07T10:36:28.814Z","avatar_url":"https://github.com/lixi5338619.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# asyncpy\nA lightweight asynchronous coroutine web crawler framework built on asyncio and aiohttp  \n\n\u003cimg src=\"https://img-blog.csdnimg.cn/20200523121741871.png?x-oss-process=image/resize,m_fixed,h_224,w_224\"/\u003e\n\n\nAsyncpy is a lightweight and efficient crawler framework that I built on asyncio and aiohttp. It follows Scrapy's design pattern and borrows processing logic from several open-source frameworks on GitHub.\n\n---\n\n## Changelog\n\n- 1.1.7: Fixed an error raised when the event loop finishes\n- 1.1.8: Spider files no longer need to set settings_attr manually\n\n\n- - -\nDocumentation: [https://blog.csdn.net/weixin_43582101/article/details/106320674](https://blog.csdn.net/weixin_43582101/article/details/106320674)\n\nUsage examples: [https://blog.csdn.net/weixin_43582101/category_10035187.html](https://blog.csdn.net/weixin_43582101/category_10035187.html)\n\ngithub: [https://github.com/lixi5338619/asyncpy](https://github.com/lixi5338619/asyncpy)\n\npypi:  
[https://pypi.org/project/asyncpy/](https://pypi.org/project/asyncpy/)\n\n![](https://img-blog.csdnimg.cn/20200521150905651.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MzU4MjEwMQ==,size_16,color_FFFFFF,t_70)\n\n**asyncpy architecture and workflow**\n\n![](https://img-blog.csdnimg.cn/20200523130546527.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MzU4MjEwMQ==,size_16,color_FFFFFF,t_70)\n\n---\n## Requirements\nPython \u003e=3.6 is required.\n\nDependencies: [ 'lxml', 'parsel','docopt', 'aiohttp']\n\n**Install command:**\n```bash\npip install asyncpy\n```\n**If the installation fails with:**\n```\nERROR: Could not find a version that satisfies the requirement asyncpy (from versions: none)\nERROR: No matching distribution found for asyncpy\n```\ncheck your current Python version; Python 3.6 or later is required.\n\nIf the download still fails, go to [https://pypi.org/project/asyncpy/](https://pypi.org/project/asyncpy/) and download the latest whl file.  \nClick Download files, and once the download finishes, install it from cmd:  \n`pip install asyncpy-\u003cversion\u003e-py3-none-any.whl`\n\n- - - \n\n### Creating a spider file\nRun asyncpy --version on the command line to check that the installation succeeded.\n\nTo create the demo file, run the following command in cmd:\n\n```bash\nasyncpy genspider demo\n```\n\n- - -\n### Global settings\n| Setting | Description |\n|--|--|\n| CONCURRENT_REQUESTS | Number of concurrent requests |\n| RETRIES | Number of retries |\n| DOWNLOAD_DELAY | Download delay |\n| RETRY_DELAY | Retry delay |\n| DOWNLOAD_TIMEOUT | Request timeout |\n| USER_AGENT | Global User-Agent |\n| LOG_FILE | Log file path |\n| LOG_LEVEL | Log level |\n| PIPELINES | Pipelines |\n| MIDDLEWARE | Middleware |\n\n\nBefore version 1.1.8, enabling the global settings required passing the settings module to the spider through settings_attr in the spider file:\n```python\nimport settings\nclass DemoSpider(Spider):\n    name = 'demo'\n    start_urls = []\n    settings_attr = settings\n```\n\n**Newer versions no longer require passing settings manually.**\n\n- - -\n### Custom settings\nTo configure settings for a single spider file, you can define **custom_settings** in the spider file, as in Scrapy.\nIt does not conflict with settings_attr.\n```python\nclass DemoSpider2(Spider):\n    name = 'demo2'\n\n    start_urls = []\n\n    concurrency = 30           
                     # number of concurrent requests\n    \n    custom_settings = {\n        \"RETRIES\": 1,                               # number of retries\n        \"DOWNLOAD_DELAY\": 0,                        # download delay\n        \"RETRY_DELAY\": 0,                           # retry delay\n        \"DOWNLOAD_TIMEOUT\": 10,                     # request timeout\n        \"LOG_FILE\": \"demo2.log\"                     # log file\n    }\n```\n- - -\n### Generating log files\nAdd the following to the settings file:\n```python\nLOG_FILE = './asyncpy.log'\nLOG_LEVEL = 'DEBUG'\n```\nTo write a separate log file for each of several spiders,\nremove the log configuration from settings and set it in each spider's custom_settings instead.\n- - -\n### Custom middleware\nAdd new functionality in the generated demo_middleware file.  \nMiddleware can act selectively based on request.meta and on the spider's attributes.\n```python\nfrom asyncpy.middleware import Middleware\n\nmiddleware = Middleware()\n\n@middleware.request\nasync def UserAgentMiddleware(spider, request):\n    if request.meta.get('valid'):\n        print(\"Current spider name: %s\" % spider.name)\n        ua = \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3100.0 Safari/537.36\"\n        request.headers.update({\"User-Agent\": ua})\n\n\n@middleware.request\nasync def ProxyMiddleware(spider, request):\n    if spider.name == 'demo':\n        request.aiohttp_kwargs.update({\"proxy\": \"http://123.45.67.89:0000\"})\n```\n**Method 1: enable the middleware in the settings file.** (Due to a version update, please use method 2 for now.)\n```python\nMIDDLEWARE = [\n    'demo_middleware.middleware',\n]\n```\n**Method 2: pass the middleware to start():** \n```python\nfrom middlewares import middleware\nDemoSpider.start(middleware=middleware)\n```\n- - -\n### Custom pipelines\nIf you yield items (currently only dict items are supported) and pipelines are enabled in settings, you can put your database connection and insert code in the pipelines.\n\n**In the spider file:**\n```python\n    item = {}\n    item['response'] = response.text\n    item['datetime'] = '2020-05-21 13:14:00'\n    yield item\n```\n**In the pipelines.py file:**\n```python\nclass SpiderPipeline():\n\n    def __init__(self):\n        pass\n\n    def process_item(self, item, spider_name):\n        
pass\n```\n\n**Method 1: enable the pipeline in settings.** (Due to a version update, please use method 2 for now.)\n```python\nPIPELINES = [\n    'pipelines.SpiderPipeline',\n]\n```\n**Method 2: pass the pipelines to start():** \n```python\nfrom pipelines import SpiderPipeline\nDemoSpider.start(pipelines=SpiderPipeline)\n```\n- - -\n### POST requests: overriding start_requests\nTo send a POST request directly, clear the entries in **start_urls** and override the start_requests method.\n- - -\n### Parsing the response\nResponses are parsed with parsel, the parsing library used by Scrapy, so the parsing methods are the same as in Scrapy: XPath, CSS selectors, and regular expressions are supported.\n\nSimple examples:\n\n- `xpath(\"//div[@id='demo']/text()\").get()`: returns the first match\n- `xpath(\"//div[@id='demo']/text()\").getall()`: returns all matches as a list\n\n- - -\n\n### Starting the spider\nIn the spider file, start the spider by calling start() on the spider class.\nFor example, if the class is named DemoSpider:\n```python\nDemoSpider.start()\n```\n - - -\n### Starting multiple spiders\nThis is not fully supported yet. For now, you can run the spiders in separate processes:\n```python\nfrom Demo.demo import DemoSpider\nfrom Demo.demo2 import DemoSpider2\nimport multiprocessing\n\ndef open_DemoSpider2():\n    DemoSpider2.start()\n\ndef open_DemoSpider():\n    DemoSpider.start()\n\nif __name__ == \"__main__\":\n    p1 = multiprocessing.Process(target = open_DemoSpider)\n    p2 = multiprocessing.Process(target = open_DemoSpider2)\n    p1.start()\n    p2.start()\n```\n\n\n- - -\n**Special thanks**: Scrapy, Ruia, Looter, asyncio, aiohttp\n- - - \n\nIf you find this project interesting, please give it a star on [github](https://github.com/lixi5338619/asyncpy). Thank you all!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flixi5338619%2Fasyncpy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flixi5338619%2Fasyncpy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flixi5338619%2Fasyncpy/lists"}