{"id":17337378,"url":"https://github.com/anycodes/developer-spider","last_synced_at":"2025-03-27T08:11:57.722Z","repository":{"id":144049648,"uuid":"515819554","full_name":"anycodes/developer-spider","owner":"anycodes","description":"阿里云开发者社区爬虫","archived":false,"fork":false,"pushed_at":"2022-07-20T03:41:59.000Z","size":12,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-02-01T13:11:20.587Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/anycodes.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-07-20T03:26:23.000Z","updated_at":"2022-07-20T03:27:06.000Z","dependencies_parsed_at":null,"dependency_job_id":"fa263d31-a6c1-4c89-8fdf-6b6c0edcd5e0","html_url":"https://github.com/anycodes/developer-spider","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anycodes%2Fdeveloper-spider","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anycodes%2Fdeveloper-spider/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anycodes%2Fdeveloper-spider/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anycodes%2Fdeveloper-spider/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/anycodes","download_url":"https://codeload.github.com/anycodes/developer-spider/t
ar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245806458,"owners_count":20675298,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-15T15:34:47.505Z","updated_at":"2025-03-27T08:11:57.696Z","avatar_url":"https://github.com/anycodes.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# developer-spider Help Documentation\n\n\u003cp align=\"center\" class=\"flex justify-center\"\u003e\n    \u003ca href=\"https://www.serverless-devs.com\" class=\"ml-1\"\u003e\n    \u003cimg src=\"http://editor.devsapp.cn/icon?package=developer-spider\u0026type=packageType\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"http://www.devsapp.cn/details.html?name=developer-spider\" class=\"ml-1\"\u003e\n    \u003cimg src=\"http://editor.devsapp.cn/icon?package=developer-spider\u0026type=packageVersion\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"http://www.devsapp.cn/details.html?name=developer-spider\" class=\"ml-1\"\u003e\n    \u003cimg src=\"http://editor.devsapp.cn/icon?package=developer-spider\u0026type=packageDownload\"\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n\u003cdescription\u003e\n\n\u003e ***A crawler for the Alibaba Cloud Developer Community***\n\n\u003c/description\u003e\n\n\u003ctable\u003e\n\n## Prerequisites\nTo use this project, we recommend that your account have the following product permissions / policies:\n\n| Service | Function Compute |     \n| --- |  --- |   \n| Permission / Policy | AliyunFCFullAccess |     \n\n\n\u003c/table\u003e\n\n\u003ccodepre id=\"codepre\"\u003e\n\n\n\n\u003c/codepre\u003e\n\n\u003cdeploy\u003e\n\n## Deploy \u0026 Try It Out\n\n\u003cappcenter\u003e\n\n- :fire: Via the [Serverless 
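Application Center](https://fcnext.console.aliyun.com/applications/create?template=developer-spider),\n[![Deploy with Serverless Devs](https://img.alicdn.com/imgextra/i1/O1CN01w5RFbX1v45s8TIXPz_!!6000000006118-55-tps-95-28.svg)](https://fcnext.console.aliyun.com/applications/create?template=developer-spider)  this application. \n\n\u003c/appcenter\u003e\n\n- Deploy with the [Serverless Devs Cli](https://www.serverless-devs.com/serverless-devs/install):\n    - [Install the Serverless Devs Cli developer tool](https://www.serverless-devs.com/serverless-devs/install) and [configure your credentials](https://www.serverless-devs.com/fc/config);\n    - Initialize the project: `s init developer-spider -d developer-spider`   \n    - Enter the project directory and deploy it: `cd developer-spider \u0026\u0026 s deploy -y`\n\n\u003c/deploy\u003e\n\n\u003cappdetail id=\"flushContent\"\u003e\n\n# Application Details\n\nThis application is a Python crawler example. It mainly covers:\n- Getting a random request header\n- Building a proxy IP pool\n- Removing a proxy IP\n- Getting a proxy IP     \n\n## Getting a random request header\n\nAnti-crawler systems usually validate request headers, and the header they check most often is `User-Agent`, so this method returns a randomly chosen `User-Agent`. If the `User-Agent` values already listed do not meet your needs, you can add more.\n\n\u003e tips: `User-Agent` is useful for more than getting past anti-crawler checks; it can also make data collection easier. For example, on some sites a request with a mobile `User-Agent` triggers a much lower level of anti-crawler protection than the desktop version, so `User-Agent` can also be used, to some extent, to switch client types.\n\n## Proxies\n\nBecause proxy IPs generally have to be paid for, the proxy IP part of this example is for learning and reference only.\n\nThe proxy IP provider used in this example comes from the Alibaba Cloud Marketplace: https://market.aliyun.com/products/57126001/cmapi00037885.html\n\nDevelopers can extend the proxy IP retrieval method in this part to suit their own needs.\n\n\u003e tips: The proxy IP strategy used here is: when the current IP stops working, remove it and switch to another proxy IP. This strategy does not suit every data collection scenario. For example, if a site's anti-crawler policy is per-IP rate limiting, then to get past the rate limit you can switch proxy IPs on every request, or after each chain of related requests completes, without removing proxy IPs, and reuse them in rotation.\n\n## Notes on the main method\n\n###  Loop condition\n\nThe loop in this example runs from 1 to 10 and iterates over page numbers, but a real crawler may use other approaches:\n\n1. Loop based on the pages returned in the data;\n2. Decide whether to keep looping based on the number of items returned;\n3. 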
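Decide whether to keep looping based on an existing list\n\nMany other loop conditions are possible; adjust this part to your actual needs.\n\n### Condition for switching IPs\n\nThe code includes a placeholder branch that shows where the conditions for switching the IP, switching the UA, or removing an IP belong. For example, when a specified string appears in response, the current IP should be removed and the IP and UA switched:\n\n```\nif 'xxxxx' in response:\n    proxy = getProxy()\n    headers[\"User-Agent\"] = getUserAgent()\n    response_status = False\n    # Trigger the retry logic and try again\n    continue\n```\n\n### Downstream data processing\n\nDownstream processing is not covered in this document. Typically the data is persisted in a database such as MongoDB, or handed to downstream cleaning logic for data cleaning and related operations.\n\n## Timer and testing\n\nThis example uses a scheduled task. In production you may instead need an event-triggered crawler, for example one that starts collecting when data is written to OSS, or one that collects data via a url. Adjust this part to your project's needs.\n\nThe scheduled task is configured in the project as follows:\n\n```\ntriggers:\n  - name: timer\n    type: timer\n    config:\n      cronExpression: '@every 100m'\n      enable: true\n```\n\nAfter deployment, click the function:\n\n![](http://image.editor.devsapp.cn/evBw7lh8ktv6xDBzSSzvjr1ykchAF9hG41gf1ek1sk8tr4355A/zraBABufG3ta2AGrzA82)\n\nThis opens the function details page, where you can click Run to see the test result:\n\n![](http://image.editor.devsapp.cn/evBw7lh8ktv6xDBzSSzvjr1ykchAF9hG41gf1ek1sk8tr4355A/ig4E84uS4twBEtuz6vkZ)\n\n\n\n\n\n\n\n\n\u003c/appdetail\u003e\n\n\u003cdevgroup\u003e\n\n## Developer community\n\nIf you have bug reports or feature requests, you can share and discuss them in [Serverless Devs repo Issues](https://github.com/serverless-devs/serverless-devs/issues). To join our discussion groups or follow the latest news about the FC component, use the following channels:\n\n\u003cp align=\"center\"\u003e\n\n| \u003cimg src=\"https://serverless-article-picture.oss-cn-hangzhou.aliyuncs.com/1635407298906_20211028074819117230.png\" width=\"130px\" \u003e | \u003cimg src=\"https://serverless-article-picture.oss-cn-hangzhou.aliyuncs.com/1635407044136_20211028074404326599.png\" width=\"130px\" \u003e | \u003cimg src=\"https://serverless-article-picture.oss-cn-hangzhou.aliyuncs.com/1635407252200_20211028074732517533.png\" width=\"130px\" \u003e |\n|--- | --- | --- |\n| \u003ccenter\u003eWeChat official account: `serverless`\u003c/center\u003e | \u003ccenter\u003eWeChat assistant: `xiaojiangwh`\u003c/center\u003e | \u003ccenter\u003eDingTalk group: `33947367`\u003c/center\u003e | 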
\n\n\u003c/p\u003e\n\n\u003c/devgroup\u003e","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanycodes%2Fdeveloper-spider","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fanycodes%2Fdeveloper-spider","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanycodes%2Fdeveloper-spider/lists"}