{"id":13468007,"url":"https://github.com/SpiderClub/haipproxy","last_synced_at":"2025-03-26T03:31:25.546Z","repository":{"id":39722312,"uuid":"103733273","full_name":"SpiderClub/haipproxy","owner":"SpiderClub","description":":sparkling_heart: High available distributed ip proxy pool, powerd by Scrapy and Redis","archived":false,"fork":false,"pushed_at":"2022-12-26T11:50:58.000Z","size":1215,"stargazers_count":5439,"open_issues_count":46,"forks_count":912,"subscribers_count":206,"default_branch":"master","last_synced_at":"2024-10-29T14:55:48.878Z","etag":null,"topics":["crawler","distributed","high-availability","ipproxy","redis","scheduler","scrapy","spider"],"latest_commit_sha":null,"homepage":"https://spiderclub.github.io/haipproxy/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SpiderClub.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-09-16T07:14:27.000Z","updated_at":"2024-10-28T12:14:53.000Z","dependencies_parsed_at":"2023-01-31T00:01:10.440Z","dependency_job_id":null,"html_url":"https://github.com/SpiderClub/haipproxy","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SpiderClub%2Fhaipproxy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SpiderClub%2Fhaipproxy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SpiderClub%2Fhaipproxy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SpiderClub%2Fhaipproxy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owner
s/SpiderClub","download_url":"https://codeload.github.com/SpiderClub/haipproxy/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245584843,"owners_count":20639632,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","distributed","high-availability","ipproxy","redis","scheduler","scrapy","spider"],"created_at":"2024-07-31T15:01:04.081Z","updated_at":"2025-03-26T03:31:23.531Z","avatar_url":"https://github.com/SpiderClub.png","language":"Python","readme":"# 高可用IP代理池\n[README](README_EN.md)　｜　[中文文档](README.md)\n\n本项目所采集的IP资源都来自互联网，愿景是为大型爬虫项目提供一个**高可用低延迟的高匿IP代理池**。\n\n# 项目亮点\n- 代理来源丰富\n- 代理抓取提取精准\n- 代理校验严格合理\n- 监控完备，鲁棒性强\n- 架构灵活，便于扩展\n- 各个组件分布式部署\n\n# 快速开始\n\n注意，代码请在[release](https://github.com/SpiderClub/haipproxy/releases)列表中下载，**master**分支的代码不保证能稳定运行\n\n## 单机部署\n\n### 服务端\n- 安装Python3和Redis。有问题可以阅读[这篇文章](https://github.com/SpiderClub/weibospider/wiki/%E5%88%86%E5%B8%83%E5%BC%8F%E7%88%AC%E8%99%AB%E7%8E%AF%E5%A2%83%E9%85%8D%E7%BD%AE)的相关部分。\n- 根据Redis的实际配置修改项目配置文件[config/settings.py](config/settings.py)中的`REDIS_HOST`、`REDIS_PASSWORD`等参数。\n- 安装[scrapy-splash](https://github.com/scrapy-plugins/scrapy-splash)，并修改配置文件[config/settings.py](config/settings.py)中的`SPLASH_URL`\n- 安装项目相关依赖\n  \u003e pip install -r requirements.txt\n- 启动*scrapy worker*，包括代理IP采集器和校验器\n  \u003e python crawler_booter.py --usage crawler\n\n  \u003e python crawler_booter.py --usage validator\n- 启动*调度器*，包括代理IP定时调度和校验\n  \u003e python scheduler_booter.py --usage crawler\n\n  \u003e python scheduler_booter.py --usage validator\n\n### 
Client\nA recurring question is how to obtain the list of usable proxy IPs from this project. `haipproxy` does not serve proxies through an HTTP API; instead it provides concrete clients.\nCurrently a [Python client](client/py_cli.py) and a language-agnostic [squid secondary proxy](client/squid.py) are supported\n\n#### Python client example\n```python\nfrom client.py_cli import ProxyFetcher\nargs = dict(host='127.0.0.1', port=6379, password='123456', db=0)\n# 'zhihu' here means: fetch IPs from the validated proxy queue associated with zhihu\n# the rationale is that the same proxy IP performs differently on different target sites\nfetcher = ProxyFetcher('zhihu', strategy='greedy', redis_args=args)\n# get one usable proxy\nprint(fetcher.get_proxy())\n# get the list of usable proxies\nprint(fetcher.get_proxies()) # or print(fetcher.pool)\n```\n\nSee [examples/zhihu](examples/zhihu/zhihu_spider.py) for a more complete example\n\n#### Using squid as a secondary proxy\n- Install squid, back up its configuration file, and start squid, using Ubuntu as an example\n  \u003e sudo apt-get install squid\n\n  \u003e sudo cp /etc/squid/squid.conf /etc/squid/squid.conf.backup\n\n  \u003e sudo sed -i 's/http_access deny all/http_access allow all/g' /etc/squid/squid.conf\n\n  \u003e sudo service squid start\n- Set `SQUID_BIN_PATH`, `SQUID_CONF_PATH`, `SQUID_TEMPLATE_PATH`, and related parameters in [config/settings.py](config/settings.py) according to your operating system\n- Start the program that periodically refreshes the squid configuration\n  \u003e sudo python squid_update.py\n- Request target sites through squid as a proxy middle layer. The default proxy URL is 'http://squid_host:3128'; a Python example:\n  ```python\n  import requests\n  proxies = {'https': 'http://127.0.0.1:3128'}\n  resp = requests.get('https://httpbin.org/ip', proxies=proxies)\n  print(resp.text)\n  ```\n\n## Docker Deployment\n- Install Docker\n\n- Install *docker-compose*\n  \u003e pip install -U docker-compose\n\n- Set the `SPLASH_URL` and `REDIS_HOST` parameters in [settings.py](config/settings.py)\n  ```python\n  # note: this step can be skipped if you use the code on the master branch\n  SPLASH_URL = 'http://splash:8050'\n  REDIS_HOST = 'redis'\n  ```\n- Start all application components with *docker-compose*\n  \u003e docker-compose up\n\nThis approach also deploys `squid`, so you can consume the proxy pool either through `squid` or through a client, just as in a standalone deployment\n\n# Caveats\n- This project depends heavily on Redis: besides message passing and data storage, proxy validation and the scheduled-task tooling also rely on several Redis data structures.\nWeigh the cost carefully before replacing Redis with something else\n- Because of the *GFW*, some sites can only be reached and crawled from outside the firewall. If you cannot access such sites, please, in [rules.py](config/rules.py),\nfor the tasks whose `task_queue` is ` 
SPIDER_GFW_TASK` or `SPIDER_AJAX_GFW_TASK`, set the `enable` attribute to 0, or specify the spider types `common` and\n`ajax` when starting the crawler\n  \u003e python crawler_booter.py --usage crawler common ajax\n- The same proxy IP can perform very differently on different sites. If the generic proxies do not meet your needs, you can [write a proxy IP validator for a specific site](https://github.com/SpiderClub/haipproxy/blob/master/docs/%E9%92%88%E5%AF%B9%E7%89%B9%E5%AE%9A%E7%AB%99%E7%82%B9%E6%B7%BB%E5%8A%A0%E6%A0%A1%E9%AA%8C%E5%99%A8.md)\n\n# Workflow\n![](static/workflow.png)\n\n# Benchmarks\nWith `haipproxy` and the [test code](examples/zhihu/zhihu_spider.py) deployed in standalone mode and Zhihu as the target site, the measured crawling performance is as follows\n\n![](./static/zhihu.png)\n\nThe test code is in [examples/zhihu](examples/zhihu/zhihu_spider.py)\n\n# Monitoring (optional)\nMonitoring relies mainly on [sentry](https://sentry.io/welcome/) and [prometheus](https://prometheus.io/): instrumentation at key points tracks the project along several dimensions and improves its robustness\n\nThe project uses [Sentry](https://sentry.io/welcome/) as its bug-tracing tool; Sentry makes it easy to follow the project's health\n\n![](./static/bug_trace.jpg)\n\n\n[Prometheus](https://prometheus.io/) + [Grafana](https://grafana.com/) provide business monitoring and show the project's current state\n\n![](./static/monitor.png)\n\n# Donate\nOpen source is not easy. If this project is useful to you, consider a small donation to support its continued maintenance\n\n![](./static/donate.jpg)\n\n# 
Related Projects\nThis project drew on the various open-source crawler proxy implementations on GitHub. Thanks for their work; the projects the author consulted are listed below, in no particular order.\n\n[dungproxy](https://github.com/virjar/dungproxy)\n\n[proxyspider](https://github.com/zhangchenchen/proxyspider)\n\n[ProxyPool](https://github.com/henson/ProxyPool)\n\n[proxy_pool](https://github.com/jhao104/proxy_pool)\n\n[ProxyPool](https://github.com/WiseDoge/ProxyPool)\n\n[IPProxyTool](https://github.com/awolfly9/IPProxyTool)\n\n[IPProxyPool](https://github.com/qiyeboy/IPProxyPool)\n\n[proxy_list](https://github.com/gavin66/proxy_list)\n\n[proxy_pool](https://github.com/lujqme/proxy_pool)\n\n[ProxyPool](https://github.com/fengzhizi715/ProxyPool)\n\n[scylla](https://github.com/imWildCat/scylla)\n","funding_links":[],"categories":["Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSpiderClub%2Fhaipproxy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FSpiderClub%2Fhaipproxy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSpiderClub%2Fhaipproxy/lists"}