{"id":21077733,"url":"https://github.com/howie6879/hproxy","last_synced_at":"2025-05-16T08:31:08.238Z","repository":{"id":31598774,"uuid":"128378383","full_name":"howie6879/hproxy","owner":"howie6879","description":"hproxy - Asynchronous IP proxy pool, aims to make getting proxy as convenient as possible.(异步爬虫代理池)","archived":false,"fork":false,"pushed_at":"2021-12-13T19:42:01.000Z","size":126,"stargazers_count":66,"open_issues_count":5,"forks_count":14,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-03T22:11:16.236Z","etag":null,"topics":["asyncio","crawler","crawlers","hproxy","proxy","proxy-pool","proxy-spider","sanic","schedule"],"latest_commit_sha":null,"homepage":"https://hproxy.htmlhelper.org/api","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/howie6879.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-04-06T10:00:27.000Z","updated_at":"2024-11-22T13:33:51.000Z","dependencies_parsed_at":"2022-08-28T03:30:40.648Z","dependency_job_id":null,"html_url":"https://github.com/howie6879/hproxy","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/howie6879%2Fhproxy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/howie6879%2Fhproxy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/howie6879%2Fhproxy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/howie6879%2Fhproxy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/howie6879","download_url"
:"https://codeload.github.com/howie6879/hproxy/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254496101,"owners_count":22080651,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["asyncio","crawler","crawlers","hproxy","proxy","proxy-pool","proxy-spider","sanic","schedule"],"created_at":"2024-11-19T19:38:04.372Z","updated_at":"2025-05-16T08:31:03.230Z","avatar_url":"https://github.com/howie6879.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Hproxy - Asynchronous IP Proxy Pool\n\n[![Build Status](https://travis-ci.org/howie6879/hproxy.svg?branch=master)](https://travis-ci.org/howie6879/hproxy) [![Python](https://img.shields.io/badge/python-3.6%2B-orange.svg)](https://github.com/howie6879/hproxy) [![license](https://img.shields.io/github/license/howie6879/hproxy.svg)](https://github.com/howie6879/hproxy) \n\n### Overview\n\nThis project periodically crawls third-party proxy listing sites for working IPs, offers a free page-source fetching service, and builds an asynchronous IP proxy pool so that getting a valid proxy is as simple as possible:\n\n- Demo: https://hproxy.htmlhelper.org/api\n- Document: [中文](README.md) | [English](./README_EN.md)\n- Deploy: see the [deployment guide](./docs/deploy.md)\n\n### Getting Started\n\nThe project requires Python 3.6+. It uses Sanic to build the asynchronous HTTP service and `aiohttp` to fetch proxy data asynchronously.\n\n#### Running on a single machine\n\n``` shell\ngit clone https://github.com/howie6879/hproxy.git\ncd hproxy\npip install pipenv\n\n# Note: the virtual environment must use Python 3.6+; this installs the dependencies\npipenv install\n\ncd hproxy\npython server.py\n\n# Start the spiders by running /hproxy/hproxy/spider/spider_console.py\n# Then visit 127.0.0.1:8001/api/\n```\n\nBy default hproxy stores its data in Redis, so Redis must be installed first. The settings live under `config`:\n\n``` python\n# Database config\nREDIS_DICT = dict(\n    
REDIS_ENDPOINT=os.getenv('REDIS_ENDPOINT', \"localhost\"),\n    REDIS_PORT=os.getenv('REDIS_PORT', 6379),\n    REDIS_DB=os.getenv('REDIS_DB', 0),\n    REDIS_PASSWORD=os.getenv('REDIS_PASSWORD', None)\n)\nDB_TYPE = 'redis'\n```\n\nTo keep the data in the machine's own memory instead, change `DB_TYPE = 'redis'` to `DB_TYPE = 'memory'` in `config`.\n\nNote that in `memory` mode all data is lost when the service stops, so `redis` mode is recommended.\n\nTo use another storage backend, extend [BaseDatabase](https://github.com/howie6879/hproxy/blob/master/hproxy/database/base_database.py) following its coding conventions.\n\n### Features\n\n- [x] Multiple storage backends, easy to extend:\n  - [DatabaseSetting](https://github.com/howie6879/hproxy/blob/master/hproxy/database/db_setting.py)\n  - [Memory](https://github.com/howie6879/hproxy/blob/master/hproxy/database/backends/memory_database.py)\n  - [Redis](https://github.com/howie6879/hproxy/blob/master/hproxy/database/backends/redis_database.py)\n- [x] Custom spider building blocks that are easy to pick up and keep the code style consistent:\n  - [Field](https://github.com/howie6879/hproxy/blob/master/hproxy/spider/base/field.py)\n  - [Item](https://github.com/howie6879/hproxy/blob/master/hproxy/spider/base/item.py)\n- [x] An API for getting proxies; after startup, visit `127.0.0.1:8001/api`\n  - 'delete/:proxy': delete a proxy\n  - 'get': pick a random proxy\n  - 'list': list all proxies\n  - ...\n\n- [x] An HTML source fetching service that uses a proxy picked at random from the pool\n- [x] Scheduled crawling, updating, and automatic validation\n- [ ] Detailed proxy information, such as proxy type, protocol, and location\n\n### Functionality\n\n#### Getting proxies\n\nAll spider code lives in the [spider](https://github.com/howie6879/hproxy/tree/master/hproxy/spider) directory. The [/spider/proxy_spider/](https://github.com/howie6879/hproxy/tree/master/hproxy/spider/proxy_spider) directory defines a spider for each proxy site, and every spider follows the conventions defined in [/spider/base/proxy_spider.py](https://github.com/howie6879/hproxy/blob/master/hproxy/spider/base/proxy_spider.py). With these as a reference, adding further proxy spiders is straightforward.\n\nRunning [spider_console.py](https://github.com/howie6879/hproxy/blob/master/hproxy/spider/spider_console.py) starts all spiders and collects proxies. Newly added spider scripts do not need to be registered: as long as they follow the naming convention, the spider modules are discovered and run automatically.\n\nTo run a single proxy spider, execute its script directly. For example, for `xicidaili`:\n\n``` shell\ncd hproxy/hproxy/spider/proxy_spider/\npython xicidaili_spider.py\n\n# Validating 100 proxies asynchronously finishes in about 5 seconds, because the per-proxy timeout is 5 s\n# In the worst case, synchronous execution would...\n# 2018/04/14 13:42:32 
[爬虫执行结束  ] OK 爬虫：xicidaili 执行结束，获取代理100个 - 有效代理：28个，用时：5.384464740753174 \n```\n\n#### Validating proxies\n\nThe validation script for collected proxies is [valid_proxy](https://github.com/howie6879/hproxy/blob/master/hproxy/scheduler/valid_proxy.py). It is currently set to validate all proxies every 10 minutes and to discard a proxy after five failures. It normally runs in the background; to run it manually:\n\n``` shell\ncd hproxy/hproxy/scheduler/\npython valid_proxy.py\n```\n\n#### Proxy API\n\n| Endpoint                     | Description                                                  |\n| :--------------------------- | :----------------------------------------------------------- |\n| delete/:proxy                | Delete a proxy                                               |\n| get                          | With valid=1, the proxy is validated once before it is returned to make sure it works; if it fails, the search continues until a working proxy is found |\n| list                         | List all proxies without validating each one                 |\n| valid/:proxy                 | Validate a proxy                                             |\n| html?url=''\u0026ajax=0\u0026foreign=0 | Request the url through a randomly chosen proxy and return the result |\n\n``` json\n// http://127.0.0.1:8001/api/get?valid=1\n// Success response. With validation enabled (valid=1), speed has a value; validation is on by default\n// types 1: high anonymity 2: anonymous 3: transparent\n{\n    \"status\": 1,\n    \"info\": {\n        \"proxy\": \"101.37.79.125:3128\",\n        \"types\": 3\n    },\n    \"msg\": \"success\",\n    \"speed\": 2.4909408092\n}\n// http://127.0.0.1:8001/api/list lists all proxies without validating each one\n{\n    \"status\": 1,\n    \"info\": {\n        \"180.168.184.179:53128\": {\n            \"proxy\": \"180.168.184.179:53128\",\n            \"types\": 3\n        },\n        \"101.37.79.125:3128\": {\n            \"proxy\": \"101.37.79.125:3128\",\n            \"types\": 3\n        }\n    },\n    \"msg\": \"success\"\n}\n// http://127.0.0.1:8001/api/delete/171.39.45.6:8123\n{\n    \"status\": 1,\n    \"msg\": \"success\"\n}\n// http://127.0.0.1:8001/api/valid/183.159.91.75:18118\n{\n    \"status\": 1,\n    \"msg\": \"success\",\n    \"speed\": 0.3361008167\n}\n// http://127.0.0.1:8001/api/html?url=https://www.v2ex.com\n// Fetch v2ex through a randomly chosen proxy\n{\n    \"status\": 1,\n    \"info\": {\n        
\"html\": \"html source\",\n        \"proxy\": \"120.77.254.116:3128\"\n    },\n    \"msg\": \"success\"\n}\n```\n\n### FAQ\n\nQ: Why are only the IP and port scraped?\n\nA: The proxy details listed on the source sites are not always accurate and need further checking, so the project validates each proxy when returning it, checking both whether it works and what its actual details are.\n\nQ: How do I add a new storage backend?\n\nA: [BaseDatabase](https://github.com/howie6879/hproxy/blob/master/hproxy/database/base_database.py) defines the methods every subclass must implement; follow that interface and it will work.\n\nQ: How do I add a new proxy spider?\n\nA: Likewise, the conventions for writing spiders are in the [spider](https://github.com/howie6879/hproxy/tree/master/hproxy/spider) directory; alternatively, use any existing proxy spider script as a template.\n\n### License\n\nhproxy is offered under the MIT license.\n\n### References\n\nThanks to the following projects:\n\n- [IPProxyPool](https://github.com/qiyeboy/IPProxyPool)\n- [proxy_pool](https://github.com/jhao104/proxy_pool)\n\nThanks to the proxy sites used by this project. If you know a good proxy site, please submit it ^_^ at [#3](https://github.com/howie6879/hproxy/issues/3)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhowie6879%2Fhproxy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhowie6879%2Fhproxy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhowie6879%2Fhproxy/lists"}