Scrapy proxy pools
https://github.com/beanwei/scrapy-proxies
- Host: GitHub
- URL: https://github.com/beanwei/scrapy-proxies
- Owner: BeanWei
- License: MIT
- Created: 2018-05-11T12:53:41.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2018-05-12T12:30:33.000Z (about 7 years ago)
- Last Synced: 2025-01-15T05:44:50.278Z (5 months ago)
- Topics: proxypool, scrapy
- Language: Python
- Size: 15.6 KB
- Stars: 3
- Watchers: 1
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Scrapy-proxies
A simple dynamic proxy pool that keeps its proxies highly reliable by re-checking them at a relatively high frequency.
# Code structure
The code has three parts:
* A Scrapy spider that crawls proxy sites, collects free proxies, validates them, and stores them in the pool (proxy_fetch)
* A Scrapy spider that re-validates every proxy in the pool and removes any that fail (proxy_check; see the sketch after this list)
* A scheduler that manages the two spiders above (start.py)
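
The README does not include proxy_check's code. Below is a minimal sketch of the validation idea, written with the requests library instead of a Scrapy spider for brevity; the Redis connection parameters and pool key mirror the middleware shown later, and the test URL is a placeholder assumption.
```python
import redis
import requests

POOL_KEY = 'hq-proxies:proxy_pool'   # same Redis set the middleware reads
TEST_URL = 'https://httpbin.org/ip'  # placeholder validation page

def check_pool():
    redis_db = redis.StrictRedis(host='127.0.0.1', port=6379, db=6)
    for raw in redis_db.smembers(POOL_KEY):
        proxy = raw.decode('utf-8')
        try:
            # Route a test request through the proxy; any error marks it dead
            requests.get(TEST_URL,
                         proxies={'http': proxy, 'https': proxy},
                         timeout=5)
        except requests.RequestException:
            redis_db.srem(POOL_KEY, raw)  # drop dead proxies from the pool
```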
# Deployment
Before deploying, edit the configuration file hq-proxies.yml; the relevant thresholds, the free-proxy sources, and the test page can all be adjusted there.
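
The README does not show the contents of hq-proxies.yml, so the sketch below is purely hypothetical: it assumes keys for the Redis connection, the pool set name used by the middleware below, a validation page, and a pool-size threshold. The real file's keys may differ.
```yaml
# hq-proxies.yml -- hypothetical layout, not the project's actual schema
redis:
  host: 127.0.0.1
  port: 6379
  db: 6
pool_key: hq-proxies:proxy_pool    # Redis set the middleware reads
test_url: https://httpbin.org/ip   # page used to validate proxies
min_pool_size: 10                  # threshold that triggers proxy_fetch
```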
```python
# Downloader middleware: assigns a random proxy from the Redis pool
import random

import redis


class DynamicProxyMiddleware(object):
    def process_request(self, request, spider):
        # Connect to the Redis instance that holds the proxy pool
        redis_db = redis.StrictRedis(
            host='127.0.0.1',
            port=6379,
            password='',
            db=6,
        )
        # smembers returns bytes, so decode the chosen proxy to str
        proxy = random.choice(list(redis_db.smembers('hq-proxies:proxy_pool'))).decode('utf-8')
        spider.logger.debug('Using proxy [%s] for [%s]' % (proxy, request.url))
        request.meta['proxy'] = proxy
```
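
To activate the middleware, register it in the Scrapy project's settings.py. The module path below is an assumption; adjust it to wherever DynamicProxyMiddleware actually lives in this project.
```python
# settings.py -- module path is hypothetical; match your project layout
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DynamicProxyMiddleware': 543,
}
```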