Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/kohn/HttpProxyMiddleware
A middleware for scrapy. Used to change HTTP proxy from time to time.
- Host: GitHub
- URL: https://github.com/kohn/HttpProxyMiddleware
- Owner: kohn
- License: mit
- Created: 2015-09-29T05:37:18.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2018-02-01T14:21:57.000Z (almost 7 years ago)
- Last Synced: 2024-08-03T22:06:23.101Z (4 months ago)
- Language: Python
- Size: 27.3 KB
- Stars: 328
- Watchers: 26
- Forks: 132
- Open Issues: 0
- Metadata Files:
- Readme: README.org
- License: LICENSE
Awesome Lists containing this project
- awesome-scrapy - HttpProxyMiddleware
README
* HttpProxyMiddleware
A middleware for Scrapy, used to change the HTTP proxy from time to
time. Initial proxies are stored in a file; during runtime, the
middleware fetches new proxies whenever it runs low on valid ones.
Related blog: [[http://www.kohn.com.cn/wordpress/?p=208]]
** fetch_free_proxyes.py
Used to fetch free proxies from the Internet. It can be modified to
suit your own proxy sources.
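The fetching logic depends on whichever proxy-list site you target. As
a rough illustration only, a fetcher might look like the sketch below;
the URL and the regex-based parsing are assumptions, not this
repository's actual code.
#+BEGIN_SRC python
import re
import urllib.request

def fetch_free_proxies(url="http://example.com/free-proxy-list"):
    # NOTE: hypothetical URL; point this at a real proxy-list page.
    html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
    # Collect bare ip:port pairs found anywhere in the page.
    return re.findall(r"\d{1,3}(?:\.\d{1,3}){3}:\d{2,5}", html)
#+END_SRC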
** Usage
*** settings.py
#+BEGIN_SRC python
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 351,
    # put this middleware after RetryMiddleware
    'crawler.middleware.HttpProxyMiddleware': 999,
}

DOWNLOAD_TIMEOUT = 10  # a timeout of 10-15 seconds is reasonable in practice
#+END_SRC
*** change proxy
Often we want to switch to a new proxy when our spider gets banned.
Detect that your IP has been banned and yield a new Request from your
Spider.parse method with:
#+BEGIN_SRC python
request.meta["change_proxy"] = True
#+END_SRC
Some proxies may return invalid HTML. If you hit an exception while
parsing a response, also yield a new request with:
#+BEGIN_SRC python
request.meta["change_proxy"] = True
#+END_SRC
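Putting both cases together, a spider callback might look like the
following sketch; looks_banned and the CSS selector are hypothetical
stand-ins for your own logic.
#+BEGIN_SRC python
def parse(self, response):
    if self.looks_banned(response):  # hypothetical ban check, e.g. a captcha page
        # Ask the middleware for a fresh proxy and re-crawl this URL.
        request = response.request.replace(dont_filter=True)
        request.meta["change_proxy"] = True
        yield request
        return
    try:
        title = response.css("title::text").get()
    except Exception:
        # Broken HTML from a bad proxy: retry with a new proxy as well.
        request = response.request.replace(dont_filter=True)
        request.meta["change_proxy"] = True
        yield request
        return
    yield {"title": title}
#+END_SRC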
*** spider.py
Your spider should specify a list of status codes it may legitimately
encounter while crawling. Any status code that is neither 200 nor in
the list is treated as the result of an invalid proxy, and the proxy
is discarded. For example:
#+BEGIN_SRC python
website_possible_httpstatus_list = [404]
#+END_SRC
This line tells the middleware that the website you are crawling may
return responses with status code 404, and that the proxy such a
request is using should not be discarded.
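For context, a minimal spider carrying this attribute might look like
the sketch below; the spider name and URL are placeholders.
#+BEGIN_SRC python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://www.example.com/"]
    # This site is expected to answer with 404 sometimes; the middleware
    # should not blame the proxy for such responses.
    website_possible_httpstatus_list = [404]

    def parse(self, response):
        yield {"status": response.status}
#+END_SRC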
** Test
Update the HttpProxyMiddleware.py path in
HttpProxyMiddlewareTest/settings.py.
#+BEGIN_SRC sh
cd HttpProxyMiddlewareTest
scrapy crawl test
#+END_SRC
The testing server is hosted on my VPS, so take it easy... DO NOT
waste too much of my data plan. You may start your own testing server
using IPBanTest, which is powered by Django.