https://github.com/aivarsk/scrapy-proxies

Random proxy middleware for Scrapy

Last synced: 7 months ago

Host: GitHub
URL: https://github.com/aivarsk/scrapy-proxies
Owner: aivarsk
License: mit
Created: 2013-01-21T20:42:14.000Z (over 12 years ago)
Default Branch: master
Last Pushed: 2019-10-01T06:30:18.000Z (about 6 years ago)
Last Synced: 2024-09-30T21:49:01.789Z (about 1 year ago)
Language: Python
Homepage:
Size: 18.6 KB
Stars: 1,653
Watchers: 59
Forks: 409
Open Issues: 38
Metadata Files:
- Readme: README.md
- License: LICENSE.txt

Awesome Lists containing this project

awesome - scrapy-proxies - Random proxy middleware for Scrapy (Scrapy Middleware)
awesome-scrapy - scrapy-proxies

README

Random proxy middleware for Scrapy (http://scrapy.org/)
=======================================================

Processes Scrapy requests using a random proxy from a list to avoid IP bans and
improve crawling speed.

Get your proxy list from sites like http://www.hidemyass.com/ (copy-paste it into a text file
and reformat it to the http://host:port format).
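The reformatting step can be scripted. The following is a minimal sketch, not part of scrapy_proxies: the helper name `normalize_proxy_lines` and the sample addresses are made up for illustration, and it assumes the copied list holds one `host:port` entry per line.

```python
def normalize_proxy_lines(lines, scheme="http"):
    """Turn raw 'host:port' entries into 'http://host:port' entries."""
    proxies = []
    for raw in lines:
        entry = raw.strip()
        # Skip empty lines left over from copy-pasting.
        if not entry:
            continue
        # Entries that already carry a scheme are kept unchanged.
        if "://" not in entry:
            entry = f"{scheme}://{entry}"
        proxies.append(entry)
    return proxies


# Hypothetical sample input, as it might look after copy-pasting.
raw_lines = [
    "10.0.0.1:8080",
    "",
    "http://10.0.0.2:3128",
    "  user:secret@10.0.0.3:8000  ",
]
cleaned = normalize_proxy_lines(raw_lines)
```

Writing `"\n".join(cleaned)` to a text file would then give a list in the format described above.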

Install
-------

The quick way:

    pip install scrapy_proxies

Or check out the source and run:

    python setup.py install

settings.py
-----------

    # Retry many times since proxies often fail
    RETRY_TIMES = 10
    # Retry on most error codes since proxies fail for different reasons
    RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
        'scrapy_proxies.RandomProxy': 100,
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    }

    # Proxy list containing entries like
    # http://host1:port
    # http://username:password@host2:port
    # http://host3:port
    # ...
    PROXY_LIST = '/path/to/proxy/list.txt'

    # Proxy mode
    # 0 = Every request gets a different proxy
    # 1 = Take only one proxy from the list and assign it to every request
    # 2 = Use the custom proxy given in the settings
    PROXY_MODE = 0

    # If proxy mode is 2, uncomment this line:
    #CUSTOM_PROXY = "http://host1:port"

For older versions of Scrapy (before 1.0.0) you have to use the
scrapy.contrib.downloadermiddleware.retry.RetryMiddleware and
scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware
middlewares instead.
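To make the three PROXY_MODE values concrete, here is a rough sketch of how such a selection could behave. It is not the package's actual code: the class `ProxyChooser`, the constant names, and the example addresses are all hypothetical, and it assumes mode 0 means an independent random pick for each request.

```python
import random

# Mode numbers as described in the settings comments.
MODE_EVERY_REQUEST = 0
MODE_ONE_FROM_LIST = 1
MODE_CUSTOM = 2


class ProxyChooser:
    """Illustrative stand-in for the selection logic, not the real middleware."""

    def __init__(self, mode, proxies=(), custom_proxy=None, rng=None):
        self.proxies = list(proxies)
        self.rng = rng or random.Random()
        self.fixed = None  # set when one proxy serves every request
        if mode == MODE_CUSTOM:
            if not custom_proxy:
                raise ValueError("mode 2 needs a custom proxy")
            self.fixed = custom_proxy
        elif mode in (MODE_EVERY_REQUEST, MODE_ONE_FROM_LIST):
            if not self.proxies:
                raise ValueError("modes 0 and 1 need a non-empty proxy list")
            if mode == MODE_ONE_FROM_LIST:
                # Pick once up front and reuse it for every request.
                self.fixed = self.rng.choice(self.proxies)
        else:
            raise ValueError(f"unknown proxy mode: {mode}")

    def next_proxy(self):
        """Return the proxy URL to use for the next request."""
        if self.fixed is not None:
            return self.fixed
        return self.rng.choice(self.proxies)


# Placeholder addresses; real entries come from the PROXY_LIST file.
pool = ["http://host1:8080", "http://host2:8080", "http://host3:8080"]
per_request = ProxyChooser(MODE_EVERY_REQUEST, pool)
pinned = ProxyChooser(MODE_ONE_FROM_LIST, pool)
custom = ProxyChooser(MODE_CUSTOM, custom_proxy="http://host9:3128")
```

In this sketch, `pinned.next_proxy()` returns the same entry on every call, while `per_request.next_proxy()` may return any entry of `pool` each time.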

Your spider
-----------

In each callback, ensure that the proxy /really/ returned your target page by
checking for the site logo or some other significant element.
If not, retry the request with dont_filter=True:

    # response.xpath() replaces the older hxs.select() selector API
    if not response.xpath('//get/site/logo'):
        yield Request(url=response.url, dont_filter=True)
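Because dont_filter=True bypasses Scrapy's duplicate filter, a page that never validates would be re-requested indefinitely. One way to bound this, sketched here under assumptions, is to carry a counter in the request's meta dict. The helper `page_retry_kwargs`, the meta key `page_retries`, and the default limit of 5 are hypothetical names and values chosen for illustration, not part of scrapy_proxies.

```python
def page_retry_kwargs(url, meta, max_page_retries=5):
    """Build keyword arguments for a retry request, or None once the cap is hit.

    `meta` is the meta dict of the response being checked.
    """
    attempts = meta.get("page_retries", 0)
    if attempts >= max_page_retries:
        return None
    # Only the counter is carried over, so no other per-request state
    # (such as a previously assigned proxy) is copied to the retry.
    return {
        "url": url,
        "meta": {"page_retries": attempts + 1},
        "dont_filter": True,
    }


# Possible use inside a spider callback (assumes Scrapy is installed):
#
#     if not response.xpath('//get/site/logo'):
#         kwargs = page_retry_kwargs(response.url, response.meta)
#         if kwargs is not None:
#             yield scrapy.Request(callback=self.parse, **kwargs)

first = page_retry_kwargs("http://example.com/", {})
exhausted = page_retry_kwargs("http://example.com/", {"page_retries": 5})
```

Here `first` describes a retry with the counter set to 1, and `exhausted` is None because the limit has been reached.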
