Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/alecxe/scrapy-fake-useragent

Random User-Agent middleware based on fake-useragent
https://github.com/alecxe/scrapy-fake-useragent

python scrapy web-scraping

Last synced: 2 days ago
JSON representation

Random User-Agent middleware based on fake-useragent

Awesome Lists containing this project

README

        

.. image:: https://travis-ci.org/alecxe/scrapy-fake-useragent.svg?branch=master
:target: https://travis-ci.org/alecxe/scrapy-fake-useragent

.. image:: https://codecov.io/gh/alecxe/scrapy-fake-useragent/branch/master/graph/badge.svg
:target: https://codecov.io/gh/alecxe/scrapy-fake-useragent

.. image:: https://img.shields.io/pypi/pyversions/scrapy-fake-useragent.svg
:target: https://pypi.python.org/pypi/scrapy-fake-useragent
:alt: PyPI version

.. image:: https://badge.fury.io/py/scrapy-fake-useragent.svg
:target: http://badge.fury.io/py/scrapy-fake-useragent
:alt: PyPI version

.. image:: https://requires.io/github/alecxe/scrapy-fake-useragent/requirements.svg?branch=master
:target: https://requires.io/github/alecxe/scrapy-fake-useragent/requirements/?branch=master
:alt: Requirements Status

.. image:: https://img.shields.io/badge/license-MIT-blue.svg
:target: https://github.com/alecxe/scrapy-fake-useragent/blob/master/LICENSE.txt
:alt: Package license

scrapy-fake-useragent
=====================

Random User-Agent middleware for Scrapy scraping framework based on
`fake-useragent `__, which picks up ``User-Agent`` strings
based on `usage statistics `__
from a `real world database `__, but also has the option to configure a generator
of fake UA strings, as a backup, powered by
`Faker `__.

It also has the possibility of extending the
capabilities of the middleware, by adding your own providers.

Changes
----------

Please see `CHANGELOG`_.

Installation
-------------

The simplest way is to install it via `pip`:

pip install scrapy-fake-useragent

Configuration
-------------

Turn off the built-in ``UserAgentMiddleware`` and ``RetryMiddleware`` and add
``RandomUserAgentMiddleware`` and ``RetryUserAgentMiddleware``.

In Scrapy >=1.0:

.. code:: python

DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware': 401,
}

In Scrapy <1.0:

.. code:: python

DOWNLOADER_MIDDLEWARES = {
'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': None,
'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware': 401,
}

Recommended setting (1.3.0+):

.. code:: python

FAKEUSERAGENT_PROVIDERS = [
'scrapy_fake_useragent.providers.FakeUserAgentProvider', # this is the first provider we'll try
'scrapy_fake_useragent.providers.FakerProvider', # if FakeUserAgentProvider fails, we'll use faker to generate a user-agent string for us
'scrapy_fake_useragent.providers.FixedUserAgentProvider', # fall back to USER_AGENT value
]
USER_AGENT = ''

----------------

Additional configuration information
====================================

Enabling providers
---------------------------

The package comes with a thin abstraction layer of User-Agent providers, which for purposes of backwards compatibility defaults to:

.. code:: python

FAKEUSERAGENT_PROVIDERS = [
'scrapy_fake_useragent.providers.FakeUserAgentProvider'
]

The package has also ``FakerProvider`` (powered by `Faker library `__) and ``FixedUserAgentProvider`` implemented and available for use if needed.

Each provider is enabled individually, and used in the order they are defined.
In case a provider fails execute (for instance, it can `happen `__ to fake-useragent because of it's dependency
with an online service), the next one will be used.

Example of what ``FAKEUSERAGENT_PROVIDERS`` setting may look like in your case:

.. code:: python

FAKEUSERAGENT_PROVIDERS = [
'scrapy_fake_useragent.providers.FakeUserAgentProvider',
'scrapy_fake_useragent.providers.FakerProvider',
'scrapy_fake_useragent.providers.FixedUserAgentProvider',
'mypackage.providers.CustomProvider'
]

Configuring fake-useragent
---------------------------

Parameter: ``FAKE_USERAGENT_RANDOM_UA_TYPE`` defaulting to ``random``.

Other options, as example:

* ``firefox`` to mimic only Firefox browsers
* ``msie`` to mimic Internet Explorer only
* etc.

You can also set the ``FAKEUSERAGENT_FALLBACK`` option, which is a ``fake-useragent`` specific fallback. For example:

.. code:: python

FAKEUSERAGENT_FALLBACK = 'Mozilla/5.0 (Android; Mobile; rv:40.0)'

What it does is, if the selected ``FAKE_USERAGENT_RANDOM_UA_TYPE`` fails to retrieve a UA, it will use
the type set in ``FAKEUSERAGENT_FALLBACK``.

Configuring faker
---------------------------

Parameter: ``FAKER_RANDOM_UA_TYPE`` defaulting to ``user_agent`` which is the way of selecting totally random User-Agents values.
Other options, as example:

* ``chrome``
* ``firefox``
* ``safari``
* etc. (please refer to `Faker UserAgent provider documentation `_ for the available options)

Configuring FixedUserAgent
---------------------------

It also comes with a fixed provider (only provides one user agent), reusing the Scrapy's default ``USER_AGENT`` setting value.

Usage with `scrapy-proxies`
---------------------------

To use with middlewares of random proxy such as `scrapy-proxies `_, you need:

1. set ``RANDOM_UA_PER_PROXY`` to True to allow switch per proxy

2. set priority of ``RandomUserAgentMiddleware`` to be greater than ``scrapy-proxies``, so that proxy is set before handle UA

License
----------

The package is under MIT license. Please see `LICENSE`_.

.. |GitHub version| image:: https://badge.fury.io/gh/alecxe%2Fscrapy-fake-useragent.svg
:target: http://badge.fury.io/gh/alecxe%2Fscrapy-fake-useragent
.. |Requirements Status| image:: https://requires.io/github/alecxe/scrapy-fake-useragent/requirements.svg?branch=master
:target: https://requires.io/github/alecxe/scrapy-fake-useragent/requirements/?branch=master
.. _LICENSE: https://github.com/alecxe/scrapy-fake-useragent/blob/master/LICENSE.txt
.. _CHANGELOG: https://github.com/alecxe/scrapy-fake-useragent/blob/master/CHANGELOG.rst