
=============================
zyte-spider-templates-project
=============================

This is a starting template for a `Scrapy <https://scrapy.org/>`_ project,
with built-in integration with Zyte technologies (`scrapy-zyte-api
<https://github.com/scrapy-plugins/scrapy-zyte-api>`_,
`zyte-spider-templates`_).

Requirements
============

* Python 3.8+
* Scrapy 2.11+
* zyte-spider-templates

You also need a `Zyte API`_ subscription for Zyte API features, including AI-powered spiders.

.. _Zyte API: https://docs.zyte.com/zyte-api/get-started.html

First steps
===========

After you clone this repository, follow these steps to make it yours:

#. Rename the ``zyte_spider_templates_project`` folder to a valid Python
   module name that you would like to use as your project ID, and update
   ``scrapy.cfg`` and ``<project ID>/settings.py`` (``BOT_NAME``,
   ``SPIDER_MODULES``, ``NEWSPIDER_MODULE`` and ``SCRAPY_POET_DISCOVER``
   settings) accordingly, as illustrated in the sketch after this list.

#. For local development, assign your Zyte API key to the ``ZYTE_API_KEY``
   environment variable, for example, using `direnv <https://direnv.net/>`_.

   .. note:: `Scrapy Cloud <https://www.zyte.com/scrapy-cloud/>`_
      automatically provides the Zyte API key for jobs, if you have a
      subscription.

#. Remove or replace the ``LICENSE`` and ``README.rst`` files.

#. Delete ``.git``, and start a fresh Git repository::

      git init
      git add -A
      git commit -m "Initial commit"

#. Create a Python virtual environment and install ``requirements.txt`` into
   it::

      python3 -m venv venv
      . venv/bin/activate
      pip install -r requirements.txt
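
For illustration, if you chose ``myproject`` as your project ID (a
hypothetical name), the settings mentioned above would end up roughly like
the following sketch in ``myproject/settings.py``. The template ships the
exact values; the zyte-spider-templates entries shown here are what keep its
built-in spiders and page objects available::

    # myproject/settings.py ("myproject" is a placeholder project ID)
    BOT_NAME = "myproject"

    SPIDER_MODULES = ["myproject.spiders", "zyte_spider_templates.spiders"]
    NEWSPIDER_MODULE = "myproject.spiders"

    # Modules that scrapy-poet scans for page objects.
    SCRAPY_POET_DISCOVER = ["myproject.pages", "zyte_spider_templates.pages"]

Remember to point the ``default`` entry of ``scrapy.cfg`` at the same
module, ``myproject.settings``.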

Usage
=====

This is an already created and configured Scrapy project, so when you follow
guides like the Scrapy Cloud tutorial you should skip most of the parts that
talk about creating and configuring the project. Still, you need some
additional configuration specific to your account. Here is a short guide for
using this project on Scrapy Cloud.

#. Create a Scrapy Cloud project on the Zyte dashboard if you don't have one
   yet.
#. Make sure you have a Zyte API subscription. For Scrapy Cloud runs the API
   key will be used automatically; for local runs you need to set a setting
   or an environment variable, as described in the first steps above.
#. Run ``shub login`` and enter your Scrapy Cloud API key.
#. Deploy your project with ``shub deploy 000000``, replacing ``000000`` with
   your Scrapy Cloud project ID (found in the project dashboard URL).
   Alternatively, put the project ID into the ``scrapinghub.yml`` file to be
   able to run simply ``shub deploy``.
#. Now you should be able to create smart spiders on your Scrapy Cloud
   project using the templates from this project.

For more information and more detailed descriptions of specific steps, you
can check:

* `The Scrapy documentation <https://docs.scrapy.org/>`_.
* The Scrapy Cloud tutorial in the Zyte docs.
* `The shub documentation <https://shub.readthedocs.io/>`_.
* `The Zyte API documentation
  <https://docs.zyte.com/zyte-api/get-started.html>`_.
* `The zyte-spider-templates documentation
  <https://zyte-spider-templates.readthedocs.io/>`_.

You can also run the spiders locally, for example::

    scrapy crawl ecommerce -a url="https://books.toscrape.com/" -o output.jsonl
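
If you prefer launching crawls from Python rather than the ``scrapy`` CLI, a
minimal script could look like this sketch (it assumes the project's default
settings and the built-in ``ecommerce`` template spider)::

    # run_ecommerce.py: run from the project root so the settings are found.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    # Write items to a JSON Lines file, like -o on the command line.
    settings.set("FEEDS", {"output.jsonl": {"format": "jsonlines"}})

    process = CrawlerProcess(settings)
    # Spider arguments (here, url) are passed as keyword arguments.
    process.crawl("ecommerce", url="https://books.toscrape.com/")
    process.start()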

Development
===========

By default, all spiders and page objects defined in `zyte-spider-templates`_
are available in this project. You can also:

- Subclass spiders from `zyte-spider-templates`_ or `write spiders from
  scratch <https://docs.scrapy.org/en/latest/topics/spiders.html>`_, as in
  the first sketch after this list.

  Define your spiders in Python files and modules within
  ``<project ID>/spiders/``.

- Use `web-poet <https://web-poet.readthedocs.io/>`_ and
  `scrapy-poet <https://scrapy-poet.readthedocs.io/>`_ to modify the parsing
  behavior of spiders, on all websites or only on specific ones, as in the
  second sketch after this list.

  Define your page objects in Python files and modules within
  ``<project ID>/pages/``.
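
As a minimal sketch of the first option, assuming ``myproject`` as the
project ID and the ``EcommerceSpider`` template from zyte-spider-templates::

    # myproject/spiders/my_ecommerce.py (hypothetical module)
    from zyte_spider_templates import EcommerceSpider


    class MyEcommerceSpider(EcommerceSpider):
        # Give the subclass its own name so it can be selected separately
        # from the built-in "ecommerce" template.
        name = "my_ecommerce"

And a sketch of the second option, overriding one field of the automatically
extracted product for a single (hypothetical) website, following web-poet
and zyte-common-items conventions::

    # myproject/pages/example_com.py (hypothetical module)
    from web_poet import field, handle_urls
    from zyte_common_items import AutoProductPage


    @handle_urls("example.com")
    class ExampleComProductPage(AutoProductPage):
        """Used only for example.com; other sites keep default parsing."""

        @field
        def name(self):
            # Reuse the auto-extracted name, just normalized; all other
            # fields keep the automatic extraction.
            return (self.product.name or "").strip()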

.. _zyte-spider-templates: https://github.com/zytedata/zyte-spider-templates