https://github.com/zytedata/zyte-spider-templates-project
- Host: GitHub
- URL: https://github.com/zytedata/zyte-spider-templates-project
- Owner: zytedata
- License: bsd-3-clause
- Created: 2023-10-20T10:31:04.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-09-17T13:31:39.000Z (3 months ago)
- Last Synced: 2024-09-17T16:52:40.185Z (3 months ago)
- Language: Python
- Size: 122 KB
- Stars: 11
- Watchers: 5
- Forks: 8
- Open Issues: 1
Metadata Files:
- Readme: README.rst
- License: LICENSE
README
=============================
zyte-spider-templates-project
=============================

This is a starting template for a Scrapy project, with built-in integration
with Zyte technologies (scrapy-zyte-api, `zyte-spider-templates`_).

Requirements
============

* Python 3.8+
* Scrapy 2.11+
* zyte-spider-templates

You also need a `Zyte API`_ subscription for Zyte API features, including
AI-powered spiders.

.. _Zyte API: https://docs.zyte.com/zyte-api/get-started.html

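
The requirements above can be checked before the first run. The following is
a minimal sketch, not part of this project: ``parse_version`` and
``check_requirements`` are hypothetical helper names, only the standard
library is used, and the sketch assumes that the API key is supplied through
the ``ZYTE_API_KEY`` environment variable described in the first steps.

.. code-block:: python

    import os
    import sys
    from importlib import metadata


    def parse_version(text):
        """Return the leading numeric components of a version string."""
        parts = []
        for piece in text.split("."):
            digits = ""
            for char in piece:
                if not char.isdigit():
                    break
                digits += char
            if not digits:
                break
            parts.append(int(digits))
        return tuple(parts)


    def check_requirements(environ=os.environ, python_version=sys.version_info[:2]):
        """Return a list of problems; an empty list means nothing is missing."""
        problems = []
        if tuple(python_version) < (3, 8):
            problems.append("Python 3.8+ is required")
        try:
            scrapy_version = metadata.version("Scrapy")
        except metadata.PackageNotFoundError:
            problems.append("Scrapy is not installed")
        else:
            if parse_version(scrapy_version) < (2, 11):
                problems.append(f"Scrapy 2.11+ is required, found {scrapy_version}")
        try:
            metadata.version("zyte-spider-templates")
        except metadata.PackageNotFoundError:
            problems.append("zyte-spider-templates is not installed")
        # Assumption: the key is read from this environment variable.
        if not environ.get("ZYTE_API_KEY"):
            problems.append("ZYTE_API_KEY is not set (needed for local runs)")
        return problems

Printing each entry of ``check_requirements()`` gives a quick list of what is
still missing in the active environment.
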
First steps
===========

After you clone this repository, follow these steps to make it yours:

#. Rename the ``zyte_spider_templates_project`` folder to a valid Python
   module name that you would like to use as your project ID, and update
   ``scrapy.cfg`` and ``/settings.py`` (``BOT_NAME``,
   ``SPIDER_MODULES``, ``NEWSPIDER_MODULE`` and ``SCRAPY_POET_DISCOVER``
   settings) accordingly.

#. For local development, assign your Zyte API key to the ``ZYTE_API_KEY``
   environment variable, for example, using direnv.

   .. note:: Scrapy Cloud automatically provides the Zyte API key to jobs
      if you have a subscription.

#. Remove or replace the ``LICENSE`` and ``README.rst`` files.

#. Delete ``.git``, and start a fresh Git repository::

       git init
       git add -A
       git commit -m "Initial commit"

#. Create a Python virtual environment and install ``requirements.txt`` into
   it::

       python3 -m venv venv
       . venv/bin/activate
       pip install -r requirements.txt

Usage
=====

This is an already created and configured Scrapy project, so when you follow
guides like the Scrapy Cloud tutorial, you should skip most of the parts that
talk about creating and configuring a project. You still need some
configuration specific to your account. Here is a short guide to using this
project on Scrapy Cloud:

#. Create a Scrapy Cloud project on the Zyte dashboard if you don't have one
   yet.
#. Make sure you have a Zyte API subscription. For Scrapy Cloud runs, the API
   key is used automatically; for local runs, you need to set a setting or
   an environment variable, as described in the first steps above.
#. Run ``shub login`` and enter your Scrapy Cloud API key.
#. Deploy your project with ``shub deploy 000000``, replacing ``000000`` with
   your Scrapy Cloud project ID (found in the project dashboard URL).
   Alternatively, put the project ID into the ``scrapinghub.yml`` file so
   that you can simply run ``shub deploy``.
#. You should now be able to create smart spiders on your Scrapy Cloud
   project using the templates from this project.

For more information and more detailed descriptions of specific steps, see:

* The Scrapy documentation.
* The Scrapy Cloud tutorial.
* The shub documentation.
* The Zyte API documentation.
* The zyte-spider-templates documentation.

You can also run the spiders locally, for example::

    scrapy crawl ecommerce -a url="https://books.toscrape.com/" -o output.jsonl

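
Once a local run like the one above has written ``output.jsonl``, a quick way
to inspect the result is to count the items and the fields they carry. The
following is a minimal sketch using only the standard library;
``summarize_jsonl`` is a hypothetical helper, and it makes no assumption about
which fields the spider emits.

.. code-block:: python

    import json
    from collections import Counter
    from pathlib import Path


    def summarize_jsonl(path):
        """Count the records in a JSON Lines file and how often each key appears."""
        total = 0
        keys = Counter()
        with Path(path).open(encoding="utf-8") as lines:
            for line in lines:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines between records
                record = json.loads(line)
                total += 1
                if isinstance(record, dict):
                    keys.update(record.keys())
        return total, keys

For example, ``total, keys = summarize_jsonl("output.jsonl")`` followed by
``keys.most_common()`` lists the fields from most to least frequent, which
helps spot fields that are missing from some items.
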
Development
===========

By default, all spiders and page objects defined in `zyte-spider-templates`_
are available in this project. You can also:

- Subclass spiders from `zyte-spider-templates`_ or write spiders from
  scratch.

  Define your spiders in Python files and modules within ``/spiders/``.

- Use web-poet and scrapy-poet to modify the parsing behavior of spiders, for
  all, some, or specific websites.

  Define your page objects in Python files and modules within ``/pages/``.

.. _zyte-spider-templates: https://github.com/zytedata/zyte-spider-templates
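
Both directories above live inside the project module, so renaming that
module (the first of the first steps) touches every place where the old name
is written. The following is a minimal sketch of that text replacement, using
only the standard library. ``rename_references`` and ``rename_in_files`` are
hypothetical helpers, and the sketch assumes that the old name appears
literally in the files that need updating. It matches whole words only, so
references to the separate ``zyte_spider_templates`` package are left alone.

.. code-block:: python

    import re
    from pathlib import Path

    OLD_NAME = "zyte_spider_templates_project"


    def rename_references(text, new_name, old_name=OLD_NAME):
        """Replace whole-word occurrences of old_name with new_name in text."""
        if not new_name.isidentifier():
            raise ValueError(f"{new_name!r} is not a valid Python module name")
        pattern = re.compile(rf"\b{re.escape(old_name)}\b")
        return pattern.sub(new_name, text)


    def rename_in_files(paths, new_name):
        """Rewrite each existing file in place and return the paths that changed."""
        changed = []
        for path in map(Path, paths):
            if not path.is_file():
                continue  # skip paths that do not exist
            original = path.read_text(encoding="utf-8")
            updated = rename_references(original, new_name)
            if updated != original:
                path.write_text(updated, encoding="utf-8")
                changed.append(path)
        return changed

Passing ``scrapy.cfg`` and the project settings file to ``rename_in_files``
covers the files named in the first steps. Reviewing the result with
``git diff`` before committing is a sensible precaution.
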