# scrapy-templates
Scrapy spider templates for different kinds of websites.
## How to figure out which template you need?
The quickest way is to work out your **crawling logic** first: which requests you
need to make to get to the data, in terms of behaviour and depth.
### Behaviour
Behaviour describes what the spider does on the level it is currently on.
It can be one of the following (all three are shown in the sketch after this list):
- **Extraction** (extracting data fields)
- **Following** (making a request to go deeper into the website, e.g. from `example.com`
to `example.com/page`)
- **Pagination** (making a request to paginate; the spider stays on the same level, e.g.
from `example.com/page/1` to `example.com/page/2`)
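To make the three behaviours concrete, here is a minimal, hypothetical sketch combining all of them in a single Scrapy spider. The spider name, URLs, and CSS selectors are made-up placeholders, not taken from any of the templates:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com/page/1"]

    def parse(self, response):
        # Extraction: pull data fields from the current page.
        for product in response.css(".product"):
            yield {"title": product.css("h2::text").get()}

        # Following: request a detail page, going one level deeper.
        for href in response.css(".product a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_detail)

        # Pagination: request the next page, staying on the same level.
        next_page = response.css(".next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_detail(self, response):
        # Extraction, one level deeper than the start url.
        yield {"description": response.css(".description::text").get()}
```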
### Depth
Depth means how deep your spider is at the moment (while crawling),
relative to the start url.
For example, if your spider starts at `example.com`, that's *level 0*. When you make a
request from that page to `example.com/page`, your spider is on *level 1*. Then, if you
go to `example.com/page/sub-page`, that's *level 2*.
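In Scrapy terms, each level typically gets its own callback. A minimal sketch of the chain described above, with hypothetical callback names:

```python
import scrapy


class LevelsSpider(scrapy.Spider):
    name = "levels"
    start_urls = ["https://example.com"]  # level 0

    def parse(self, response):
        # Level 0: follow to example.com/page.
        yield response.follow("/page", callback=self.parse_page)

    def parse_page(self, response):
        # Level 1: follow to example.com/page/sub-page.
        yield response.follow("/page/sub-page", callback=self.parse_sub_page)

    def parse_sub_page(self, response):
        # Level 2: extract data fields here.
        yield {"url": response.url}
```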
## Currently available templates
### - [Ext.py](templates/Ext.py)
1. Extract data fields (level 0)
### - [ExtPag.py](templates/ExtPag.py)
1. Extract data fields + paginate (level 0)
### - [Fol_Ext.py](templates/Fol_Ext.py)
1. Follow urls (level 0)
2. Extract data fields (level 1)
### - [Fol_ExtPag.py](templates/Fol_ExtPag.py)
1. Follow urls (level 0)
2. Extract data fields + paginate (level 1)
### - [Fol_Fol_Ext.py](templates/Fol_Fol_Ext.py)
1. Follow urls (level 0)
2. Follow urls (level 1)
3. Extract data fields (level 2)
### - [Fol_FolPag_Ext.py](templates/Fol_FolPag_Ext.py)
1. Follow urls (level 0)
2. Follow urls + paginate (level 1)
3. Extract data fields (level 2)
### - [FolPag_Ext.py](templates/FolPag_Ext.py)
1. Follow urls + paginate (level 0)
2. Extract data fields (level 1)
### - [sitemap.py](templates/sitemap.py)
1. Extract data from the pages listed in a sitemap (see the sketch below)
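For the sitemap case, Scrapy ships a built-in `SitemapSpider` base class. The sketch below shows one plausible shape of such a spider; the sitemap url and the selector are placeholders, not necessarily what `templates/sitemap.py` contains:

```python
from scrapy.spiders import SitemapSpider


class ExampleSitemapSpider(SitemapSpider):
    name = "example_sitemap"
    sitemap_urls = ["https://example.com/sitemap.xml"]

    # By default, SitemapSpider sends every url found in the
    # sitemap to parse(), where the data fields are extracted.
    def parse(self, response):
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }
```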
## File naming convention
Each template's file name is meant to clearly show the crawling logic of the
spider. Once you know the crawling logic you need for the website
and understand the naming convention, you should be able to
pick your template.
A template file's name contains all the behaviours a spider does, which can be:
- **Extraction** --> represented as `Ext` in the file name
- **Following** --> represented as `Fol` in the file name
- **Pagination** --> represented as `Pag` in the file name
Two behaviours in the file name are separated by an `_` (underscore) if the
second behaviour happens one level deeper. If they are not separated by an
underscore, they happen on the same level. For example, `Fol_ExtPag` means the
spider follows urls on level 0, then extracts data fields and paginates on level 1.
## How to contribute?
### Submit a new template
The most useful thing you can do is to submit a new spider template that
hasn't been made yet. You can do this as follows:
1. Fork this repo
2. Add the new template file you created
3. Submit a pull request according to the guidelines
### Request a new template to be made
If you have an idea for a template but don't feel like submitting a pull
request yourself, create an issue about it. Someone else may pick it up and implement it.