Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/koverholt/scrapy-site-downloader
Template project for downloading a site with Scrapy
Last synced: 9 days ago
- Host: GitHub
- URL: https://github.com/koverholt/scrapy-site-downloader
- Owner: koverholt
- License: apache-2.0
- Created: 2023-09-29T00:02:17.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-03-04T17:22:30.000Z (10 months ago)
- Last Synced: 2024-12-24T04:44:26.790Z (17 days ago)
- Language: Python
- Size: 11.7 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# scrapy-site-downloader
## Overview
Template project for downloading a site with Scrapy. Crawls a given website,
scrapes its pages, and saves them as HTML files, limited to the domain and URL
filters you configure.

## Steps to run
1. Clone this repository and `cd` into it
1. Install the dependencies using the following command:
```
pip install -r requirements.txt
```
1. Configure the `crawler/spiders/site.py` file for the site you want to crawl
1. Start the downloader using the following command (be sure to run this from
the repository root!):
```
scrapy crawl site
```
1. Refer to the
[Scrapy documentation](https://docs.scrapy.org/en/latest/topics/practices.html)
for best practices and other configuration options
1. When the crawler finishes, the HTML files will be located in the `/html`
directory