https://github.com/koverholt/scrapy-site-downloader

Template project for downloading a site with Scrapy

Host: GitHub
URL: https://github.com/koverholt/scrapy-site-downloader
Owner: koverholt
License: apache-2.0
Created: 2023-09-29T00:02:17.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-03-04T17:22:30.000Z (10 months ago)
Last Synced: 2024-12-24T04:44:26.790Z (17 days ago)
Language: Python
Size: 11.7 KB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# scrapy-site-downloader

## Overview

Template project for downloading a site with Scrapy. It crawls a website and
saves the pages as HTML files, limited to the domain and URL filters you
configure.

## Steps to run

1. Clone this repository and `cd` into it
1. Install the dependencies using the following command:
   ```
   pip install -r requirements.txt
   ```
1. Configure the `crawler/spiders/site.py` file for the site you want to crawl
1. Start the downloader using the following command (be sure to run this from
   the repository root!):
   ```
   scrapy crawl site
   ```
1. Refer to the
   [Scrapy documentation](https://docs.scrapy.org/en/latest/topics/practices.html)
   for best practices and other configuration options
1. When the crawler finishes, the HTML files will be located in the `/html`
   directory
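
To make the configuration step more concrete, here is a minimal standard-library sketch of the two decisions a site downloader has to make: whether a link falls inside the configured domain and URL filters, and where the downloaded page is written. It is not the code in `crawler/spiders/site.py`. The names `ALLOWED_DOMAIN`, `ALLOW_PATTERNS`, `DENY_PATTERNS`, `OUTPUT_DIR`, `should_follow`, and `output_path` are hypothetical, and the `example.com` values and the relative `html` directory are assumptions for illustration only.

```python
import re
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical settings; replace them with values for the site you want to crawl.
ALLOWED_DOMAIN = "example.com"
ALLOW_PATTERNS = [r"^/docs/"]  # URL paths to follow
DENY_PATTERNS = [r"\.pdf$"]    # URL paths to skip
OUTPUT_DIR = Path("html")      # assumed output directory


def should_follow(url: str) -> bool:
    """Return True if the URL is on the allowed domain and passes the filters."""
    parts = urlparse(url)
    host = parts.hostname or ""
    # Accept the domain itself and any of its subdomains.
    if host != ALLOWED_DOMAIN and not host.endswith("." + ALLOWED_DOMAIN):
        return False
    path = parts.path or "/"
    # Deny patterns take precedence over allow patterns.
    if any(re.search(pattern, path) for pattern in DENY_PATTERNS):
        return False
    return any(re.search(pattern, path) for pattern in ALLOW_PATTERNS)


def output_path(url: str) -> Path:
    """Map a page URL to a file under OUTPUT_DIR that mirrors the URL path."""
    path = urlparse(url).path.strip("/")
    if not path:
        path = "index"  # the site root has no path component
    if not path.endswith(".html"):
        path += ".html"
    return OUTPUT_DIR / path


if __name__ == "__main__":
    for url in ("https://example.com/docs/intro/", "https://example.com/blog/post"):
        print(url, should_follow(url), output_path(url))
```

In a Scrapy spider, these settings roughly correspond to the `allowed_domains` attribute and the `allow` and `deny` arguments of `LinkExtractor`. Note that `LinkExtractor` matches its patterns against the full URL, whereas this sketch matches only the path.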