https://github.com/santteegt/directv-scraper

Scraping the Latin America DirecTV programming guide by implementing a spider job using Scrapy.

docker-compose python scrapy scrapyd-api

Host: GitHub
URL: https://github.com/santteegt/directv-scraper
Owner: santteegt
License: mit
Created: 2018-10-11T22:42:00.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2018-10-11T22:46:43.000Z (over 7 years ago)
Last Synced: 2025-02-10T11:35:50.678Z (over 1 year ago)
Topics: docker-compose, python, scrapy, scrapyd-api
Language: Jupyter Notebook
Size: 16.6 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

## DirecTV Programming Guide Scraper

Scraping the Latin America DirecTV programming guide by implementing a spider job using Scrapy.

### Software Requirements

* Python 2/3
* pip
* Scrapy
* Docker

### Setup Instructions

```
~$ pip install scrapy scrapyd scrapyd-client
```

### Spider Configuration

* `TV_CHANNEL_RAGE`: sets the range of channels whose programming info is scraped. The default value is `(130, 600)`. You can change it in the [directv_spider.py](directvscraper/spiders/directv_spider.py) file.
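
How the spider consumes this tuple is not shown in this README. As a minimal sketch, assuming it is a `(start, stop)` pair with an exclusive upper bound (as in Python's built-in `range`), the channel numbers could be expanded as follows. The helper name `channel_numbers` is hypothetical and not taken from the repository.

```python
# Hypothetical sketch: expand a (start, stop) channel range into channel numbers.
# Assumption: the upper bound is exclusive, like Python's built-in range().
TV_CHANNEL_RAGE = (130, 600)  # default value documented above


def channel_numbers(channel_range=TV_CHANNEL_RAGE):
    """Return the list of channel numbers covered by the range tuple."""
    start, stop = channel_range
    if start >= stop:
        raise ValueError("range start must be lower than range stop")
    return list(range(start, stop))
```

A narrower tuple such as `(130, 140)` is a convenient way to test the spider on a handful of channels before running the full range.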

### Running Locally

```
~$ scrapy crawl directv -o directv.jl
```
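
The `-o directv.jl` option writes the scraped items in JSON Lines format, one JSON object per line. The sketch below shows one way to load such a file in Python. The function name `read_json_lines` is hypothetical, and no assumption is made about which fields the items contain.

```python
import io
import json


def read_json_lines(path):
    """Yield one decoded object per non-empty line of a JSON Lines file."""
    # io.open accepts the encoding argument on both Python 2 and Python 3.
    with io.open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, `items = list(read_json_lines("directv.jl"))` loads the whole output file into memory as a list of dictionaries.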

### Deployment

```
~ $ docker-compose up -d scrapyd
~ $ scrapyd-deploy -p directvscraper # builds the project egg and deploys it
```

* You can list the configured deploy targets with `scrapyd-deploy -l`, and the projects deployed on the `default` target with `scrapyd-deploy -L default`.

* To schedule the spider, run the following:

```
~ $ curl http://localhost:6800/schedule.json -d project=directvscraper -d spider=directv
```

* Alternatively, you can use the [Directv Programing Guide - Data Cleaning.ipynb](Directv%20Programing%20Guide%20-%20Data%20Cleaning.ipynb) notebook to schedule the spider and then clean the scraped data.

* To check the progress of the spider job, visit `http://localhost:6800/jobs`

* To cancel the job, run the following, appending the job ID after `job=`:

```
~ $ curl http://localhost:6800/cancel.json -d project=directvscraper -d job=
```
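
The same schedule and cancel calls can be issued from Python. The sketch below uses only the standard library (Python 3) and mirrors the `curl` commands above; it is not the code used in the notebook. The helper names `build_request`, `schedule_request`, `cancel_request`, and `send` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SCRAPYD_URL = "http://localhost:6800"  # same host and port as the curl commands


def build_request(endpoint, **params):
    """Build a form-encoded POST request for a Scrapyd endpoint without sending it."""
    body = urlencode(params).encode("utf-8")
    return Request("{}/{}".format(SCRAPYD_URL, endpoint), data=body, method="POST")


def schedule_request(project="directvscraper", spider="directv"):
    """Equivalent of: curl .../schedule.json -d project=... -d spider=..."""
    return build_request("schedule.json", project=project, spider=spider)


def cancel_request(job_id, project="directvscraper"):
    """Equivalent of: curl .../cancel.json -d project=... -d job=<job_id>"""
    return build_request("cancel.json", project=project, job=job_id)


def send(request):
    """Send the request and decode the JSON reply (requires a running Scrapyd)."""
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With Scrapyd running, `send(schedule_request())` should return a JSON reply that includes a `jobid` field, and that value can then be passed to `cancel_request`.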

### Debugging

To debug specific pages, open the Scrapy shell:

```
~ $ scrapy shell
```

### More about Scrapyd

More information about the available HTTP services can be found [here](https://doc.scrapy.org/en/0.12/topics/scrapyd.html).
