https://github.com/kallsyms/distscrape
Distributed scraping framework
- Host: GitHub
- URL: https://github.com/kallsyms/distscrape
- Owner: kallsyms
- Created: 2018-12-23T05:23:08.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2019-01-03T03:30:11.000Z (over 6 years ago)
- Last Synced: 2025-01-20T22:55:38.274Z (4 months ago)
- Language: Python
- Size: 81.1 KB
- Stars: 4
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# distscrape
## What?
Distscrape is a distributed scraping framework written in Python 3 built for large, distributed site scrapes/crawls.
It heavily uses asyncio (and asyncio modules) to be as quick as possible, while still ensuring no data is lost.
## Why?
I originally wrote distscrape to crawl YouTube for video annotations, which will be removed in early 2019.
I needed a system that could easily be pre-initialized with hundreds of millions of IDs to crawl, which standalone
scrapy could not handle. Something like [scrapy-redis](https://github.com/rmax/scrapy-redis) could have helped, but
I already wasn't a fan of scrapy's pipelining architecture, so I decided to write my own.
## How?
See [ARCHITECTURE.md](./ARCHITECTURE.md)
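ARCHITECTURE.md is the authoritative description. For readers who just want a feel for the general idea, here is a minimal, generic asyncio sketch of a pre-filled work queue whose workers re-queue failed IDs so that nothing is silently dropped. This is not distscrape's code or API: `fetch`, `worker`, `crawl`, `MAX_ATTEMPTS`, and the `fail_once` parameter are all hypothetical names made up for the illustration, and the "network request" is simulated.

```python
import asyncio

MAX_ATTEMPTS = 3  # hypothetical retry limit, not a distscrape setting


async def fetch(item_id, flaky):
    """Stand-in for a real network request (hypothetical, simulated)."""
    await asyncio.sleep(0)
    if item_id in flaky:
        # Fail the first time only, to simulate a transient error.
        flaky.discard(item_id)
        raise ConnectionError(f"transient failure for {item_id}")
    return f"data-for-{item_id}"


async def worker(queue, results, failed, flaky):
    """Pull (id, attempt) pairs off the queue until cancelled."""
    while True:
        item_id, attempt = await queue.get()
        try:
            results[item_id] = await fetch(item_id, flaky)
        except ConnectionError:
            if attempt < MAX_ATTEMPTS:
                # Re-queue before task_done() so queue.join() keeps waiting.
                queue.put_nowait((item_id, attempt + 1))
            else:
                # Record the ID instead of losing it.
                failed.append(item_id)
        finally:
            queue.task_done()


async def crawl(ids, num_workers=4, fail_once=()):
    """Pre-fill the queue with IDs, then run workers until it drains."""
    queue = asyncio.Queue()
    for item_id in ids:
        queue.put_nowait((item_id, 1))

    results, failed = {}, []
    flaky = set(fail_once)
    workers = [
        asyncio.create_task(worker(queue, results, failed, flaky))
        for _ in range(num_workers)
    ]

    await queue.join()  # returns once every queued item is marked done
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results, failed
```

Calling `asyncio.run(crawl(["a", "b", "c"], fail_once=["b"]))` returns the fetched results plus a list of IDs that exhausted their retries. A real distributed setup would presumably back the queue with shared storage rather than an in-process `asyncio.Queue`; that part is left out here.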
## Getting Started
The provided [test_yt_crawl.py](./test_yt_crawl.py) shows how the various component implementations can be pieced together.
## TODO
See [TODO.md](./TODO.md)