Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/zenrows/scaling-to-distributed-crawling
Repository for the Mastering Web Scraping in Python: Scaling to Distributed Crawling blogpost with the final code.
crawler crawling distributed python python3 scraping spider
Last synced: about 10 hours ago
- Host: GitHub
- URL: https://github.com/zenrows/scaling-to-distributed-crawling
- Owner: ZenRows
- License: mit
- Created: 2021-08-18T13:23:37.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2021-10-29T11:27:33.000Z (about 3 years ago)
- Last Synced: 2023-05-03T11:15:13.537Z (over 1 year ago)
- Topics: crawler, crawling, distributed, python, python3, scraping, spider
- Language: HTML
- Homepage: https://www.zenrows.com/blog/mastering-web-scraping-in-python-scaling-to-distributed-crawling
- Size: 116 KB
- Stars: 28
- Watchers: 4
- Forks: 7
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# crawling-scale-up
Repository for the [Mastering Web Scraping in Python: Scaling to Distributed Crawling](https://www.zenrows.com/blog/mastering-web-scraping-in-python-scaling-to-distributed-crawling) blogpost with the final code.
## Installation
You will need [Redis](https://redis.io/) and [Python 3](https://www.python.org/downloads/) installed. After that, install the necessary libraries with `pip install`.

```bash
pip install requests beautifulsoup4 playwright "celery[redis]"
npx playwright install
```

## Execute
Configure the Redis connection in the [repo file](./repo.py) and Celery in the [tasks file](./tasks.py).
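
One way to keep the two files in sync is to derive both configurations from the same settings. The sketch below is only an illustration and is not taken from the repository: the helper names `redis_settings` and `redis_url`, and the environment variables `REDIS_HOST`, `REDIS_PORT` and `REDIS_DB`, are all hypothetical.

```python
import os


def redis_settings(env=None):
    """Return Redis connection settings, with optional environment overrides.

    REDIS_HOST, REDIS_PORT and REDIS_DB are hypothetical variable names
    chosen for this sketch; the repository does not define them.
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("REDIS_HOST", "127.0.0.1"),
        "port": int(env.get("REDIS_PORT", "6379")),
        "db": int(env.get("REDIS_DB", "0")),
    }


def redis_url(settings):
    """Format the settings as a redis:// URL, the form Celery accepts as a broker."""
    return "redis://{host}:{port}/{db}".format(**settings)


settings = redis_settings()
print(redis_url(settings))
```

Under these assumptions, `repo.py` could unpack the dictionary into `redis.Redis(**settings)`, and `tasks.py` could pass the URL as the `broker` argument when creating the `Celery` app.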
You need to start Celery and then run the main script, which will start queueing pages to crawl.
```bash
celery -A tasks worker
```

```bash
python3 main.py
```

## Contributing
Pull requests are welcome. For significant changes, please open an issue first to discuss what you would like to change.

## License
[MIT](./LICENSE)