https://github.com/zenrows/scaling-to-distributed-crawling

Repository for the Mastering Web Scraping in Python: Scaling to Distributed Crawling blogpost with the final code.

crawler crawling distributed python python3 scraping spider



Host: GitHub
URL: https://github.com/zenrows/scaling-to-distributed-crawling
Owner: ZenRows
License: mit
Created: 2021-08-18T13:23:37.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2021-10-29T11:27:33.000Z (about 4 years ago)
Last Synced: 2025-03-27T21:05:21.079Z (8 months ago)
Topics: crawler, crawling, distributed, python, python3, scraping, spider
Language: HTML
Homepage: https://www.zenrows.com/blog/mastering-web-scraping-in-python-scaling-to-distributed-crawling
Size: 116 KB
Stars: 42
Watchers: 4
Forks: 9
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE


README

# crawling-scale-up

Repository for the [Mastering Web Scraping in Python: Scaling to Distributed Crawling](https://www.zenrows.com/blog/mastering-web-scraping-in-python-scaling-to-distributed-crawling) blogpost with the final code.

## Installation

You will need [Redis](https://redis.io/) and [Python 3](https://www.python.org/downloads/) installed. After that, install the necessary libraries with `pip install` and download the browser binaries that Playwright needs.

```bash
# Install the Python dependencies
pip install requests beautifulsoup4 playwright "celery[redis]"
# Download the browsers used by the Python Playwright package
playwright install
```
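To confirm that the Redis server is reachable before going further, a small check like the one below may help. It is only a sketch using the standard library: the function name `redis_is_up` is hypothetical (it is not part of this repository), and the default host and port assume a local Redis on its standard port `6379` without authentication.

```python
import socket


def redis_is_up(host="127.0.0.1", port=6379, timeout=2.0):
    """Return True if a server at host:port answers PING with +PONG.

    Hypothetical helper, not part of the repository. A Redis instance
    that requires a password replies with an error instead of +PONG,
    so this function returns False in that case.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            # Redis accepts inline commands terminated by CRLF.
            conn.sendall(b"PING\r\n")
            return conn.recv(16).startswith(b"+PONG")
    except OSError:
        # Covers connection refused, unreachable hosts, and timeouts.
        return False
```

Calling `redis_is_up()` from a Python shell returns `True` when a local, unauthenticated Redis answers, and `False` otherwise.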

## Execute

Configure the Redis connection in the [repo file](./repo.py) and Celery in the [tasks file](./tasks.py).
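One possible way to keep both files pointing at the same server is to build the connection URL in a single place. The sketch below is an assumption, not the repository's actual code: the helper `redis_url` and the environment variable names `REDIS_HOST`, `REDIS_PORT`, and `REDIS_DB` are hypothetical, and the usage lines in the comments only suggest how `tasks.py` and `repo.py` might consume the result.

```python
import os


def redis_url(host=None, port=None, db=None):
    """Build a redis:// URL from arguments or environment variables.

    Hypothetical helper: REDIS_HOST, REDIS_PORT and REDIS_DB are assumed
    variable names, with defaults for a local Redis instance.
    """
    host = host or os.environ.get("REDIS_HOST", "127.0.0.1")
    port = int(port if port is not None else os.environ.get("REDIS_PORT", "6379"))
    db = int(db if db is not None else os.environ.get("REDIS_DB", "0"))
    return f"redis://{host}:{port}/{db}"


# Possible usage (assumed layout, adapt to the actual files):
#   tasks.py: app = Celery("tasks", broker=redis_url())
#   repo.py:  connection = redis.Redis.from_url(redis_url(db=1))
```

Using a different database number for the crawler data than for the Celery broker keeps the two sets of keys apart, but whether that separation is needed depends on your setup.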

You need to start the Celery worker and then run the main script, which will start queueing pages to crawl.

```bash
celery -A tasks worker
```

```bash
python3 main.py
```

## Contributing

Pull requests are welcome. For significant changes, please open an issue first to discuss what you would like to change.

## License

[MIT](./LICENSE)
