Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/jeffbla/rent-house-crawler
This is a distributed web crawler using Scrapy, Redis, and Selenium. It is designed to handle various types of websites, including static, AJAX, and dynamic pages.
- Host: GitHub
- URL: https://github.com/jeffbla/rent-house-crawler
- Owner: JeffBla
- Created: 2024-08-09T13:23:52.000Z (5 months ago)
- Default Branch: master
- Last Pushed: 2024-09-12T03:08:57.000Z (4 months ago)
- Last Synced: 2024-11-01T02:23:40.716Z (about 2 months ago)
- Topics: ajax, distributed-systems, dynamic-site, mongodb, redis, selenium, static-site, web-crawling
- Language: Python
- Homepage:
- Size: 307 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Rent House Crawler
## Description
This is a distributed web crawler built with Scrapy, Redis, and Selenium. It is designed to handle various types of websites, including static, AJAX, and dynamic pages. By leveraging a distributed setup with Docker Compose, the system can be deployed across multiple machines to increase crawling speed and efficiency.
**Note:** each target site exercises a different crawling mode:
- ddroom -> AJAX
- housefun -> dynamic (rendered with Selenium; see the middleware sketch below)
- rakuya -> static
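One common way to handle the dynamic case is a Scrapy downloader middleware that renders flagged requests through headless Chrome. The sketch below is an assumption about the approach, not code from this repo; the class name, the `meta` flag, and the headless option are all illustrative:
```python
# Hypothetical middleware sketch: renders requests flagged with
# meta={"selenium": True} through headless Chrome before parsing.
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class SeleniumMiddleware:
    def __init__(self):
        options = Options()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(options=options)

    def process_request(self, request, spider):
        if not request.meta.get("selenium"):
            return None  # fall through to Scrapy's default downloader
        self.driver.get(request.url)
        return HtmlResponse(
            self.driver.current_url,
            body=self.driver.page_source,
            encoding="utf-8",
            request=request,
        )
```
A middleware like this would be enabled via `DOWNLOADER_MIDDLEWARES` in `settings.py`.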
### Architecture Overview
The architecture features a central queue managed by Redis, which distributes tasks to multiple Scrapy crawlers. The crawlers process the tasks and store the collected data in MongoDB.
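With scrapy-redis, that pattern typically means spiders subclass `RedisSpider` and block on a shared Redis list for start URLs, while an item pipeline writes results to MongoDB. A minimal sketch under those assumptions (the spider's fields, the Redis key, and the database/collection names are illustrative, not taken from the repo):
```python
# Sketch of the distributed pattern: Redis feeds URLs, MongoDB stores items.
import pymongo
from scrapy_redis.spiders import RedisSpider


class RakuyaSpider(RedisSpider):
    name = "rakuya"
    redis_key = "rakuya:start_urls"  # scrapy-redis pops URLs from this list

    def parse(self, response):
        # Illustrative selectors; the real spider's fields will differ.
        yield {
            "title": response.css("title::text").get(),
            "url": response.url,
        }


class MongoPipeline:
    """Stores every scraped item in a MongoDB collection."""

    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.collection = self.client["rent_house"]["listings"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item
```
Because every worker pops from the same `redis_key`, scaling out is just a matter of starting more identical containers pointed at the same Redis instance.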
## Prerequisites
The docker-compose file does not include a MongoDB container, so you must either extend it or run MongoDB yourself:
1. Set up MongoDB locally, or add a MongoDB service to the docker-compose file.
2. Adjust the environment variable so the project can find your MongoDB database (see the sketch below).
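For example, a Scrapy `settings.py` might resolve the connection string from the environment with a local fallback (the variable names `MONGO_URI` and `MONGO_DATABASE` are assumptions; check the project's settings for the actual names):
```python
# Hypothetical settings.py snippet: resolve MongoDB from the environment,
# falling back to a local instance for development.
import os

MONGO_URI = os.environ.get("MONGO_URI", "mongodb://localhost:27017")
MONGO_DATABASE = os.environ.get("MONGO_DATABASE", "rent_house")
```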
## Install
There are two ways to set it up.
1. Local setup
   1. Install the dependencies:
   ```bash
   pip install -r requirements.txt
   ```
   2. Push the start URLs to Redis (see the sketch at the end of this section).
   3. Run a spider: `scrapy crawl [ddroom/housefun/rakuya]`
2. Use docker-compose
   1. Build the Docker image:
   ```bash
   docker build -t scrapy_rent_crawler .
   ```
   2. Run docker compose:
   ```bash
   docker compose up -d
   ```
   To debug, omit the `-d` flag so the logs stay in the foreground.
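Finally, step 2 of the local setup seeds the queue. scrapy-redis pops start URLs from a per-spider list, by default named `<spider_name>:start_urls`, so pushing work to the `rakuya` spider might look like this (host, port, and the URL are example values):
```python
# Seed the shared queue: every idle crawler watching this key picks up work.
import redis

r = redis.Redis(host="localhost", port=6379)
r.lpush("rakuya:start_urls", "https://www.rakuya.com.tw/rent")
```
The same push works from the shell with `redis-cli lpush rakuya:start_urls <url>`.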