https://github.com/u123dev/scraping_olx
- Host: GitHub
- URL: https://github.com/u123dev/scraping_olx
- Owner: u123dev
- Created: 2025-01-23T15:34:34.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-01-23T15:41:09.000Z (4 months ago)
- Last Synced: 2025-01-23T16:34:43.081Z (4 months ago)
- Size: 0 Bytes
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
### Scraping Site OLX
This project scrapes ad listings from the OLX website and saves them to a database.
### Features:
- Scrapy framework. The internal OLX API is used to access some fields (e.g. phone numbers).
It is strongly recommended to use a proxy to avoid IP blocking when there is a large flow of requests (a spider sketch follows the container list below).
- The scraping app starts every 1 minute.
It analyzes the first 5 pages of the site to find newly created ads to save.
- Dumping the db starts daily at 12 am (timezone=Europe/Kiev) as a separate process (see the schedule sketch after the container list).
Path to dumps in root: ```dumps/```
- Configured logging system with log file rotation (5 files, 1 GB each); a rotation sketch also follows the list.
Path to logs: project root.

Applications are deployed in Docker containers:
- scrapy apps volume
- celery worker for scraping
- celery worker for dumping
- celery beat cron scheduler
- flower tasks monitoring
- redis as broker
- postgresql db
- db data volume
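A minimal sketch of what the Scrapy side could look like, assuming a hypothetical spider name, listing URL, CSS selectors and proxy address; the repository's real spider and the OLX markup may differ:

```
import scrapy


class OlxAdsSpider(scrapy.Spider):
    name = "olx_ads"  # hypothetical spider name
    # Hypothetical listing URL covering the first 5 pages mentioned above.
    start_urls = [f"https://www.olx.ua/list/?page={page}" for page in range(1, 6)]

    def start_requests(self):
        for url in self.start_urls:
            # Route the request through a proxy to reduce the risk of IP blocking.
            yield scrapy.Request(
                url,
                meta={"proxy": "http://user:pass@proxy-host:8000"},  # placeholder proxy
            )

    def parse(self, response):
        # Selectors are illustrative; the real OLX markup may differ.
        for ad in response.css("div[data-cy='l-card']"):
            yield {
                "title": ad.css("h6::text").get(),
                "url": response.urljoin(ad.css("a::attr(href)").get()),
            }
```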
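A hedged sketch of the schedule described above (scraping every minute, daily dump at midnight Europe/Kiev); the task module paths and broker URL are assumptions, not the repository's actual names:

```
from celery import Celery
from celery.schedules import crontab

app = Celery("scraping_olx", broker="redis://redis:6379/0")

# crontab entries below are interpreted in this timezone (per the README).
app.conf.timezone = "Europe/Kiev"

app.conf.beat_schedule = {
    # Run the scraping task every minute.
    "scrape-olx-every-minute": {
        "task": "tasks.scrape_olx",  # hypothetical task path
        "schedule": 60.0,            # seconds
    },
    # Dump the database once a day at 12 am Europe/Kiev time.
    "dump-db-daily": {
        "task": "tasks.dump_db",     # hypothetical task path
        "schedule": crontab(hour=0, minute=0),
    },
}
```

With this kind of configuration, the celery beat container schedules the tasks and the two worker containers execute them.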
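For the rotation policy (5 files, 1 GB each), one possible setup is the standard library's `RotatingFileHandler`; the log file name and logger name below are assumptions:

```
import logging
from logging.handlers import RotatingFileHandler

handler = RotatingFileHandler(
    "scraping_olx.log",      # hypothetical log file in the project root
    maxBytes=1 * 1024 ** 3,  # rotate when a file reaches ~1 GB
    backupCount=5,           # keep up to 5 rotated files
)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
)

logger = logging.getLogger("scraping_olx")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
```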
___

### Tech Stack & System requirements:
* Python 3.1+
* Scrapy
* SQLAlchemy ORM
* Alembic
* PostgreSQL Database
* Celery
* Redis (used as a Broker & Backend)
* Flower (monitoring for Celery)
* Docker Containerization
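A hedged sketch of how a scraped ad could be modeled with SQLAlchemy on top of PostgreSQL; the table name, columns and connection URL are assumptions rather than the real schema (which the Alembic migrations in the repository would define):

```
from sqlalchemy import Column, DateTime, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class Ad(Base):
    __tablename__ = "ads"  # hypothetical table name

    id = Column(Integer, primary_key=True)
    title = Column(String, nullable=False)
    url = Column(String, unique=True)      # uniqueness helps skip already-saved ads
    phone = Column(String, nullable=True)  # field fetched via the internal OLX API
    created_at = Column(DateTime)


# Placeholder URL; the real one would come from the .env settings.
engine = create_engine("postgresql+psycopg2://user:password@postgresql:5432/olx")
SessionLocal = sessionmaker(bind=engine)
```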
---

### Run with Docker containers
System requirements:
* **Docker Desktop 4.+**
Run project:
```
docker-compose up --build
```
Please note:
* Copy the [.env-sample](.env-sample) file to **.env** and set the environment variables (a sketch of reading them is shown below).
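A minimal sketch of consuming those variables from the environment; the variable names are hypothetical, since .env-sample defines the authoritative list:

```
import os

# Hypothetical variable names -- check .env-sample for the real ones.
POSTGRES_USER = os.getenv("POSTGRES_USER", "user")
POSTGRES_PASSWORD = os.getenv("POSTGRES_PASSWORD", "password")
POSTGRES_DB = os.getenv("POSTGRES_DB", "olx")

# Inside docker-compose the database host is typically the service name.
DATABASE_URL = (
    f"postgresql+psycopg2://{POSTGRES_USER}:{POSTGRES_PASSWORD}"
    f"@postgresql:5432/{POSTGRES_DB}"
)
```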
#### Tasks monitoring
Access Flower / Celery tasks monitoring:
- [http://127.0.0.1:5555/tasks/](http://127.0.0.1:5555/tasks/)
### Contact
Feel free to contact: [email protected]