https://github.com/aa-nadim/trip-crawler
This project is designed for web scraping and managing data using Docker containers. It includes services for a PostgreSQL database, a web scraper, pgAdmin, and a testing setup.
- Host: GitHub
- URL: https://github.com/aa-nadim/trip-crawler
- Owner: aa-nadim
- Created: 2024-12-15T10:26:21.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-12-28T20:40:58.000Z (5 months ago)
- Last Synced: 2024-12-28T21:18:38.364Z (5 months ago)
- Topics: scrapy, scrapy-crawler, scrapy-spider
- Language: Python
- Homepage:
- Size: 30.3 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Trip Crawler
This project is designed for web scraping and managing data using Docker containers. It includes services for a PostgreSQL database, a web scraper, pgAdmin, and a testing setup.
---
## Prerequisites
Make sure the following software is installed on your system:
- Python 3.x
- PostgreSQL installed and running
- Virtual environment
- [Docker](https://www.docker.com/)
- [Docker Compose](https://docs.docker.com/compose/)

---
## Setup Instructions
0. Make sure you have the correct environment variables set in the `.env` and `config.py` files.
```bash
# config.py
DB_USERNAME='your_username'
DB_PASSWORD='your_password'
DB_HOST='localhost'
DB_PORT='PORT'
DB_NAME='DATABASE'
```

```bash
# .env
DB_USERNAME='your_username'
DB_PASSWORD='your_password'
DB_NAME='DATABASE'
TEST_DB_NAME='test_db'
DB_PORT='PORT'
DB_HOST='postgres'
```
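For reference, here is a minimal sketch of how `config.py` might read these values, assuming the project uses `python-dotenv`; the exact loading code in the repository may differ, and the defaults below are placeholders:

```python
# A minimal sketch of config.py, assuming python-dotenv is installed.
# Variable names match the .env sample above; defaults are placeholders.
import os
from dotenv import load_dotenv

load_dotenv()  # read key=value pairs from .env into the process environment

DB_USERNAME = os.getenv("DB_USERNAME", "your_username")
DB_PASSWORD = os.getenv("DB_PASSWORD", "your_password")
DB_HOST = os.getenv("DB_HOST", "localhost")
DB_PORT = os.getenv("DB_PORT", "5432")
DB_NAME = os.getenv("DB_NAME", "scraping_db")

# A PostgreSQL connection URL assembled from the settings above
DATABASE_URL = f"postgresql://{DB_USERNAME}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/{DB_NAME}"
```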
1. Clone the repository:
```bash
git clone https://github.com/aa-nadim/trip-crawler.git
cd trip-crawler
```

2. Create a virtual environment and activate it:
```bash
python3 -m venv .venv
source .venv/bin/activate # On Windows use `source .venv/Scripts/activate`
```

3. Install the required Python packages:
```bash
pip install -r requirements.txt
```

4. Stop and remove all running containers, networks, and images (clean up):
```bash
docker stop $(docker ps -q)        # stop all running containers by ID
docker rm $(docker ps -aq)         # remove all containers, running or stopped
docker image prune                 # remove dangling images (layers not used by any container)
docker image prune -a              # remove all unused images, dangling and unreferenced
docker volume prune                # remove volumes not attached to any container
docker network prune               # remove networks not used by any container
docker system prune -a --volumes   # remove all unused containers, networks, images, and volumes
```

5. Build and start the Docker containers:
i. Build the `Scrapy`, `Postgres`, `Tests` and `Pgadmin` Docker images:
```bash
docker-compose up -d --build
```
`Note: Portainer will be available at https://localhost:9443`

6. Run the Scrapy spider: Running `docker-compose up -d --build` in the `trip_crawler` directory starts the Scrapy spider automatically, and the container stops after scraping completes. To run the spider manually, use the following command:
```bash
docker-compose run scrapy crawl tripCrawler
```
Or, you can use Portainer to run the Scrapy spider: open the `Containers` tab, select the `scraper_Container` container, and click `Start`. You can see the scraped data in the `hotels` table of the `scraping_db` database via pgAdmin at `http://localhost:5050/browser/`.
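For orientation, here is a minimal sketch of what the `tripCrawler` spider might look like. Only the spider name and the `hotels`/`scraping_db` destination come from this README; the start URL, CSS selectors, and item fields are hypothetical placeholders:

```python
# trip_crawler/spiders/trip_spider.py -- a hypothetical sketch, not the
# repository's actual spider. Only the name "tripCrawler" is taken from
# this README; the target site and selectors are placeholders.
import scrapy


class TripCrawlerSpider(scrapy.Spider):
    name = "tripCrawler"
    start_urls = ["https://example.com/hotels"]  # placeholder URL

    def parse(self, response):
        # Yield one dict per hotel card; an item pipeline would then
        # insert each record into the `hotels` table in `scraping_db`.
        for hotel in response.css("div.hotel-card"):
            yield {
                "name": hotel.css("h2::text").get(),
                "price": hotel.css("span.price::text").get(),
                "rating": hotel.css("span.rating::text").get(),
            }
```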
7. Run the tests: Running `docker-compose up -d --build` also starts the test container automatically, and it stops after the tests finish. To run the tests manually, use the following command:
```bash
docker-compose run tests pytest --cov=trip_crawler tests/ --cov-report=html
```
Or, you can use Portainer to run the tests: open the `Containers` tab, select the `tests_Container` container, and click `Start`. The coverage report is written to `htmlcov/index.html`.
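As an illustration, here is a minimal sketch of the kind of test the suite might contain, assuming `pytest` and `psycopg2` are available. The table name `hotels` and the environment variable names come from this README; the test itself is a placeholder, not the repository's actual suite:

```python
# tests/test_hotels.py -- a hypothetical sketch, not the repository's
# actual tests. Env variable names match the .env sample above.
import os

import psycopg2
import pytest


@pytest.fixture
def connection():
    # Connect to the dedicated test database (TEST_DB_NAME in .env)
    conn = psycopg2.connect(
        dbname=os.getenv("TEST_DB_NAME", "test_db"),
        user=os.getenv("DB_USERNAME"),
        password=os.getenv("DB_PASSWORD"),
        host=os.getenv("DB_HOST", "postgres"),
        port=os.getenv("DB_PORT", "5432"),
    )
    yield conn
    conn.close()


def test_hotels_table_exists(connection):
    # The spider's pipeline is expected to create and populate `hotels`.
    with connection.cursor() as cur:
        cur.execute("SELECT to_regclass('public.hotels')")
        assert cur.fetchone()[0] == "hotels"
```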