Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/theoddysey/yts-pipeline-postgres
ETL pipeline for scraping, transforming, and loading YTS movie data into a PostgreSQL container using Docker
- Host: GitHub
- URL: https://github.com/theoddysey/yts-pipeline-postgres
- Owner: TheODDYSEY
- Created: 2024-05-22T11:23:26.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-05-28T18:03:46.000Z (8 months ago)
- Last Synced: 2024-05-29T10:14:31.897Z (8 months ago)
- Topics: bs4-requests, docker, docker-compose, etl-pipeline, pandas, pipeline-dock-tech, postresql, python3, scraping-web, yaml
- Language: Python
- Homepage:
- Size: 66.4 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
Awesome Lists containing this project
README
# YTS MovieData ETL Pipeline
![Python](https://img.shields.io/badge/Python-3.8-blue)
![PostgreSQL](https://img.shields.io/badge/PostgreSQL-16.2-blue)
![Docker](https://img.shields.io/badge/Docker-20.10-blue)
![Pandas](https://img.shields.io/badge/Pandas-1.3.3-blue)
![BeautifulSoup](https://img.shields.io/badge/BeautifulSoup-4.10.0-blue)

## Overview
The **MovieData ETL Pipeline** project is a comprehensive solution for extracting, transforming, and loading movie data from web sources into a PostgreSQL database. This project is designed with scalability and efficiency in mind, leveraging Docker for containerization, Python for scripting, and PostgreSQL for data storage.
## Project Workflow
## Table of Contents
- [Project Structure](#project-structure)
- [Prerequisites](#prerequisites)
- [Setup and Installation](#setup-and-installation)
- [Usage](#usage)
- [ETL Process Details](#etl-process-details)
- [Configuration](#configuration)
- [Future Improvements](#future-improvements)
- [Contributing](#contributing)
- [License](#license)

## Project Structure
```plaintext
MovieData-ETL
├── elt
│   ├── Dockerfile
│   ├── elt_pipeline.py
│   └── requirements.txt
├── source_db
│   ├── Dockerfile
│   ├── create_movies.sql
│   └── insert_movies.sql
├── worker.py
├── output.csv
├── output.xlsx
├── report.md
├── yts.txt
├── docker-compose.yaml
└── README.md
```

### `elt` Directory
- **Dockerfile**: Defines the environment for running the ETL script.
- **elt_pipeline.py**: Contains the ETL logic to load data into PostgreSQL.
- **requirements.txt**: Lists the Python dependencies.

### `source_db` Directory
- **Dockerfile**: Defines the environment for the PostgreSQL database.
- **create_movies.sql**: SQL script to create the movies table.
- **insert_movies.sql**: SQL script to insert movie data.

### Root Directory
- **worker.py**: Script to scrape movie data from the web and generate SQL insert statements.
- **output.csv**: CSV file containing scraped movie data.
- **output.xlsx**: Excel file containing scraped movie data.
- **report.md**: Markdown file for logging the ETL process.
- **yts.txt**: Raw HTML content of the scraped web page.
- **docker-compose.yaml**: Docker Compose configuration to orchestrate the ETL pipeline.

## Prerequisites
Ensure you have the following installed:
- [Docker](https://docs.docker.com/get-docker/): For containerization.
- [Docker Compose](https://docs.docker.com/compose/install/): To manage multi-container applications.
- [Python 3.8](https://www.python.org/downloads/): For running the worker script.

## Setup and Installation
1. **Clone the Repository:**
```bash
git clone https://github.com/TheODDYSEY/YTS-Pipeline-Postgres
cd MovieData-ETL
```

2. **Build and Run the Docker Containers:**
```bash
docker-compose up --build
```

   This command builds the Docker images and starts the containers for both the PostgreSQL database and the ETL script.
## Usage
1. **Scrape Movie Data:**
The `worker.py` script scrapes the latest movie data from YTS and generates SQL insert statements.
```bash
python worker.py
```

   This will generate `output.csv`, `output.xlsx`, and `insert_movies.sql` files with the scraped movie data.
2. **Run the ETL Pipeline:**
The `elt_pipeline.py` script will wait for the PostgreSQL database to be ready, then load the movie data from the SQL script into the database.
```bash
docker-compose up elt_script
```

## ETL Process Details
### Extract
The extraction step involves scraping movie data from the [YTS website](https://yts.mx/browse-movies/0/all/all/0/featured/0/all) using the `worker.py` script. BeautifulSoup is used to parse the HTML and extract relevant movie information.
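The parsing described above can be sketched as follows. The HTML snippet and the CSS class names (`browse-movie-wrap`, `browse-movie-title`, `browse-movie-year`) are illustrative assumptions for this sketch, not necessarily the exact markup of yts.mx or the selectors `worker.py` uses:

```python
# Sketch of the extraction step: pull movie titles and years out of
# YTS-style HTML with BeautifulSoup. The snippet and class names below
# are assumptions for illustration only.
from bs4 import BeautifulSoup

html = """
<div class="browse-movie-wrap">
  <a class="browse-movie-title" href="/movies/example">Example Movie</a>
  <div class="browse-movie-year">2024</div>
</div>
<div class="browse-movie-wrap">
  <a class="browse-movie-title" href="/movies/another">Another Movie</a>
  <div class="browse-movie-year">2023</div>
</div>
"""

def extract_movies(page_html):
    """Return a list of {title, year} dicts parsed from the page."""
    soup = BeautifulSoup(page_html, "html.parser")
    movies = []
    for wrap in soup.select("div.browse-movie-wrap"):
        title = wrap.select_one("a.browse-movie-title")
        year = wrap.select_one("div.browse-movie-year")
        if title and year:
            movies.append({"title": title.get_text(strip=True),
                           "year": year.get_text(strip=True)})
    return movies

print(extract_movies(html))
```

In the real pipeline the HTML would come from a `requests.get(...)` call against the browse page rather than an inline string.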
### Transform
The extracted data is transformed into a structured format using Pandas. The data is saved into CSV and Excel files for reference and further processing.
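A minimal sketch of this transform step, assuming scraped records shaped like the extract output (the column names and dtypes are illustrative, not the project's actual schema):

```python
# Sketch of the transform step: build a pandas DataFrame from scraped
# records, normalize types, and serialize to CSV. Column names are
# illustrative assumptions.
import pandas as pd

records = [
    {"title": "Example Movie", "year": "2024", "rating": "7.1"},
    {"title": "Another Movie", "year": "2023", "rating": "6.4"},
]

df = pd.DataFrame(records)
# Cast strings to proper types so downstream SQL inserts get clean values.
df["year"] = df["year"].astype(int)
df["rating"] = df["rating"].astype(float)

df.to_csv("output.csv", index=False)
# df.to_excel("output.xlsx", index=False)  # same idea; needs the openpyxl package
```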
### Load
The transformed data is loaded into the PostgreSQL database. The `elt_pipeline.py` script uses the `psycopg2` library to connect to the database and load the data using SQL insert statements generated by the `worker.py` script.
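The load step, together with the retry logic described later in the script breakdown, can be sketched as below. The helper names, connection parameters, and file path are assumptions for illustration; the real `elt_pipeline.py` may structure this differently:

```python
# Sketch of the load step: retry a connection factory until the
# database container is reachable, then run the generated SQL script.
# Function names and parameters are illustrative assumptions.
import time

def connect_with_retry(connect, retries=10, delay=3):
    """Call connect() until it succeeds or retries are exhausted."""
    last_error = None
    for _ in range(retries):
        try:
            return connect()
        except Exception as exc:  # e.g. psycopg2.OperationalError
            last_error = exc
            time.sleep(delay)
    raise last_error

def load_movies(sql_path="source_db/insert_movies.sql"):
    """Connect to Postgres and execute the insert script."""
    import psycopg2  # deferred so the sketch runs without the driver installed
    conn = connect_with_retry(lambda: psycopg2.connect(
        host="movies_postgres", dbname="movies_db",
        user="postgres", password="secret"))
    with conn, conn.cursor() as cur:
        with open(sql_path) as f:
            cur.execute(f.read())
```

Retrying the connection matters because Docker Compose starts both containers at once, and Postgres typically takes a few seconds to accept connections.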
### Script Breakdown
#### `worker.py`
- **Step 1**: Sends a GET request to the specified URL to fetch movie data.
- **Step 2**: Parses the HTML content using BeautifulSoup.
- **Step 3**: Extracts movie details and stores them in a list.
- **Step 4**: Creates a Pandas DataFrame from the extracted data.
- **Step 5**: Saves the DataFrame to CSV and Excel files.
- **Step 6**: Generates SQL insert statements and saves them to a `.sql` file.

#### `elt_pipeline.py`
- **Database Connection**: Connects to the PostgreSQL database using `psycopg2`.
- **Data Loading**: Loads the movie data from the SQL script into the database.
- **Retry Logic**: Implements retry logic to handle transient connection issues.

## Configuration
The PostgreSQL database configuration and credentials are set in the `docker-compose.yaml` file and passed as environment variables to the Docker containers. Ensure the following values are correctly set:
```yaml
services:
  movies_postgres:
    environment:
      POSTGRES_DB: movies_db
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: secret
```

## Future Improvements
- **Automated Scheduling**: Integrate a scheduler like `cron` or `Airflow` to automate the ETL process at regular intervals.
- **Data Validation**: Implement data validation and cleaning steps to ensure data quality.
- **Enhanced Logging**: Add more detailed logging to track the ETL process and handle errors more effectively.

## Contributing
Contributions are welcome! Please read the [contributing guidelines](CONTRIBUTING.md) for more details on how to get started.
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
---
For any inquiries or support, please open an issue on our [GitHub repository](https://github.com/TheODDYSEY/YTS-Pipeline-Postgres/issues).