# YTS MovieData ETL Pipeline

![Python](https://img.shields.io/badge/Python-3.8-blue)
![PostgreSQL](https://img.shields.io/badge/PostgreSQL-16.2-blue)
![Docker](https://img.shields.io/badge/Docker-20.10-blue)
![Pandas](https://img.shields.io/badge/Pandas-1.3.3-blue)
![BeautifulSoup](https://img.shields.io/badge/BeautifulSoup-4.10.0-blue)

## Overview

The **MovieData ETL Pipeline** project is a comprehensive solution for extracting, transforming, and loading movie data from web sources into a PostgreSQL database. This project is designed with scalability and efficiency in mind, leveraging Docker for containerization, Python for scripting, and PostgreSQL for data storage.

## Table of Contents

- [Project Structure](#project-structure)
- [Prerequisites](#prerequisites)
- [Setup and Installation](#setup-and-installation)
- [Usage](#usage)
- [ETL Process Details](#etl-process-details)
- [Configuration](#configuration)
- [Future Improvements](#future-improvements)
- [Contributing](#contributing)
- [License](#license)

## Project Structure

```plaintext
MovieData-ETL
β”œβ”€β”€ elt
β”‚   β”œβ”€β”€ Dockerfile
β”‚   β”œβ”€β”€ elt_pipeline.py
β”‚   └── requirements.txt
β”œβ”€β”€ source_db
β”‚   β”œβ”€β”€ Dockerfile
β”‚   β”œβ”€β”€ create_movies.sql
β”‚   └── insert_movies.sql
β”œβ”€β”€ worker.py
β”œβ”€β”€ output.csv
β”œβ”€β”€ output.xlsx
β”œβ”€β”€ report.md
β”œβ”€β”€ yts.txt
β”œβ”€β”€ docker-compose.yaml
└── README.md
```

### `elt` Directory

- **Dockerfile**: Defines the environment for running the ETL script.
- **elt_pipeline.py**: Contains the ETL logic to load data into PostgreSQL.
- **requirements.txt**: Lists the Python dependencies.

### `source_db` Directory

- **Dockerfile**: Defines the environment for the PostgreSQL database.
- **create_movies.sql**: SQL script to create the movies table.
- **insert_movies.sql**: SQL script to insert movie data.

### Root Directory

- **worker.py**: Script to scrape movie data from the web and generate SQL insert statements.
- **output.csv**: CSV file containing scraped movie data.
- **output.xlsx**: Excel file containing scraped movie data.
- **report.md**: Markdown file for logging the ETL process.
- **yts.txt**: Raw HTML content of the scraped web page.
- **docker-compose.yaml**: Docker Compose configuration to orchestrate the ETL pipeline.

## Prerequisites

Ensure you have the following installed:

- [Docker](https://docs.docker.com/get-docker/): For containerization.
- [Docker Compose](https://docs.docker.com/compose/install/): To manage multi-container applications.
- [Python 3.8](https://www.python.org/downloads/): For running the worker script.

## Setup and Installation

1. **Clone the Repository:**

```bash
git clone https://github.com/TheODDYSEY/YTS-Pipeline-Postgres
cd YTS-Pipeline-Postgres
```

2. **Build and Run the Docker Containers:**

```bash
docker-compose up --build
```

This command builds the Docker images and starts the containers for both the PostgreSQL database and the ETL script.

## Usage

1. **Scrape Movie Data:**

The `worker.py` script scrapes the latest movie data from YTS and generates SQL insert statements.

```bash
python worker.py
```

This will generate `output.csv`, `output.xlsx`, and `insert_movies.sql` files with the scraped movie data.

2. **Run the ETL Pipeline:**

The `elt_pipeline.py` script will wait for the PostgreSQL database to be ready, then load the movie data from the SQL script into the database.

```bash
docker-compose up elt_script
```

## ETL Process Details

### Extract

The extraction step involves scraping movie data from the [YTS website](https://yts.mx/browse-movies/0/all/all/0/featured/0/all) using the `worker.py` script. BeautifulSoup is used to parse the HTML and extract relevant movie information.
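
A minimal sketch of what this step might look like; the CSS selectors below are assumptions about the YTS markup, not code taken from `worker.py`:

```python
import requests
from bs4 import BeautifulSoup

URL = "https://yts.mx/browse-movies/0/all/all/0/featured/0/all"

def scrape_movies(url=URL):
    """Fetch the YTS browse page and return a list of movie dicts."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    movies = []
    # Selector names are assumptions about the page structure.
    for card in soup.select("div.browse-movie-wrap"):
        title = card.select_one("a.browse-movie-title")
        year = card.select_one("div.browse-movie-year")
        movies.append({
            "title": title.get_text(strip=True) if title else None,
            "year": year.get_text(strip=True) if year else None,
        })
    return movies
```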

### Transform

The extracted data is transformed into a structured format using Pandas. The data is saved into CSV and Excel files for reference and further processing.
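
A sketch of the transform step under the same assumptions; the column names follow the extraction example above, and writing `.xlsx` files requires `openpyxl`:

```python
import pandas as pd

def transform_and_save(movies):
    """Structure scraped records and persist them for reference."""
    df = pd.DataFrame(movies)

    # Basic cleanup: drop rows without a title, coerce year to numeric.
    df = df.dropna(subset=["title"])
    df["year"] = pd.to_numeric(df["year"], errors="coerce")

    df.to_csv("output.csv", index=False)
    df.to_excel("output.xlsx", index=False)
    return df
```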

### Load

The transformed data is loaded into the PostgreSQL database. The `elt_pipeline.py` script uses the `psycopg2` library to connect to the database and load the data using SQL insert statements generated by the `worker.py` script.
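
A sketch of the load step; the connection values mirror the compose snippet in [Configuration](#configuration), and the host name assumes the database service is called `movies_postgres`:

```python
import psycopg2

def load_sql_script(path="source_db/insert_movies.sql"):
    """Execute the generated SQL script against PostgreSQL."""
    conn = psycopg2.connect(
        host="movies_postgres",  # compose service name (assumption)
        dbname="movies_db",
        user="postgres",
        password="secret",
    )
    try:
        # psycopg2 accepts multiple semicolon-separated statements.
        with conn, conn.cursor() as cur:
            with open(path) as f:
                cur.execute(f.read())
    finally:
        conn.close()
```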

### Script Breakdown

#### `worker.py`

- **Step 1**: Sends a GET request to the specified URL to fetch movie data.
- **Step 2**: Parses the HTML content using BeautifulSoup.
- **Step 3**: Extracts movie details and stores them in a list.
- **Step 4**: Creates a Pandas DataFrame from the extracted data.
- **Step 5**: Saves the DataFrame to CSV and Excel files.
- **Step 6**: Generates SQL insert statements and saves them to a `.sql` file (see the sketch below).
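
A sketch of Step 6; the `movies` table and its columns are assumptions standing in for whatever `create_movies.sql` actually defines:

```python
def write_insert_script(df, path="source_db/insert_movies.sql"):
    """Turn a DataFrame of scraped rows into an INSERT script."""
    def quote(value):
        # NULL for missing values; double single quotes for SQL safety.
        if value is None or value != value:
            return "NULL"
        return "'" + str(value).replace("'", "''") + "'"

    with open(path, "w") as f:
        for row in df.itertuples(index=False):
            f.write(
                "INSERT INTO movies (title, year) VALUES "
                f"({quote(row.title)}, {quote(row.year)});\n"
            )
```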

#### `elt_pipeline.py`

- **Database Connection**: Connects to the PostgreSQL database using `psycopg2`.
- **Data Loading**: Loads the movie data from the SQL script into the database.
- **Retry Logic**: Retries the database connection to ride out transient startup failures (sketched below).
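
A sketch of that retry logic; the attempt count and delay are illustrative defaults, not values from the script:

```python
import time
import psycopg2

def connect_with_retry(params, attempts=10, delay=3):
    """Poll PostgreSQL until it accepts connections, then return one."""
    for attempt in range(1, attempts + 1):
        try:
            return psycopg2.connect(**params)
        except psycopg2.OperationalError as exc:
            print(f"Attempt {attempt}/{attempts} failed: {exc}")
            if attempt == attempts:
                raise
            time.sleep(delay)
```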

## Configuration

The PostgreSQL database configuration and credentials are set in the `docker-compose.yaml` file and passed as environment variables to the Docker containers. Ensure the following values are correctly set:

```yaml
services:
  movies_postgres:
    environment:
      POSTGRES_DB: movies_db
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: secret
```
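
A sketch of how the pipeline script might consume these settings; the `POSTGRES_HOST` variable and its default are assumptions, while the rest match the snippet above:

```python
import os

# Connection parameters assembled from the compose environment.
DB_PARAMS = {
    "host": os.environ.get("POSTGRES_HOST", "movies_postgres"),
    "dbname": os.environ["POSTGRES_DB"],
    "user": os.environ["POSTGRES_USER"],
    "password": os.environ["POSTGRES_PASSWORD"],
}
```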

## Future Improvements

- **Automated Scheduling**: Integrate a scheduler like `cron` or `Airflow` to automate the ETL process at regular intervals.
- **Data Validation**: Implement data validation and cleaning steps to ensure data quality.
- **Enhanced Logging**: Add more detailed logging to track the ETL process and handle errors more effectively.

## Contributing

Contributions are welcome! Please read the [contributing guidelines](CONTRIBUTING.md) for more details on how to get started.

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

---

For any inquiries or support, please open an issue on our [GitHub repository](https://github.com/TheODDYSEY/YTS-Pipeline-Postgres/issues).