# YTS MovieData ETL Pipeline

![Python](https://img.shields.io/badge/Python-3.8-blue)
![PostgreSQL](https://img.shields.io/badge/PostgreSQL-16.2-blue)
![Docker](https://img.shields.io/badge/Docker-20.10-blue)
![Pandas](https://img.shields.io/badge/Pandas-1.3.3-blue)
![BeautifulSoup](https://img.shields.io/badge/BeautifulSoup-4.10.0-blue)

## Overview

The **MovieData ETL Pipeline** project is a comprehensive solution for extracting, transforming, and loading movie data from web sources into a PostgreSQL database. This project is designed with scalability and efficiency in mind, leveraging Docker for containerization, Python for scripting, and PostgreSQL for data storage.

## Table of Contents

- [Project Structure](#project-structure)
- [Prerequisites](#prerequisites)
- [Setup and Installation](#setup-and-installation)
- [Usage](#usage)
- [ETL Process Details](#etl-process-details)
- [Configuration](#configuration)
- [Future Improvements](#future-improvements)
- [Contributing](#contributing)
- [License](#license)

## Project Structure

```plaintext
MovieData-ETL
β”œβ”€β”€ elt
β”‚   β”œβ”€β”€ Dockerfile
β”‚   β”œβ”€β”€ elt_pipeline.py
β”‚   └── requirements.txt
β”œβ”€β”€ source_db
β”‚   β”œβ”€β”€ Dockerfile
β”‚   β”œβ”€β”€ create_movies.sql
β”‚   └── insert_movies.sql
β”œβ”€β”€ worker.py
β”œβ”€β”€ output.csv
β”œβ”€β”€ output.xlsx
β”œβ”€β”€ report.md
β”œβ”€β”€ yts.txt
β”œβ”€β”€ docker-compose.yaml
└── README.md
```

### `elt` Directory

- **Dockerfile**: Defines the environment for running the ETL script.
- **elt_pipeline.py**: Contains the ETL logic to load data into PostgreSQL.
- **requirements.txt**: Lists the Python dependencies.

### `source_db` Directory

- **Dockerfile**: Defines the environment for the PostgreSQL database.
- **create_movies.sql**: SQL script to create the movies table.
- **insert_movies.sql**: SQL script to insert movie data.

### Root Directory

- **worker.py**: Script to scrape movie data from the web and generate SQL insert statements.
- **output.csv**: CSV file containing scraped movie data.
- **output.xlsx**: Excel file containing scraped movie data.
- **report.md**: Markdown file for logging the ETL process.
- **yts.txt**: Raw HTML content of the scraped web page.
- **docker-compose.yaml**: Docker Compose configuration to orchestrate the ETL pipeline.

## Prerequisites

Ensure you have the following installed:

- [Docker](https://docs.docker.com/get-docker/): For containerization.
- [Docker Compose](https://docs.docker.com/compose/install/): To manage multi-container applications.
- [Python 3.8](https://www.python.org/downloads/): For running the worker script.

## Setup and Installation

1. **Clone the Repository:**

```bash
git clone https://github.com/TheODDYSEY/YTS-Pipeline-Postgres
cd YTS-Pipeline-Postgres
```

2. **Build and Run the Docker Containers:**

```bash
docker-compose up --build
```

This command builds the Docker images and starts the containers for both the PostgreSQL database and the ETL script.

## Usage

1. **Scrape Movie Data:**

The `worker.py` script scrapes the latest movie data from YTS and generates SQL insert statements.

```bash
python worker.py
```

This will generate `output.csv`, `output.xlsx`, and `insert_movies.sql` files with the scraped movie data.

2. **Run the ETL Pipeline:**

The `elt_pipeline.py` script will wait for the PostgreSQL database to be ready, then load the movie data from the SQL script into the database.

```bash
docker-compose up elt_script
```

## ETL Process Details

### Extract

The extraction step involves scraping movie data from the [YTS website](https://yts.mx/browse-movies/0/all/all/0/featured/0/all) using the `worker.py` script. BeautifulSoup is used to parse the HTML and extract relevant movie information.
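
A minimal sketch of what this step might look like; the CSS selectors below are assumptions about the YTS markup, not code taken from `worker.py`:

```python
import requests
from bs4 import BeautifulSoup

URL = "https://yts.mx/browse-movies/0/all/all/0/featured/0/all"

def scrape_movies(url=URL):
    """Fetch the YTS browse page and return a list of movie dicts."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    movies = []
    # Selector names are assumptions about the page structure.
    for card in soup.select("div.browse-movie-wrap"):
        title = card.select_one("a.browse-movie-title")
        year = card.select_one("div.browse-movie-year")
        movies.append({
            "title": title.get_text(strip=True) if title else None,
            "year": year.get_text(strip=True) if year else None,
        })
    return movies
```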

### Transform

The extracted data is transformed into a structured format using Pandas. The data is saved into CSV and Excel files for reference and further processing.
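
A sketch of the transform step under the same assumptions; the column names follow the extraction example above, and writing `.xlsx` files requires `openpyxl`:

```python
import pandas as pd

def transform_and_save(movies):
    """Structure scraped records and persist them for reference."""
    df = pd.DataFrame(movies)

    # Basic cleanup: drop rows without a title, coerce year to numeric.
    df = df.dropna(subset=["title"])
    df["year"] = pd.to_numeric(df["year"], errors="coerce")

    df.to_csv("output.csv", index=False)
    df.to_excel("output.xlsx", index=False)
    return df
```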

### Load

The transformed data is loaded into the PostgreSQL database. The `elt_pipeline.py` script uses the `psycopg2` library to connect to the database and load the data using SQL insert statements generated by the `worker.py` script.
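
A sketch of the load step; the connection values mirror the compose snippet in [Configuration](#configuration), and the host name assumes the database service is called `movies_postgres`:

```python
import psycopg2

def load_sql_script(path="source_db/insert_movies.sql"):
    """Execute the generated SQL script against PostgreSQL."""
    conn = psycopg2.connect(
        host="movies_postgres",  # compose service name (assumption)
        dbname="movies_db",
        user="postgres",
        password="secret",
    )
    try:
        # psycopg2 accepts multiple semicolon-separated statements.
        with conn, conn.cursor() as cur:
            with open(path) as f:
                cur.execute(f.read())
    finally:
        conn.close()
```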

### Script Breakdown

#### `worker.py`

- **Step 1**: Sends a GET request to the specified URL to fetch movie data.
- **Step 2**: Parses the HTML content using BeautifulSoup.
- **Step 3**: Extracts movie details and stores them in a list.
- **Step 4**: Creates a Pandas DataFrame from the extracted data.
- **Step 5**: Saves the DataFrame to CSV and Excel files.
- **Step 6**: Generates SQL insert statements and saves them to a `.sql` file (see the sketch below).
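
A sketch of Step 6; the `movies` table and its columns are assumptions standing in for whatever `create_movies.sql` actually defines:

```python
def write_insert_script(df, path="source_db/insert_movies.sql"):
    """Turn a DataFrame of scraped rows into an INSERT script."""
    def quote(value):
        # NULL for missing values; double single quotes for SQL safety.
        if value is None or value != value:
            return "NULL"
        return "'" + str(value).replace("'", "''") + "'"

    with open(path, "w") as f:
        for row in df.itertuples(index=False):
            f.write(
                "INSERT INTO movies (title, year) VALUES "
                f"({quote(row.title)}, {quote(row.year)});\n"
            )
```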

#### `elt_pipeline.py`

- **Database Connection**: Connects to the PostgreSQL database using `psycopg2`.
- **Data Loading**: Loads the movie data from the SQL script into the database.
- **Retry Logic**: Retries the database connection to ride out transient startup failures (sketched below).
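
A sketch of that retry logic; the attempt count and delay are illustrative defaults, not values from the script:

```python
import time
import psycopg2

def connect_with_retry(params, attempts=10, delay=3):
    """Poll PostgreSQL until it accepts connections, then return one."""
    for attempt in range(1, attempts + 1):
        try:
            return psycopg2.connect(**params)
        except psycopg2.OperationalError as exc:
            print(f"Attempt {attempt}/{attempts} failed: {exc}")
            if attempt == attempts:
                raise
            time.sleep(delay)
```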

## Configuration

The PostgreSQL database configuration and credentials are set in the `docker-compose.yaml` file and passed as environment variables to the Docker containers. Ensure the following values are correctly set:

```yaml
services:
  movies_postgres:
    environment:
      POSTGRES_DB: movies_db
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: secret
```
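
A sketch of how the pipeline script might consume these settings; the `POSTGRES_HOST` variable and its default are assumptions, while the rest match the snippet above:

```python
import os

# Connection parameters assembled from the compose environment.
DB_PARAMS = {
    "host": os.environ.get("POSTGRES_HOST", "movies_postgres"),
    "dbname": os.environ["POSTGRES_DB"],
    "user": os.environ["POSTGRES_USER"],
    "password": os.environ["POSTGRES_PASSWORD"],
}
```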

## Future Improvements

- **Automated Scheduling**: Integrate a scheduler like `cron` or `Airflow` to automate the ETL process at regular intervals.
- **Data Validation**: Implement data validation and cleaning steps to ensure data quality.
- **Enhanced Logging**: Add more detailed logging to track the ETL process and handle errors more effectively.

## Contributing

Contributions are welcome! Please read the [contributing guidelines](CONTRIBUTING.md) for more details on how to get started.

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

---

For any inquiries or support, please open an issue on our [GitHub repository](https://github.com/TheODDYSEY/YTS-Pipeline-Postgres/issues).