Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/nadahamdy217/movies-data-etl-using-python-gcp
Developed a comprehensive ETL pipeline for movie data using Python, Docker, and a GCP Pub/Sub emulator. Successfully processed and published the data in a local Docker environment, showcasing advanced data engineering skills.
- Host: GitHub
- URL: https://github.com/nadahamdy217/movies-data-etl-using-python-gcp
- Owner: nadahamdy217
- Created: 2024-08-14T19:18:03.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-08-24T19:25:36.000Z (4 months ago)
- Last Synced: 2024-12-06T08:04:22.954Z (22 days ago)
- Topics: analytics, data, data-engineering, data-ingestion, data-preparation, data-preprocessing, data-processing, data-project, docker, etl, etl-pipeline, gcp, matplotlib, matplotlib-pyplot, numpy, pandas, pubsub, python, scipy, seaborn
- Language: Python
- Homepage:
- Size: 1.17 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# ETL Movies Data Project
## **Project Overview**
Welcome to the ETL Movies Data Project! This project is a deep dive into building an end-to-end ETL (Extract, Transform, Load) pipeline using Python, Docker, and a Google Cloud Pub/Sub emulator. We're working with a dataset of movie ratings, transforming it into a format that's ready for analysis, and loading it into a Docker container for easy access and management. This project is perfect for anyone looking to simulate a real-world ETL process in a local environment.
---
## **Project Structure**
Here's how the project is organized:
### `ETL_MOVIES/`
- **`data/`** # Where all the data lives
  - **`ratings.csv`** # Movie ratings data
  - **`movies.csv`** # Information about the movies
  - **`full_data.csv`** # The preprocessed, merged data file
- **`Dockerfile`** # Our recipe for the Docker environment
- **`requirements.txt`** # All the ingredients (dependencies)
- **`README.md`** # This very guide you're reading!
- **`Scripts/`** # Scripts to automate various tasks
  - **`setup_env.bat`** # Environment setup script
  - **`download_data.bat`** # Data download script
  - **`start_emulator.py`** # Start the Pub/Sub emulator
  - **`create_topic_subscription.py`** # Create the Pub/Sub topic and subscription
  - **`publish_test_message.py`** # Test data ingestion
  - **`process_data.py`** # Extract the CSV files
  - **`preprocessing_data.py`** # Clean up the data
  - **`publish_data.py`** # Publish data to the container

Feel free to explore each part of the project to understand its role and how everything fits together. Happy coding!
## **Tools and Technologies**
- **Python**: The core language for our scripts.
- **Docker**: To containerize and run everything smoothly.
- **Google Cloud Pub/Sub**: For simulating real-time data streaming.
- **Pandas**: Our go-to for data manipulation.
- **Windows Batch Scripting**: Automating the setup and downloads.
- **Google Cloud SDK**: For interacting with Google Cloud services.

---
## **Environment Setup**
Ready to get started? Here's what you need to do:
1. **Install Required Tools:**
   - Docker Desktop
   - Python 3.12.5
   - Google Cloud SDK
2. **Clone the Repository:**
   - Use Git to clone the repository and move into the project directory.
3. **Install Python Dependencies:**
   - Create a virtual environment and install the dependencies from `requirements.txt`.

---
## **Project Steps**
Here's a step-by-step guide to get you through the project:
### **1. Setup Environment**
**Goal:** Set up the project environment with all the necessary dependencies.
**How:** Run the `setup_env.bat` script and let the automation magic happen!
---
### **2. Download Data**
**Goal:** Get the movie ratings data.
**How:** Simply run `download_data.bat`, and the data will be at your service!
---
### **3. Set Up Pub/Sub Emulator**
**Goal:** Simulate the Pub/Sub environment locally.
**How:** Kickstart the emulator with `start_emulator.py`.
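The repository's `start_emulator.py` isn't reproduced on this page; the sketch below shows one common way such a script can launch the emulator, assuming the Google Cloud SDK is installed and the emulator's default port 8085. The project id is a placeholder, not necessarily what the repo uses.

```python
# start_emulator_sketch.py - illustrative only; the repo's start_emulator.py may differ.
import subprocess

PROJECT_ID = "local-movies-etl"   # placeholder project id for the emulator
HOST_PORT = "localhost:8085"      # the emulator's default port


def start_emulator() -> subprocess.Popen:
    """Launch the Pub/Sub emulator as a child process and return its handle."""
    # On Windows you may need the full path to gcloud.cmd or shell=True.
    return subprocess.Popen(
        [
            "gcloud", "beta", "emulators", "pubsub", "start",
            f"--project={PROJECT_ID}",
            f"--host-port={HOST_PORT}",
        ]
    )


if __name__ == "__main__":
    proc = start_emulator()
    print(f"Pub/Sub emulator running on {HOST_PORT} (pid {proc.pid})")
    proc.wait()
```

Client scripts then locate the emulator through the `PUBSUB_EMULATOR_HOST` environment variable (for example `set PUBSUB_EMULATOR_HOST=localhost:8085` in a Windows shell).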
---
### **4. Build Docker Image**
**Goal:** Package everything into a Docker image.
**How:** Build the image using Docker, and watch it come to life!
---
### **5. Run Docker Container**
**Goal:** Spin up the Docker container.
**How:** Use the `docker run` command, and let the container do its thing.
---
### **6. Create Pub/Sub Topic and Subscription**
**Goal:** Create a Pub/Sub topic and subscription.
**How:** Execute `create_topic_subscription.py` and set the stage for data flow.
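For reference, here is a minimal sketch of what such a script can look like with the `google-cloud-pubsub` client library; the project, topic, and subscription names are placeholders and may not match those used in `create_topic_subscription.py`.

```python
# create_topic_subscription_sketch.py - illustrative; actual names in the repo may differ.
import os

from google.cloud import pubsub_v1

# Point the client library at the local emulator instead of the real GCP service.
os.environ.setdefault("PUBSUB_EMULATOR_HOST", "localhost:8085")

PROJECT_ID = "local-movies-etl"    # placeholder emulator project id
TOPIC_ID = "movies-topic"          # placeholder topic name
SUBSCRIPTION_ID = "movies-sub"     # placeholder subscription name

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)

# Create the topic, then attach a subscription to it.
topic = publisher.create_topic(request={"name": topic_path})
subscription = subscriber.create_subscription(
    request={"name": subscription_path, "topic": topic_path}
)

print(f"Created topic {topic.name} and subscription {subscription.name}")
```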
---
### **7. Test Data Ingestion**
**Goal:** Ensure data ingestion works smoothly.
**How:** Run `publish_test_message.py` and see the messages flow!
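A smoke test along these lines confirms that the emulator accepts messages; again, the project and topic ids below are placeholders.

```python
# publish_test_message_sketch.py - illustrative smoke test for the emulator setup.
import os

from google.cloud import pubsub_v1

os.environ.setdefault("PUBSUB_EMULATOR_HOST", "localhost:8085")

PROJECT_ID = "local-movies-etl"   # placeholder, must match the topic created earlier
TOPIC_ID = "movies-topic"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

# publish() returns a future; result() blocks until the emulator acknowledges the message.
future = publisher.publish(topic_path, data=b"hello from the movies ETL pipeline")
print(f"Published test message with id {future.result()}")
```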
---
### **8. Extract CSV Files**
**Goal:** Extract and prepare the data.
**How:** Run `process_data.py` and get your CSVs ready for action!
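Under the hood, extraction boils down to loading the raw CSVs with pandas. A minimal sketch, with file locations taken from the project structure above and column handling left generic (since `process_data.py` itself isn't shown here):

```python
# process_data_sketch.py - illustrative extraction step.
from pathlib import Path

import pandas as pd

DATA_DIR = Path("data")

# Load the raw CSV files produced by download_data.bat.
ratings = pd.read_csv(DATA_DIR / "ratings.csv")
movies = pd.read_csv(DATA_DIR / "movies.csv")

print(f"ratings: {ratings.shape[0]} rows, columns: {list(ratings.columns)}")
print(f"movies:  {movies.shape[0]} rows, columns: {list(movies.columns)}")
```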
---
### **9. Preprocess Data**
**Goal:** Clean and prep the data.
**How:** Execute `preprocessing_data.py`, and your data will be spotless! Inside this script:
- Null values are filled.
- Data types are corrected.
- The datasets are merged on `item_id` (see the sketch below).
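As a rough illustration of those three steps, here is a pandas sketch; the column names, fill values, and dtypes below are assumptions, and the real `preprocessing_data.py` may differ:

```python
# preprocessing_sketch.py - illustrative only; column names and fill values are assumptions.
import pandas as pd

ratings = pd.read_csv("data/ratings.csv")
movies = pd.read_csv("data/movies.csv")

# 1. Fill null values: numeric columns with 0, everything else with an empty string.
for df in (ratings, movies):
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(0)
    df.fillna("", inplace=True)

# 2. Correct data types (e.g. ids as integers, ratings as floats).
ratings = ratings.astype({"item_id": "int64", "rating": "float64"})
movies = movies.astype({"item_id": "int64"})

# 3. Merge the two datasets on item_id and save the combined file.
full_data = ratings.merge(movies, on="item_id", how="left")
full_data.to_csv("data/full_data.csv", index=False)
```

---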
### **10. Create Folder in Docker Container**
**Goal:** Create a place in the container for the data.
**How:** Open a shell inside the running container and create the `/data` folder.
---
### **11. Publish Data to Container**
**Goal:** Send the data into the Docker container.
**How:** Run `publish_data.py` and watch the data transfer!
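One common way to implement this step is to publish each preprocessed row as a JSON message to the emulator topic. The sketch below reuses the placeholder names from the earlier steps and assumes the emulator inside the container is reachable on `localhost:8085`; it is not a copy of `publish_data.py`.

```python
# publish_data_sketch.py - illustrative; topic and project names are placeholders.
import json
import os

import pandas as pd
from google.cloud import pubsub_v1

os.environ.setdefault("PUBSUB_EMULATOR_HOST", "localhost:8085")

PROJECT_ID = "local-movies-etl"
TOPIC_ID = "movies-topic"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

full_data = pd.read_csv("data/full_data.csv")

# Publish each preprocessed row as a JSON message and wait for confirmation.
for record in full_data.to_dict(orient="records"):
    data = json.dumps(record).encode("utf-8")
    publisher.publish(topic_path, data=data).result()

print(f"Published {len(full_data)} messages to {topic_path}")
```

Calling `.result()` after every publish keeps the sketch simple; for the full 100,000-row dataset, collecting the futures and resolving them in batches would be considerably faster.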
---
## **Understanding Pub/Sub and Its Role**
Google Cloud Pub/Sub is all about handling data in real time, and this project uses its emulator to simulate how large-scale data streaming would work in the cloud. Because the emulator runs locally, we can develop and test everything on our own machine and be ready for the real cloud when the time comes.
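To make the publish/subscribe pattern concrete, here is a small consumer sketch (not part of the repository) that listens on the hypothetical subscription from step 6 and prints whatever the pipeline publishes:

```python
# subscriber_sketch.py - not part of the repo; shows how published messages can be consumed.
import os
from concurrent.futures import TimeoutError

from google.cloud import pubsub_v1

os.environ.setdefault("PUBSUB_EMULATOR_HOST", "localhost:8085")

PROJECT_ID = "local-movies-etl"   # placeholder names, matching the earlier sketches
SUBSCRIPTION_ID = "movies-sub"

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)


def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    """Print each message delivered by the emulator and acknowledge it."""
    print(f"Received: {message.data.decode('utf-8')}")
    message.ack()


# subscribe() starts a background streaming pull; result(timeout=...) keeps it alive briefly.
streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull_future.result(timeout=30)
except TimeoutError:
    streaming_pull_future.cancel()
```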
---
## **Challenges Faced**
- **Handling Big Data:** Processing 100,000 rows was a challenge, but we conquered it!
- **Local Cloud Simulation:** Setting up the Pub/Sub emulator wasn't easy, but it was worth it.
- **Data Cleaning:** Ensuring clean and reliable data required some serious attention to detail.

---
## **Result**
![image](https://github.com/user-attachments/assets/736f8cd2-bab3-4306-8e24-a3d266961f41)
---
## **Conclusion**
This project demonstrates how to build a robust ETL pipeline end to end, complete with Dockerization and a local cloud simulation. Whether you're here to learn or to build, it has the tools and guidance you need. Happy coding!
---
## **Repository**
Find everything you need in the [GitHub repository](https://github.com/nadahamdy217/Movies-Data-ETL-using-Python-GCP/tree/main). Dive in, explore, and feel free to contribute!
---
## **Contributing**
Contributions to this project are welcome! If you'd like to contribute or have any questions, please contact:
- **Author:** Nada Hamdy Fatehy
- **Email:** [email protected]
- **LinkedIn:** [LinkedIn](https://www.linkedin.com/in/nada-hamdy-2265692a3/)
- **GitHub:** [GitHub](https://github.com/nadahamdy217)