https://github.com/thevinh-ha-1710/big-data-pipeline-design
This project builds a data pipeline implementing the ETL process.
- Host: GitHub
- URL: https://github.com/thevinh-ha-1710/big-data-pipeline-design
- Owner: TheVinh-Ha-1710
- Created: 2025-02-24T11:53:24.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-02-24T12:14:29.000Z (9 months ago)
- Last Synced: 2025-06-09T00:06:37.914Z (6 months ago)
- Topics: big-data, etl-pipeline, json, mapreduce-python, mongodb-database
- Language: Python
- Homepage:
- Size: 738 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Big Data ETL Pipelines Design
## Description
This project designs data pipelines to extract data on famous musical albums, transform it using the MapReduce technique for meaningful partitioning, and load it into text-based databases. The final output includes a report on the pipeline design.
## Features
- Data Extraction: Connects to MongoDB using pymongo to extract data on famous musical albums.
- Data Transformation: Utilizes the mrjob framework to perform MapReduce, partitioning data into meaningful datasets such as annual top sales and all-time best sellers.
- Data Storage: Loads the transformed data into text-based databases for further analysis.
- Pipeline Report: Generates a report detailing the pipeline designs, key features, and potential improvements.
## Technologies Used
- Python: Core language for data extraction, transformation, and processing.
- MongoDB: NoSQL database for storing and retrieving album data.
- pymongo: Python library for connecting to MongoDB and extracting data.
- mrjob: Framework for running MapReduce jobs in Python.
- json: Python standard-library module for parsing and serializing JSON.
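The map/shuffle/reduce pattern that mrjob automates can be illustrated in plain Python. This is a sketch of the technique, not the repository's actual job, and the field names `year`, `title`, and `sales` are assumptions:

```python
from collections import defaultdict


def mapper(album):
    # Emit (year, (sales, title)) so that albums are grouped by release year.
    yield album["year"], (album["sales"], album["title"])


def reducer(year, values):
    # Keep only the best-selling album for each year.
    yield year, max(values)


def map_reduce(records, mapper, reducer):
    # Shuffle: collect every mapped value under its key, then reduce per key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    result = {}
    for key, values in groups.items():
        for out_key, out_value in reducer(key, values):
            result[out_key] = out_value
    return result
```

In mrjob, the same `mapper`/`reducer` pair would be methods on an `MRJob` subclass, and the framework handles the shuffle and the text-based input/output.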
## Installation & Setup
### Prerequisites
- Python 3.x installed
- Jupyter Notebook or a Python IDE (VS Code, PyCharm, etc.)
- Virtual environment (optional but recommended)
### Setup
1. Clone the repository:
```sh
git clone https://github.com/TheVinh-Ha-1710/Big-Data-Pipeline-Design.git
cd Big-Data-Pipeline-Design
```
2. Create and activate a virtual environment (optional but recommended):
```sh
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
```
3. Install dependencies:
```sh
pip install -r requirements.txt
```
4. Make the pipeline script executable and run it:
```sh
chmod +x run_pipelines.sh
./run_pipelines.sh
```
## Folder Structure
```
📂 Big-Data-Pipeline-Design
├── 📂 databases # Output datasets
├── 📂 pipelines # Pipeline scripts
├── 📜 README.md # Project document
├── 📜 Report.pdf # PDF Report
├── 📜 Song.json # The original dataset
├── 📜 requirements.txt # Required frameworks
├── 📜 run_pipelines.sh # Shell script to run the pipeline
```