https://github.com/dadananjesha/redshift-etl-project
The project covers the complete data pipeline: importing data from an RDS source into HDFS using Sqoop, processing the data with Spark, and executing analytical queries on an AWS Redshift cluster.
- Host: GitHub
- URL: https://github.com/dadananjesha/redshift-etl-project
- Owner: DadaNanjesha
- License: mit
- Created: 2025-03-08T19:15:33.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-10T23:26:43.000Z (about 1 year ago)
- Last Synced: 2025-03-11T00:24:46.932Z (about 1 year ago)
- Topics: apache-spark, aws, data-engineering-etl-assignment, data-ingestion, data-pipeline, etl-processes, hdfs, rds, redshift, spark, sqoop
- Language: Jupyter Notebook
- Homepage:
- Size: 833 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Redshift-ETL-Project
**Data Engineering ETL Project** demonstrates data ingestion, ETL processing, and analytical querying using AWS Redshift, Apache Spark, and Sqoop. It covers the complete data pipeline: importing data from an RDS source into HDFS with Sqoop, processing the data with Spark, and executing analytical queries on an AWS Redshift cluster.
---
## Overview
This project is designed to showcase a real-world ETL workflow for a data engineering assignment:
- **Data Ingestion:** Import data from an RDS (MySQL) database to HDFS using Sqoop.
- **ETL Processing:** Use Apache Spark for data transformation and loading.
- **Analytical Queries:** Execute complex analytical queries on an AWS Redshift cluster to derive insights.
The provided documents include detailed Redshift queries, cluster setup screenshots, Spark ETL code, and Sqoop data ingestion commands.
---
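## Technologies & Tools
- Python (Jupyter Notebook)
- Apache Spark
- Apache Sqoop
- Hadoop (HDFS)
- Amazon RDS (MySQL)
- AWS Redshift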
---
## Data Flow Diagram
```mermaid
flowchart TD
A[RDS - MySQL] --> B[Sqoop Import]
B --> C[HDFS]
C --> D[Spark ETL Processing]
D --> E[Data Load]
E --> F[AWS Redshift]
F --> G[Analytical Queries]
```
---
## Project Structure
```plaintext
DataEngineeringETL/
├── RedshiftQueries.pdf      # PDF containing analytical queries for the Redshift cluster
├── RedshiftSetup.pdf        # PDF with screenshots and details on setting up the Redshift cluster
├── SparkETLCode.ipynb       # Jupyter Notebook with Spark ETL code and transformation logic
├── SqoopDataIngestion.pdf   # PDF outlining the Sqoop import commands and HDFS data inspection
└── README.md                # Project documentation (this file)
```
---
## Setup & Deployment
### Prerequisites
- **AWS Account:** For setting up Redshift and S3.
- **RDS MySQL Instance:** Source of data.
- **Hadoop Cluster:** For HDFS (local or cloud-based).
- **Apache Sqoop & Spark:** Installed on your data processing cluster.
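
Before running anything, it can help to confirm that the command-line tools are reachable on the machine you are working from. The snippet below is a minimal sketch, not part of the repository; the executable names (`sqoop`, `hadoop`, `spark-submit`) are assumptions about a typical installation and may differ on your cluster.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = ("sqoop", "hadoop", "spark-submit")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

Calling `missing_tools()` returns an empty list when every tool is found. The `which` parameter exists only so the lookup can be swapped out in tests.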
### Setup Steps
1. **Data Ingestion with Sqoop:**
   - Use the Sqoop commands detailed in `SqoopDataIngestion.pdf` to import tables from RDS into HDFS.
   - Verify data import using Hadoop FS commands.
2. **ETL Processing with Spark:**
   - Open `SparkETLCode.ipynb` in Jupyter Notebook.
   - Follow the ETL workflow to clean, transform, and load data.
3. **Redshift Cluster Setup:**
   - Follow the guidelines in `RedshiftSetup.pdf` to create a Redshift cluster and configure databases/tables.
   - Execute the SQL queries from `RedshiftQueries.pdf` on the AWS Redshift Query Editor.
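
The exact Sqoop commands live in `SqoopDataIngestion.pdf`. As a rough illustration of step 1, the sketch below assembles a single-table `sqoop import` and the matching `hadoop fs -ls` check from Python. The function names and every value passed to them (JDBC URL, user, table, paths) are hypothetical placeholders; only the Sqoop and Hadoop flags themselves are standard.

```python
import shlex
import subprocess
from typing import List


def build_sqoop_import(jdbc_url: str, username: str, password_file: str,
                       table: str, target_dir: str,
                       num_mappers: int = 1) -> List[str]:
    """Return the argument list for a single-table `sqoop import`."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        # Reads the password from a file instead of the command line.
        "--password-file", password_file,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),
    ]


def build_hdfs_listing(target_dir: str) -> List[str]:
    """Return the command that lists the imported files in HDFS."""
    return ["hadoop", "fs", "-ls", target_dir]


def run(args: List[str], dry_run: bool = True) -> None:
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(args))
    if not dry_run:
        subprocess.run(args, check=True)
```

With `dry_run=True` (the default), `run(build_sqoop_import(...))` only prints the command, which makes it easy to compare against the commands in the PDF before executing anything.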
---
## Usage
- **Run ETL:** Execute the Spark ETL Notebook (`SparkETLCode.ipynb`) to process and prepare data.
- **Load & Query Data:** Load the transformed data into Redshift and run analytical queries to generate insights.
- **Review Documentation:** Refer to the PDF files for detailed instructions on Redshift setup, query execution, and Sqoop data ingestion.
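
The actual transformations are in `SparkETLCode.ipynb`, and this README does not say how the transformed data reaches Redshift. The sketch below shows one plausible shape for the run-and-load steps, under stated assumptions: the column names (`order_id`, `amount`) and function names are hypothetical, the cleaned data is assumed to be written as Parquet to S3 (listed under Prerequisites), and the load is assumed to use Redshift's `COPY` command with an IAM role.

```python
def clean_orders(df):
    """Deduplicate, drop rows without a key, and cast the amount column.

    `df` is a pyspark.sql.DataFrame; the column names are hypothetical.
    """
    from pyspark.sql import functions as F  # lazy import; requires pyspark

    return (
        df.dropDuplicates()
          .na.drop(subset=["order_id"])
          .withColumn("amount", F.col("amount").cast("double"))
    )


def write_parquet(df, output_path: str) -> None:
    """Write the cleaned DataFrame as Parquet, replacing earlier output."""
    df.write.mode("overwrite").parquet(output_path)


def build_copy_statement(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Return a Redshift COPY statement that loads Parquet files from S3.

    Values are interpolated directly, so pass only trusted identifiers.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )
```

The generated statement can be pasted into the Redshift Query Editor once the target table exists. If the notebook loads data another way (for example, through a JDBC write), follow the notebook instead.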
---
## Call-to-Action
If you find this project useful, please consider:
- **Starring** the repository
- **Forking** to contribute improvements or customizations
- **Following** for updates on similar data engineering projects
Your engagement is greatly appreciated and helps boost visibility!
---
## License
This project is licensed under the [MIT License](LICENSE).
---
## Acknowledgements
- **AWS & Azure:** For providing robust cloud infrastructure.
- **Data Engineering Community:** For continuous inspiration and support.
---
*Happy Data Engineering!*