https://github.com/dadananjesha/redshift-etl-project

The project covers the complete data pipeline, from importing data from an RDS source to HDFS using Sqoop, through processing data with Spark, to executing analytical queries on an AWS Redshift cluster.

apache-spark aws data-engineering-etl-assignment data-ingestion data-pipeline etl-processes hdfs rds redshift spark sqoop


# Redshift-ETL-Project 🚀🔧

[![Python Version](https://img.shields.io/badge/Python-3.8%2B-blue.svg)](https://www.python.org/) [![Apache Spark](https://img.shields.io/badge/Apache%20Spark-3.0%2B-orange.svg)](https://spark.apache.org/) [![AWS Redshift](https://img.shields.io/badge/AWS%20Redshift-FF9900?style=for-the-badge&logo=amazonaws&logoColor=white)](https://aws.amazon.com/redshift/) [![Hadoop](https://img.shields.io/badge/Hadoop-3.x%20-green.svg)](https://hadoop.apache.org/) [![Sqoop](https://img.shields.io/badge/Sqoop-1.4.7-brightgreen.svg)](https://sqoop.apache.org/) [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)

**Data Engineering ETL Project** demonstrates data ingestion, ETL processing, and analytical querying using AWS Redshift, Apache Spark, and Sqoop. It covers the complete data pipeline: importing data from an RDS source into HDFS with Sqoop, processing it with Spark, and running analytical queries on an AWS Redshift cluster.

---

## 📖 Overview

This project is designed to showcase a real-world ETL workflow for a data engineering assignment:
- **Data Ingestion:** Import data from an RDS (MySQL) database to HDFS using Sqoop.
- **ETL Processing:** Use Apache Spark for data transformation and loading.
- **Analytical Queries:** Execute complex analytical queries on an AWS Redshift cluster to derive insights.

The provided documents include detailed Redshift queries, cluster setup screenshots, Spark ETL code, and Sqoop data ingestion commands.

---

## 🛠️ Technologies & Tools

- Python 3.8+
- Apache Spark 3.0+
- AWS Redshift
- Hadoop 3.x
- Sqoop 1.4.7

---

## 🔄 Data Flow Diagram

```mermaid
flowchart TD
A[🗄️ RDS - MySQL] --> B[📥 Sqoop Import]
B --> C[📁 HDFS]
C --> D[🔄 Spark ETL Processing]
D --> E[📤 Data Load]
E --> F[AWS Redshift]
F --> G[🔍 Analytical Queries]
```

---

## 🗂️ Project Structure

```plaintext
DataEngineeringETL/
├── RedshiftQueries.pdf      # PDF containing analytical queries for the Redshift cluster
├── RedshiftSetup.pdf        # PDF with screenshots and details on setting up the Redshift cluster
├── SparkETLCode.ipynb       # Jupyter Notebook with Spark ETL code and transformation logic
├── SqoopDataIngestion.pdf   # PDF outlining the Sqoop import commands and HDFS data inspection
└── README.md                # Project documentation (this file)
```

---

## 💻 Setup & Deployment

### Prerequisites

- **AWS Account:** For setting up Redshift and S3.
- **RDS MySQL Instance:** Source of data.
- **Hadoop Cluster:** For HDFS (local or cloud-based).
- **Apache Sqoop & Spark:** Installed on your data processing cluster.
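
Before starting, it can help to confirm that the command-line tools are reachable from the machine you will work on. The following is a minimal sketch, not part of the repository: it assumes the usual launcher names `sqoop`, `hdfs`, and `spark-submit`, which may differ on your distribution.

```python
import shutil

# Launcher names assumed for each prerequisite; adjust them to your installation.
REQUIRED_TOOLS = ("sqoop", "hdfs", "spark-submit")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required command-line tools were found.")
```

This only checks that the executables exist; it does not verify versions or cluster connectivity.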

### Setup Steps

1. **Data Ingestion with Sqoop:**
   - Use the Sqoop commands detailed in `SqoopDataIngestion.pdf` to import tables from RDS into HDFS.
   - Verify the data import using Hadoop FS commands.

2. **ETL Processing with Spark:**
   - Open `SparkETLCode.ipynb` in Jupyter Notebook.
   - Follow the ETL workflow to clean, transform, and load data.

3. **Redshift Cluster Setup:**
   - Follow the guidelines in `RedshiftSetup.pdf` to create a Redshift cluster and configure databases/tables.
   - Execute the SQL queries from `RedshiftQueries.pdf` in the AWS Redshift Query Editor.
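
The exact commands used in this project live in `SqoopDataIngestion.pdf`. As a rough illustration of step 1 only, the sketch below assembles a typical single-table `sqoop import` and the matching `hdfs dfs -ls` check. The helper names, connection values, and paths are placeholders invented for this example, not taken from the project.

```python
import shlex


def build_sqoop_import(jdbc_url, username, password_file, table, target_dir, num_mappers=1):
    """Build the argument list for a single-table `sqoop import`."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,  # keeps the password off the command line
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),
    ]


def build_hdfs_listing(target_dir):
    """Build the command that lists the imported files in HDFS."""
    return ["hdfs", "dfs", "-ls", target_dir]


if __name__ == "__main__":
    # Every value here is a placeholder; substitute your own endpoint, table, and paths.
    import_cmd = build_sqoop_import(
        jdbc_url="jdbc:mysql://<rds-endpoint>:3306/<database>",
        username="<user>",
        password_file="/user/<user>/.rds-password",
        table="<table>",
        target_dir="/user/<user>/<table>",
    )
    # Print the commands for review instead of running them.
    print(shlex.join(import_cmd))
    print(shlex.join(build_hdfs_listing("/user/<user>/<table>")))
```

Building the command as a list avoids shell-quoting mistakes; once the printed command looks right, the same list can be passed to `subprocess.run(import_cmd, check=True)` on a node where Sqoop is installed.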

---

## 🚀 Usage

- **Run ETL:**
Execute the Spark ETL Notebook (`SparkETLCode.ipynb`) to process and prepare data.

- **Load & Query Data:**
Load the transformed data into Redshift and run analytical queries to generate insights.

- **Review Documentation:**
Refer to the PDF files for detailed instructions on Redshift setup, query execution, and Sqoop data ingestion.
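
The README does not spell out how the transformed data reaches Redshift. One common route, assumed here because S3 appears in the prerequisites, is to have Spark write Parquet files to S3 and then run a Redshift `COPY`. The sketch below is illustrative only: `run_etl`, `build_copy_statement`, the cleaning steps, and every path, table name, and role ARN are hypothetical, and the project's real transformation logic is in `SparkETLCode.ipynb`.

```python
def run_etl(input_path, output_path):
    """Read Sqoop's comma-delimited output, apply basic cleaning, write Parquet.

    The cleaning shown (drop duplicate rows and fully empty rows) is a stand-in
    for the notebook's actual transformations.
    """
    # Imported lazily so the rest of this module works without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("redshift-etl-sketch").getOrCreate()
    try:
        # Sqoop text imports have no header row, so columns arrive as _c0, _c1, ...
        raw = spark.read.csv(input_path, header=False, inferSchema=True)
        cleaned = raw.dropDuplicates().na.drop(how="all")
        cleaned.write.mode("overwrite").parquet(output_path)
    finally:
        spark.stop()


def build_copy_statement(table, s3_path, iam_role_arn):
    """Build a Redshift COPY statement that loads Parquet files from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


if __name__ == "__main__":
    # Placeholder values; replace them with your own table, bucket, and role.
    print(
        build_copy_statement(
            table="<schema>.<table>",
            s3_path="s3://<bucket>/<prefix>/",
            iam_role_arn="arn:aws:iam::<account-id>:role/<role-name>",
        )
    )
```

On a machine with PySpark configured, `run_etl("<hdfs-input-dir>", "<output-dir>")` would produce the Parquet files, and the printed `COPY` statement can then be pasted into the Redshift Query Editor. The target table must already exist with columns in the same order as the Parquet files.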

---

## ⭐️ Call-to-Action

If you find this project useful, please consider:
- **Starring** the repository ⭐
- **Forking** to contribute improvements or customizations
- **Following** for updates on similar data engineering projects

Your engagement is greatly appreciated and helps boost visibility!

---

## 📜 License

This project is licensed under the [MIT License](LICENSE).

---

## 🙏 Acknowledgements

- **AWS & Azure:** For providing robust cloud infrastructure.
- **Data Engineering Community:** For continuous inspiration and support.

---

*Happy Data Engineering! 🚀🔧*