# Data Engineering 101

A comprehensive overview of the end-to-end data engineering lifecycle — from data collection to analytics, monitoring, and governance.

---

## Table of Contents

1. [Data Engineering Process Overview](#data-engineering-process-overview)
2. [End-to-End Data Pipeline Architecture](#end-to-end-data-pipeline-architecture)
3. [Data Sources](#data-sources)
4. [Data Ingestion](#data-ingestion)
5. [Data Storage (Raw / Landing Zone)](#data-storage-raw--landing-zone)
6. [Data Processing](#data-processing)
7. [Data Transformation and Cleaning](#data-transformation-and-cleaning)
8. [Transformed Data Storage (Processed Zone)](#transformed-data-storage-processed-zone)
9. [Data Modeling](#data-modeling)
10. [Data Serving / BI Layer](#data-serving--bi-layer)
11. [Monitoring and Logging](#monitoring-and-logging)
12. [Data Governance and Quality](#data-governance-and-quality)
13. [Summary Flow](#summary-flow)

---

## Data Engineering Process Overview

This document outlines the complete data engineering lifecycle, from raw data collection through processing, transformation, modeling, and governance.

---

## End-to-End Data Pipeline Architecture

The pipeline carries data from ingestion out of multiple sources, through raw storage, processing, transformation, and modeling, to analytics and serving, with monitoring and governance applied throughout. The sections below walk through each stage.

---

## Data Sources

Raw data is collected from multiple systems across various formats and technologies.

| Source Type | Examples |
|--------------|--------------------------------------|
| Databases | MySQL, PostgreSQL, MongoDB |
| APIs | REST, GraphQL |
| Files | CSV, JSON, Parquet |
| Other | IoT Devices, Application Logs |
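
A minimal sketch of pulling raw data from these source types in Python; the file paths, API endpoint, and connection string are hypothetical placeholders.

```python
import pandas as pd
import requests

# Files: CSV / JSON / Parquet extracts (paths are placeholders)
orders = pd.read_csv("exports/orders.csv")
events = pd.read_json("exports/events.json", lines=True)

# APIs: a REST endpoint returning JSON (URL is hypothetical)
resp = requests.get("https://api.example.com/v1/customers", timeout=30)
resp.raise_for_status()
customers = pd.DataFrame(resp.json())

# Databases: pandas reads from any SQLAlchemy-compatible connection
# engine = sqlalchemy.create_engine("postgresql://user:pass@host:5432/shop")
# products = pd.read_sql("SELECT * FROM products", engine)
```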

---

## Data Ingestion

Data ingestion refers to collecting and moving data to a central location for further processing.

| Mode | Tools |
|-------------|--------------------------------------------|
| Batch | Apache Airflow, AWS Glue, Azure Data Factory |
| Real-Time | Kafka, Apache Flume, AWS Kinesis, NiFi |

Purpose: To bring data from multiple heterogeneous sources into a unified system efficiently.
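
As a sketch of batch ingestion, the Apache Airflow DAG below schedules a daily extraction task; the DAG id, task name, and extraction logic are illustrative placeholders, and the `schedule` argument is named `schedule_interval` on Airflow versions before 2.4.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_and_land(**context):
    """Pull the latest records from a source system and write them to the landing zone."""
    # Placeholder for source-specific extraction logic (JDBC pull, API call, file copy, ...)
    ...


# A daily batch ingestion DAG; names and schedule are illustrative.
with DAG(
    dag_id="ingest_orders_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    land_raw_orders = PythonOperator(
        task_id="land_raw_orders",
        python_callable=extract_and_land,
    )
```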

---

## Data Storage (Raw / Landing Zone)

Unprocessed data is stored in a raw storage layer, often referred to as a data lake.

| Storage Type | Examples |
|------------------|------------------------------------------|
| Cloud Storage | AWS S3, Azure Blob, Google Cloud Storage |
| Distributed FS | HDFS (Hadoop Distributed File System) |

Also Known As: Data Lake
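
A minimal sketch of landing an extracted file in S3 with boto3, using a date-partitioned key layout; the bucket name, prefix, and local file path are hypothetical.

```python
from datetime import date

import boto3

s3 = boto3.client("s3")

# A common landing-zone convention: partition raw files by source system and ingestion date.
today = date.today().isoformat()
key = f"raw/orders_db/ingest_date={today}/orders.parquet"

# Upload the extracted file as-is -- no transformation happens in the landing zone.
s3.upload_file("exports/orders.parquet", "my-data-lake-bucket", key)
```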

---

## Data Processing

Raw data is processed to make it analyzable. This includes operations like filtering, merging, and aggregation.

| Type | Description | Tools |
|------------|-------------------------|------------------------------------------|
| Batch | Periodic large-scale jobs | Apache Spark, PySpark, Hive, Presto |
| Streaming | Continuous data flows | Apache Flink, Google Dataflow |
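
A small PySpark batch job illustrating the filter / join / aggregate pattern; the input paths, column names, and output location are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_order_metrics").getOrCreate()

# Read raw files from the landing zone (paths are placeholders).
orders = spark.read.parquet("s3a://my-data-lake-bucket/raw/orders_db/")
customers = spark.read.parquet("s3a://my-data-lake-bucket/raw/crm/customers/")

# Filter, join, and aggregate -- typical batch-processing operations.
daily_revenue = (
    orders.filter(F.col("status") == "completed")
    .join(customers, on="customer_id", how="inner")
    .groupBy("order_date", "country")
    .agg(F.sum("amount").alias("revenue"), F.count("*").alias("orders"))
)

daily_revenue.write.mode("overwrite").parquet(
    "s3a://my-data-lake-bucket/processed/daily_revenue/"
)
```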

---

## Data Transformation and Cleaning

Data is cleaned, validated, and enriched to ensure consistency, accuracy, and usability.

| Tools | Common Activities |
|--------------------------------|----------------------------------------------|
| Python (Pandas), SQL, DBT, PySpark | Remove null values, join datasets, validate schemas |
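
A minimal cleaning sketch in Pandas covering null removal, type and range validation, de-duplication, and enrichment via a join; file and column names are placeholders.

```python
import pandas as pd

# Hypothetical raw extracts from the landing zone.
orders = pd.read_parquet("raw/orders.parquet")
customers = pd.read_parquet("raw/customers.parquet")

# Remove records missing required fields.
orders = orders.dropna(subset=["order_id", "customer_id", "amount"])

# Enforce a simple schema: types and value ranges.
orders["amount"] = orders["amount"].astype(float)
assert (orders["amount"] >= 0).all(), "negative order amounts found"

# De-duplicate and enrich by joining to the customer data.
orders = orders.drop_duplicates(subset=["order_id"])
clean = orders.merge(customers, on="customer_id", how="left", validate="many_to_one")

clean.to_parquet("processed/orders_clean.parquet", index=False)
```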

---

## Transformed Data Storage (Processed Zone)

This layer stores structured and validated data optimized for analysis or downstream querying.

| Storage Type | Tools |
|-----------------|-----------------------------------------------|
| Data Warehouses | Amazon Redshift, Snowflake, BigQuery, Azure Synapse |

Purpose: Enable fast querying for analytics and reporting needs.
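
A hedged sketch of bulk-loading processed Parquet files into a warehouse table. The `COPY` syntax shown is Amazon Redshift's (Snowflake and BigQuery have their own bulk-load commands), and the cluster, IAM role, and credentials are placeholders.

```python
import psycopg2  # Redshift speaks the PostgreSQL wire protocol

copy_sql = """
    COPY analytics.daily_revenue
    FROM 's3://my-data-lake-bucket/processed/daily_revenue/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'
    FORMAT AS PARQUET;
"""

# Connection details are hypothetical; the with-block commits on success.
with psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                      port=5439, dbname="analytics",
                      user="loader", password="...") as conn:
    with conn.cursor() as cur:
        cur.execute(copy_sql)
```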

---

## Data Modeling

The structured data is organized into schemas designed for analysis.

| Schema Type | Description |
|----------------|----------------------------------------------|
| Star Schema | Central fact table with supporting dimensions |
| Snowflake Schema | Normalized, multi-level structure with additional joins |

Tools: SQL, DBT
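
A minimal star-schema sketch using an in-memory SQLite database so it runs anywhere; the fact and dimension tables, columns, and query are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One central fact table surrounded by dimension tables.
conn.executescript("""
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE fact_sales  (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER,
        amount      REAL
    );
""")

# A typical analytical query: the fact table joined to its dimensions.
rows = conn.execute("""
    SELECT d.month, p.category, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.month, p.category
""").fetchall()
```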

---

## Data Serving / BI Layer

This layer exposes transformed data to end users through analytical tools and dashboards.

| Tool | Purpose |
|--------------|--------------------------------------------|
| Apache Superset | Open-source dashboarding and visualization |
| Tableau | Interactive visual analytics |
| Power BI | Business intelligence reporting |
| Looker | Data exploration and modeling platform |

Used By: Data analysts, business stakeholders, and executives.

---

## Monitoring and Logging

Monitoring ensures that each component of the pipeline functions correctly and that failures are tracked.

| Tool | Use Case |
|---------------|-----------------------------------------------|
| Airflow UI | Monitor and debug pipeline workflows |
| Grafana | Time-series visualization and alerting |
| AWS CloudWatch | Metrics, logs, and custom alerts |
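
A library-free sketch of per-step logging for a pipeline; in practice these duration and failure signals would be surfaced through the Airflow UI, Grafana, or CloudWatch. The logger and step names are illustrative.

```python
import logging
import time

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("pipeline.daily_revenue")


def run_step(name, fn):
    """Run one pipeline step, logging its duration and any failure."""
    start = time.monotonic()
    try:
        result = fn()
        log.info("step=%s status=success duration_s=%.1f", name, time.monotonic() - start)
        return result
    except Exception:
        log.exception("step=%s status=failed duration_s=%.1f", name, time.monotonic() - start)
        raise  # let the orchestrator mark the task as failed and trigger alerts
```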

---

## Data Governance and Quality

Data governance ensures security, compliance, and reliability across the entire data ecosystem.

| Focus Area | Tools / Techniques |
|--------------------|------------------------------------------------------|
| Access Management | Role-based access controls, audit trails |
| Data Quality | Great Expectations, Monte Carlo, Soda Core |
| Sensitive Data | AWS Macie, Dataplex, Schema Validation (HIPAA/GDPR) |
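
A library-free sketch of the kinds of checks that data-quality tools such as Great Expectations or Soda Core express declaratively; the dataset path and column names are hypothetical.

```python
import pandas as pd

df = pd.read_parquet("processed/orders_clean.parquet")

# Simple declarative-style checks: uniqueness, completeness, value ranges.
checks = {
    "order_id is unique":       df["order_id"].is_unique,
    "customer_id has no nulls": df["customer_id"].notna().all(),
    "amount is non-negative":   (df["amount"] >= 0).all(),
    "order_date is in range":   df["order_date"].between("2020-01-01", "2030-12-31").all(),
}

failed = [name for name, passed in checks.items() if not passed]
if failed:
    raise ValueError(f"data quality checks failed: {failed}")
```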

---

## Summary Flow

1. Data Sources
2. Data Ingestion
3. Raw Storage (Data Lake)
4. Data Processing (Batch / Streaming)
5. Data Transformation & Cleaning
6. Processed Storage (Data Warehouse)
7. Data Modeling
8. BI / Dashboards
9. Monitoring & Logging
10. Data Governance & Quality