# Data Engineering 101

A comprehensive overview of the end-to-end data engineering lifecycle — from data collection to analytics, monitoring, and governance.

---

## Table of Contents

1. [Data Engineering Process Overview](#data-engineering-process-overview)
2. [End-to-End Data Pipeline Architecture](#end-to-end-data-pipeline-architecture)
3. [Data Sources](#data-sources)
4. [Data Ingestion](#data-ingestion)
5. [Data Storage (Raw / Landing Zone)](#data-storage-raw--landing-zone)
6. [Data Processing](#data-processing)
7. [Data Transformation and Cleaning](#data-transformation-and-cleaning)
8. [Transformed Data Storage (Processed Zone)](#transformed-data-storage-processed-zone)
9. [Data Modeling](#data-modeling)
10. [Data Serving / BI Layer](#data-serving--bi-layer)
11. [Monitoring and Logging](#monitoring-and-logging)
12. [Data Governance and Quality](#data-governance-and-quality)
13. [Summary Flow](#summary-flow)

---

## Data Engineering Process Overview

This document outlines the complete data engineering lifecycle, from raw data collection through processing, transformation, modeling, and governance.

---

## End-to-End Data Pipeline Architecture

The pipeline carries data from ingestion out of multiple sources, through raw storage, processing, transformation, and modeling, to analytics and serving, with monitoring and governance applied throughout. The sections below walk through each stage.

---

## Data Sources

Raw data is collected from multiple systems across various formats and technologies.

| Source Type | Examples |
|--------------|--------------------------------------|
| Databases | MySQL, PostgreSQL, MongoDB |
| APIs | REST, GraphQL |
| Files | CSV, JSON, Parquet |
| Other | IoT Devices, Application Logs |
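
A minimal sketch of pulling raw data from these source types in Python; the file paths, API endpoint, and connection string are hypothetical placeholders.

```python
import pandas as pd
import requests

# Files: CSV / JSON / Parquet extracts (paths are placeholders)
orders = pd.read_csv("exports/orders.csv")
events = pd.read_json("exports/events.json", lines=True)

# APIs: a REST endpoint returning JSON (URL is hypothetical)
resp = requests.get("https://api.example.com/v1/customers", timeout=30)
resp.raise_for_status()
customers = pd.DataFrame(resp.json())

# Databases: pandas reads from any SQLAlchemy-compatible connection
# engine = sqlalchemy.create_engine("postgresql://user:pass@host:5432/shop")
# products = pd.read_sql("SELECT * FROM products", engine)
```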

---

## Data Ingestion

Data ingestion refers to collecting and moving data to a central location for further processing.

| Mode | Tools |
|-------------|--------------------------------------------|
| Batch | Apache Airflow, AWS Glue, Azure Data Factory |
| Real-Time | Kafka, Apache Flume, AWS Kinesis, NiFi |

Purpose: To bring data from multiple heterogeneous sources into a unified system efficiently.
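
As a sketch of batch ingestion, the Apache Airflow DAG below schedules a daily extraction task; the DAG id, task name, and extraction logic are illustrative placeholders, and the `schedule` argument is named `schedule_interval` on Airflow versions before 2.4.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_and_land(**context):
    """Pull the latest records from a source system and write them to the landing zone."""
    # Placeholder for source-specific extraction logic (JDBC pull, API call, file copy, ...)
    ...


# A daily batch ingestion DAG; names and schedule are illustrative.
with DAG(
    dag_id="ingest_orders_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    land_raw_orders = PythonOperator(
        task_id="land_raw_orders",
        python_callable=extract_and_land,
    )
```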

---

## Data Storage (Raw / Landing Zone)

Unprocessed data is stored in a raw storage layer, often referred to as a data lake.

| Storage Type | Examples |
|------------------|------------------------------------------|
| Cloud Storage | AWS S3, Azure Blob, Google Cloud Storage |
| Distributed FS | HDFS (Hadoop Distributed File System) |

Also Known As: Data Lake
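
A minimal sketch of landing an extracted file in S3 with boto3, using a date-partitioned key layout; the bucket name, prefix, and local file path are hypothetical.

```python
from datetime import date

import boto3

s3 = boto3.client("s3")

# A common landing-zone convention: partition raw files by source system and ingestion date.
today = date.today().isoformat()
key = f"raw/orders_db/ingest_date={today}/orders.parquet"

# Upload the extracted file as-is -- no transformation happens in the landing zone.
s3.upload_file("exports/orders.parquet", "my-data-lake-bucket", key)
```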

---

## Data Processing

Raw data is processed to make it analyzable. This includes operations like filtering, merging, and aggregation.

| Type | Description | Tools |
|------------|-------------------------|------------------------------------------|
| Batch | Periodic large-scale jobs | Apache Spark, PySpark, Hive, Presto |
| Streaming | Continuous data flows | Apache Flink, Google Dataflow |
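
A small PySpark batch job illustrating the filter / join / aggregate pattern; the input paths, column names, and output location are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_order_metrics").getOrCreate()

# Read raw files from the landing zone (paths are placeholders).
orders = spark.read.parquet("s3a://my-data-lake-bucket/raw/orders_db/")
customers = spark.read.parquet("s3a://my-data-lake-bucket/raw/crm/customers/")

# Filter, join, and aggregate -- typical batch-processing operations.
daily_revenue = (
    orders.filter(F.col("status") == "completed")
    .join(customers, on="customer_id", how="inner")
    .groupBy("order_date", "country")
    .agg(F.sum("amount").alias("revenue"), F.count("*").alias("orders"))
)

daily_revenue.write.mode("overwrite").parquet(
    "s3a://my-data-lake-bucket/processed/daily_revenue/"
)
```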

---

## Data Transformation and Cleaning

Data is cleaned, validated, and enriched to ensure consistency, accuracy, and usability.

| Tools | Common Activities |
|--------------------------------|----------------------------------------------|
| Python (Pandas), SQL, DBT, PySpark | Remove null values, join datasets, validate schemas |
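
A minimal cleaning sketch in Pandas covering null removal, type and range validation, de-duplication, and enrichment via a join; file and column names are placeholders.

```python
import pandas as pd

# Hypothetical raw extracts from the landing zone.
orders = pd.read_parquet("raw/orders.parquet")
customers = pd.read_parquet("raw/customers.parquet")

# Remove records missing required fields.
orders = orders.dropna(subset=["order_id", "customer_id", "amount"])

# Enforce a simple schema: types and value ranges.
orders["amount"] = orders["amount"].astype(float)
assert (orders["amount"] >= 0).all(), "negative order amounts found"

# De-duplicate and enrich by joining to the customer data.
orders = orders.drop_duplicates(subset=["order_id"])
clean = orders.merge(customers, on="customer_id", how="left", validate="many_to_one")

clean.to_parquet("processed/orders_clean.parquet", index=False)
```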

---

## Transformed Data Storage (Processed Zone)

This layer stores structured and validated data optimized for analysis or downstream querying.

| Storage Type | Tools |
|-----------------|-----------------------------------------------|
| Data Warehouses | Amazon Redshift, Snowflake, BigQuery, Azure Synapse |

Purpose: Enable fast querying for analytics and reporting needs.
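
A hedged sketch of bulk-loading processed Parquet files into a warehouse table. The `COPY` syntax shown is Amazon Redshift's (Snowflake and BigQuery have their own bulk-load commands), and the cluster, IAM role, and credentials are placeholders.

```python
import psycopg2  # Redshift speaks the PostgreSQL wire protocol

copy_sql = """
    COPY analytics.daily_revenue
    FROM 's3://my-data-lake-bucket/processed/daily_revenue/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'
    FORMAT AS PARQUET;
"""

# Connection details are hypothetical; the with-block commits on success.
with psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                      port=5439, dbname="analytics",
                      user="loader", password="...") as conn:
    with conn.cursor() as cur:
        cur.execute(copy_sql)
```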

---

## Data Modeling

The structured data is organized into schemas designed for analysis.

| Schema Type | Description |
|----------------|----------------------------------------------|
| Star Schema | Central fact table with supporting dimensions |
| Snowflake Schema | Normalized, multi-level structure with additional joins |

Tools: SQL, DBT
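
A minimal star-schema sketch using an in-memory SQLite database so it runs anywhere; the fact and dimension tables, columns, and query are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One central fact table surrounded by dimension tables.
conn.executescript("""
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE fact_sales  (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER,
        amount      REAL
    );
""")

# A typical analytical query: the fact table joined to its dimensions.
rows = conn.execute("""
    SELECT d.month, p.category, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.month, p.category
""").fetchall()
```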

---

## Data Serving / BI Layer

This layer exposes transformed data to end users through analytical tools and dashboards.

| Tool | Purpose |
|--------------|--------------------------------------------|
| Apache Superset | Open-source dashboarding and visualization |
| Tableau | Interactive visual analytics |
| Power BI | Business intelligence reporting |
| Looker | Data exploration and modeling platform |

Used By: Data analysts, business stakeholders, and executives.

---

## Monitoring and Logging

Monitoring ensures that each component of the pipeline functions correctly and that failures are tracked.

| Tool | Use Case |
|---------------|-----------------------------------------------|
| Airflow UI | Monitor and debug pipeline workflows |
| Grafana | Time-series visualization and alerting |
| AWS CloudWatch | Metrics, logs, and custom alerts |
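
A library-free sketch of per-step logging for a pipeline; in practice these duration and failure signals would be surfaced through the Airflow UI, Grafana, or CloudWatch. The logger and step names are illustrative.

```python
import logging
import time

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("pipeline.daily_revenue")


def run_step(name, fn):
    """Run one pipeline step, logging its duration and any failure."""
    start = time.monotonic()
    try:
        result = fn()
        log.info("step=%s status=success duration_s=%.1f", name, time.monotonic() - start)
        return result
    except Exception:
        log.exception("step=%s status=failed duration_s=%.1f", name, time.monotonic() - start)
        raise  # let the orchestrator mark the task as failed and trigger alerts
```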

---

## Data Governance and Quality

Data governance ensures security, compliance, and reliability across the entire data ecosystem.

| Focus Area | Tools / Techniques |
|--------------------|------------------------------------------------------|
| Access Management | Role-based access controls, audit trails |
| Data Quality | Great Expectations, Monte Carlo, Soda Core |
| Sensitive Data | AWS Macie, Dataplex, Schema Validation (HIPAA/GDPR) |
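
A library-free sketch of the kinds of checks that data-quality tools such as Great Expectations or Soda Core express declaratively; the dataset path and column names are hypothetical.

```python
import pandas as pd

df = pd.read_parquet("processed/orders_clean.parquet")

# Simple declarative-style checks: uniqueness, completeness, value ranges.
checks = {
    "order_id is unique":       df["order_id"].is_unique,
    "customer_id has no nulls": df["customer_id"].notna().all(),
    "amount is non-negative":   (df["amount"] >= 0).all(),
    "order_date is in range":   df["order_date"].between("2020-01-01", "2030-12-31").all(),
}

failed = [name for name, passed in checks.items() if not passed]
if failed:
    raise ValueError(f"data quality checks failed: {failed}")
```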

---

## Summary Flow

1. Data Sources
2. Data Ingestion
3. Raw Storage (Data Lake)
4. Data Processing (Batch / Streaming)
5. Data Transformation & Cleaning
6. Processed Storage (Data Warehouse)
7. Data Modeling
8. BI / Dashboards
9. Monitoring & Logging
10. Data Governance & Quality