https://github.com/priyanshubiswas-tech/data-101
Comprehensive Data Engineering prep repository covering concepts, LeetCode, demos, and projects on SQL, Spark, Hadoop, ETL, Data Warehousing, and more.
- Host: GitHub
- URL: https://github.com/priyanshubiswas-tech/data-101
- Owner: priyanshubiswas-tech
- Created: 2025-05-21T03:43:48.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-11-16T15:04:18.000Z (4 months ago)
- Last Synced: 2025-11-16T17:08:54.516Z (4 months ago)
- Topics: aws, azure, ci-cd, data-architecture, data-engineering, etl-pipeline, hadoop, leetcode, python, spark, sql
- Language: Jupyter Notebook
- Homepage:
- Size: 49.5 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Data Engineering 101
A comprehensive overview of the end-to-end data engineering lifecycle — from data collection to analytics, monitoring, and governance.
---
## Table of Contents
1. [Data Engineering Process Overview](#data-engineering-process-overview)
2. [End-to-End Data Pipeline Architecture](#end-to-end-data-pipeline-architecture)
3. [Data Sources](#data-sources)
4. [Data Ingestion](#data-ingestion)
5. [Data Storage (Raw / Landing Zone)](#data-storage-raw--landing-zone)
6. [Data Processing](#data-processing)
7. [Data Transformation and Cleaning](#data-transformation-and-cleaning)
8. [Transformed Data Storage (Processed Zone)](#transformed-data-storage-processed-zone)
9. [Data Modeling](#data-modeling)
10. [Data Serving / BI Layer](#data-serving--bi-layer)
11. [Monitoring and Logging](#monitoring-and-logging)
12. [Data Governance and Quality](#data-governance-and-quality)
13. [Summary Flow](#summary-flow)
---
## Data Engineering Process Overview
This document outlines the complete data engineering lifecycle, from raw data collection through processing, transformation, modeling, and governance.
---
## End-to-End Data Pipeline Architecture
The pipeline flows from ingestion across multiple heterogeneous sources, through storage and processing layers, and ends with analytics, monitoring, and governance; each stage is detailed in the sections that follow.
---
## Data Sources
Raw data is collected from multiple systems across various formats and technologies.
| Source Type | Examples |
|--------------|--------------------------------------|
| Databases | MySQL, PostgreSQL, MongoDB |
| APIs | REST, GraphQL |
| Files | CSV, JSON, Parquet |
| Other | IoT Devices, Application Logs |
---
## Data Ingestion
Data ingestion refers to collecting and moving data to a central location for further processing.
| Mode | Tools |
|-------------|--------------------------------------------|
| Batch | Apache Airflow, AWS Glue, Azure Data Factory |
| Real-Time | Kafka, Apache Flume, AWS Kinesis, NiFi |
Purpose: To bring data from multiple heterogeneous sources into a unified system efficiently.
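A minimal batch-ingestion sketch of the step above: copy a CSV extract into a date-partitioned landing zone as JSON lines. The paths, `source=`/`dt=` partition convention, and file names are illustrative assumptions, not part of any specific tool.

```python
import csv
import json
import tempfile
from datetime import date
from pathlib import Path

def ingest_csv_to_landing(src_path: Path, landing_root: Path, source_name: str) -> Path:
    """Land a CSV extract as JSON lines under a date-partitioned path."""
    partition = landing_root / f"source={source_name}" / f"dt={date.today().isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    out_path = partition / "part-00000.jsonl"
    with src_path.open(newline="") as src, out_path.open("w") as out:
        for row in csv.DictReader(src):
            out.write(json.dumps(row) + "\n")  # one record per line
    return out_path

# Demo with a throwaway source file standing in for a real extract
tmp = Path(tempfile.mkdtemp())
src = tmp / "orders.csv"
src.write_text("order_id,amount\n1,9.99\n2,24.50\n")
landed = ingest_csv_to_landing(src, tmp / "landing", "orders")
print(landed)
```

In production, an orchestrator such as Airflow would schedule this function and the landing root would be object storage (S3, Azure Blob) rather than a local directory.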
---
## Data Storage (Raw / Landing Zone)
Unprocessed data is stored in a raw storage layer, often referred to as a data lake.
| Storage Type | Examples |
|------------------|------------------------------------------|
| Cloud Storage | AWS S3, Azure Blob, Google Cloud Storage |
| Distributed FS | HDFS (Hadoop Distributed File System) |
Also Known As: Data Lake
---
## Data Processing
Raw data is processed to make it analyzable. This includes operations like filtering, merging, and aggregation.
| Type | Description | Tools |
|------------|-------------------------|------------------------------------------|
| Batch | Periodic large-scale jobs | Apache Spark, PySpark, Hive, Presto |
| Streaming | Continuous data flows | Apache Flink, Google Dataflow |
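The core of a batch job is a filter-then-aggregate pass over records. Here is a toy, dependency-free sketch (the data and field names are made up); the PySpark equivalent is noted in the comment.

```python
from collections import defaultdict

# Toy batch job: drop refunds, then sum revenue per customer.
# In PySpark this would be roughly:
#   df.filter(df.amount > 0).groupBy("customer").agg(F.sum("amount"))
events = [
    {"customer": "a", "amount": 10.0},
    {"customer": "b", "amount": -5.0},  # refund, filtered out
    {"customer": "a", "amount": 2.5},
]

revenue = defaultdict(float)
for e in events:
    if e["amount"] > 0:                        # filter
        revenue[e["customer"]] += e["amount"]  # aggregate

print(dict(revenue))  # {'a': 12.5}
```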
---
## Data Transformation and Cleaning
Data is cleaned, validated, and enriched to ensure consistency, accuracy, and usability.
| Tools | Common Activities |
|--------------------------------|----------------------------------------------|
| Python (Pandas), SQL, dbt, PySpark | Remove null values, join datasets, validate schemas |
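A Pandas sketch of all three activities on hypothetical extracts (the `orders`/`customers` tables and their columns are invented for illustration): drop rows with null keys, enforce types, and enrich via a join.

```python
import pandas as pd

# Hypothetical raw extracts: orders plus a customer lookup table.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 10, None],       # one record is missing its key
    "amount": ["9.99", "24.50", "3.00"], # amounts arrived as strings
})
customers = pd.DataFrame({"customer_id": [10], "region": ["EU"]})

clean = (
    orders
    .dropna(subset=["customer_id"])                  # remove null keys
    .astype({"customer_id": int, "amount": float})   # enforce a schema
    .merge(customers, on="customer_id", how="left")  # enrich via join
)
print(clean[["order_id", "region", "amount"]].to_dict("records"))
```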
---
## Transformed Data Storage (Processed Zone)
This layer stores structured and validated data optimized for analysis or downstream querying.
| Storage Type | Tools |
|-----------------|-----------------------------------------------|
| Data Warehouses | Amazon Redshift, Snowflake, BigQuery, Azure Synapse |
Purpose: Enable fast querying for analytics and reporting needs.
---
## Data Modeling
The structured data is organized into schemas designed for analysis.
| Schema Type | Description |
|----------------|----------------------------------------------|
| Star Schema | Central fact table with supporting dimensions |
| Snowflake Schema | Normalized, multi-level structure with additional joins |
Tools: SQL, dbt
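A minimal star schema, sketched in SQLite so it runs anywhere (table and column names are illustrative): one fact table keyed to two dimensions, queried with the typical fact-to-dimension join.

```python
import sqlite3

# One fact table surrounded by dimension tables: a minimal star schema.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, iso_date TEXT);
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE fact_sales (
        date_key INTEGER REFERENCES dim_date,
        customer_key INTEGER REFERENCES dim_customer,
        amount REAL
    );
    INSERT INTO dim_date VALUES (1, '2025-05-21');
    INSERT INTO dim_customer VALUES (1, 'EU'), (2, 'US');
    INSERT INTO fact_sales VALUES (1, 1, 9.99), (1, 2, 24.50), (1, 1, 3.00);
""")

# Typical analytical query: revenue by region via a fact-to-dimension join.
rows = con.execute("""
    SELECT c.region, ROUND(SUM(f.amount), 2) AS revenue
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()
print(rows)
```

In a warehouse (Redshift, BigQuery, Snowflake) the same DDL and query shape apply; dbt models would typically materialize the dimension and fact tables.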
---
## Data Serving / BI Layer
This layer exposes transformed data to end users through analytical tools and dashboards.
| Tool | Purpose |
|--------------|--------------------------------------------|
| Apache Superset | Open-source dashboarding and visualization |
| Tableau | Interactive visual analytics |
| Power BI | Business intelligence reporting |
| Looker | Data exploration and modeling platform |
Used By: Data analysts, business stakeholders, and executives.
---
## Monitoring and Logging
Monitoring ensures that each component of the pipeline functions correctly and that failures are tracked.
| Tool | Use Case |
|---------------|-----------------------------------------------|
| Airflow UI | Monitor and debug pipeline workflows |
| Grafana | Time-series visualization and alerting |
| AWS CloudWatch| Metrics, logs, and custom alerts |
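At the task level, monitoring starts with structured logs that downstream tools (Grafana, CloudWatch) can parse and alert on. A small sketch, with made-up step names and a `key=value` log format chosen for easy parsing:

```python
import logging
from io import StringIO

# Capture logs in memory for the demo; in production this handler would
# point at stdout or a log shipper.
buf = StringIO()
log = logging.getLogger("pipeline.orders")
log.setLevel(logging.INFO)
log.addHandler(logging.StreamHandler(buf))
log.propagate = False

def run_step(name, fn):
    """Run one pipeline step, logging start, success, or failure."""
    log.info("step=%s status=started", name)
    try:
        fn()
        log.info("step=%s status=succeeded", name)
    except Exception:
        log.exception("step=%s status=failed", name)
        raise

run_step("extract", lambda: None)
print(buf.getvalue().strip())
```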
---
## Data Governance and Quality
Data governance guarantees security, compliance, and reliability across the entire ecosystem.
| Focus Area | Tools / Techniques |
|--------------------|------------------------------------------------------|
| Access Management | Role-based access controls, audit trails |
| Data Quality | Great Expectations, Monte Carlo, Soda Core |
| Sensitive Data | AWS Macie, Dataplex, Schema Validation (HIPAA/GDPR) |
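Tools like Great Expectations and Soda Core declare checks such as "column is unique" or "column is not null" as reusable expectations. A hand-rolled sketch of the same idea (the rules and field names are invented for illustration):

```python
def check_batch(rows):
    """Return a list of failed quality checks for a batch of records."""
    failures = []
    ids = [r["order_id"] for r in rows]
    if len(ids) != len(set(ids)):                # uniqueness check
        failures.append("order_id not unique")
    if any(r["amount"] is None for r in rows):   # completeness check
        failures.append("amount contains nulls")
    if any(r["amount"] is not None and r["amount"] < 0 for r in rows):
        failures.append("amount below 0")        # range check
    return failures

batch = [
    {"order_id": 1, "amount": 9.99},
    {"order_id": 1, "amount": None},  # duplicate key and a null value
]
print(check_batch(batch))  # ['order_id not unique', 'amount contains nulls']
```

A clean batch returns an empty list, which a pipeline can use as a gate before promoting data to the processed zone.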
---
## Summary Flow
- Data Sources
- Data Ingestion
- Raw Storage (Data Lake)
- Data Processing (Batch / Streaming)
- Data Transformation & Cleaning
- Processed Storage (Data Warehouse)
- Data Modeling
- BI / Dashboards
- Monitoring & Logging
- Data Governance & Quality