{"id":29943081,"url":"https://github.com/priyanshubiswas-tech/data-101","last_synced_at":"2026-01-25T06:37:28.766Z","repository":{"id":294576494,"uuid":"987417881","full_name":"priyanshubiswas-tech/Data-101","owner":"priyanshubiswas-tech","description":"Comprehensive Data Engineering prep repository covering concepts, LeetCode, demos, and projects on SQL, Spark, Hadoop, ETL, Data Warehousing, and more.","archived":false,"fork":false,"pushed_at":"2025-11-16T15:04:18.000Z","size":51924,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-11-16T17:08:54.516Z","etag":null,"topics":["aws","azure","ci-cd","data-architecture","data-engineering","etl-pipeline","hadoop","leetcode","python","spark","sql"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/priyanshubiswas-tech.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-05-21T03:43:48.000Z","updated_at":"2025-11-16T15:04:22.000Z","dependencies_parsed_at":null,"dependency_job_id":"18004d00-a8eb-490b-b8ec-f2f673cdf63e","html_url":"https://github.com/priyanshubiswas-tech/Data-101","commit_stats":null,"previous_names":["priyanshubiswas-tech/data-engineering-101","priyanshubiswas-tech/data-101"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/priyanshubiswas-tech/Data-101","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/repositories/priyanshubiswas-tech%2FData-101","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/priyanshubiswas-tech%2FData-101/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/priyanshubiswas-tech%2FData-101/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/priyanshubiswas-tech%2FData-101/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/priyanshubiswas-tech","download_url":"https://codeload.github.com/priyanshubiswas-tech/Data-101/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/priyanshubiswas-tech%2FData-101/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28746705,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-25T05:12:38.112Z","status":"ssl_error","status_checked_at":"2026-01-25T05:04:50.338Z","response_time":113,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","azure","ci-cd","data-architecture","data-engineering","etl-pipeline","hadoop","leetcode","python","spark","sql"],"created_at":"2025-08-03T02:13:23.960Z","updated_at":"2026-01-25T06:37:28.761Z","avatar_url":"https://github.com/priyanshubiswas-tech.png","language":"Jupyter 
Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Engineering 101\n\nA comprehensive overview of the end-to-end data engineering lifecycle — from data collection to analytics, monitoring, and governance.\n\n---\n\n## Table of Contents\n\n1. [Data Engineering Process Overview](#data-engineering-process-overview)  \n2. [End-to-End Data Pipeline Architecture](#end-to-end-data-pipeline-architecture)  \n3. [Data Sources](#data-sources)  \n4. [Data Ingestion](#data-ingestion)  \n5. [Data Storage (Raw / Landing Zone)](#data-storage-raw--landing-zone)  \n6. [Data Processing](#data-processing)  \n7. [Data Transformation and Cleaning](#data-transformation-and-cleaning)  \n8. [Transformed Data Storage (Processed Zone)](#transformed-data-storage-processed-zone)  \n9. [Data Modeling](#data-modeling)  \n10. [Data Serving / BI Layer](#data-serving--bi-layer)  \n11. [Monitoring and Logging](#monitoring-and-logging)  \n12. [Data Governance and Quality](#data-governance-and-quality)  \n13. 
[Summary Flow](#summary-flow)\n\n---\n\n## Data Engineering Process Overview\n\nThis document outlines the complete data engineering lifecycle, from raw data collection through processing, transformation, modeling, and governance.\n\n---\n\n## End-to-End Data Pipeline Architecture\n\nThe architecture below represents the entire flow of data — starting at ingestion from multiple sources and ending with analytics and governance.\n\n---\n\n## Data Sources\n\nRaw data is collected from multiple systems across various formats and technologies.\n\n| Source Type | Examples                             |\n|--------------|--------------------------------------|\n| Databases    | MySQL, PostgreSQL, MongoDB           |\n| APIs         | REST, GraphQL                        |\n| Files        | CSV, JSON, Parquet                   |\n| Other        | IoT Devices, Application Logs        |\n\n---\n\n## Data Ingestion\n\nData ingestion refers to collecting and moving data to a central location for further processing.\n\n| Mode       | Tools                                      |\n|-------------|--------------------------------------------|\n| Batch       | Apache Airflow, AWS Glue, Azure Data Factory |\n| Real-Time   | Kafka, Apache Flume, AWS Kinesis, NiFi      |\n\nPurpose: To bring data from multiple heterogeneous sources into a unified system efficiently.\n\n---\n\n## Data Storage (Raw / Landing Zone)\n\nUnprocessed data is stored in a raw storage layer, often referred to as a data lake.\n\n| Storage Type     | Examples                                 |\n|------------------|------------------------------------------|\n| Cloud Storage    | AWS S3, Azure Blob, Google Cloud Storage |\n| Distributed FS   | HDFS (Hadoop Distributed File System)    |\n\nAlso Known As: Data Lake\n\n---\n\n## Data Processing\n\nRaw data is processed to make it analyzable. 
This includes operations like filtering, merging, and aggregation.\n\n| Type      | Description             | Tools                                    |\n|------------|-------------------------|------------------------------------------|\n| Batch      | Periodic large-scale jobs | Apache Spark, PySpark, Hive, Presto     |\n| Streaming  | Continuous data flows     | Apache Flink, Google Dataflow           |\n\n---\n\n## Data Transformation and Cleaning\n\nData is cleaned, validated, and enriched to ensure consistency, accuracy, and usability.\n\n| Tools                          | Common Activities                            |\n|--------------------------------|----------------------------------------------|\n| Python (Pandas), SQL, DBT, PySpark | Remove null values, join datasets, validate schemas |\n\n---\n\n## Transformed Data Storage (Processed Zone)\n\nThis layer stores structured and validated data optimized for analysis or downstream querying.\n\n| Storage Type    | Tools                                         |\n|-----------------|-----------------------------------------------|\n| Data Warehouses | Amazon Redshift, Snowflake, BigQuery, Azure Synapse |\n\nPurpose: Enable fast querying for analytics and reporting needs.\n\n---\n\n## Data Modeling\n\nThe structured data is organized into schemas designed for analysis.\n\n| Schema Type   | Description                                  |\n|----------------|----------------------------------------------|\n| Star Schema    | Central fact table with supporting dimensions |\n| Snowflake      | Normalized, multi-level structure with additional joins |\n\nTools: SQL, DBT\n\n---\n\n## Data Serving / BI Layer\n\nThis layer exposes transformed data to end users through analytical tools and dashboards.\n\n| Tool        | Purpose                                    |\n|--------------|--------------------------------------------|\n| Apache Superset | Open-source dashboarding and visualization |\n| Tableau      | Interactive 
visual analytics               |\n| Power BI     | Business intelligence reporting            |\n| Looker       | Data exploration and modeling platform     |\n\nUsed By: Data analysts, business stakeholders, and executives.\n\n---\n\n## Monitoring and Logging\n\nMonitoring ensures that each component of the pipeline functions correctly and that failures are tracked.\n\n| Tool         | Use Case                                      |\n|---------------|-----------------------------------------------|\n| Airflow UI    | Monitor and debug pipeline workflows          |\n| Grafana       | Time-series visualization and alerting        |\n| AWS CloudWatch| Metrics, logs, and custom alerts              |\n\n---\n\n## Data Governance and Quality\n\nGovernance and quality controls help ensure data security, compliance, and reliability across the entire ecosystem.\n\n| Focus Area         | Tools / Techniques                                   |\n|--------------------|------------------------------------------------------|\n| Access Management  | Role-based access controls, audit trails             |\n| Data Quality       | Great Expectations, Monte Carlo, Soda Core           |\n| Sensitive Data     | AWS Macie, Dataplex, Schema Validation (HIPAA/GDPR)  |\n\n---\n\n## Summary Flow\n\n- Data Sources\n- Data Ingestion\n- Raw Storage (Data Lake)\n- Data Processing (Batch / Streaming)\n- Data Transformation \u0026 Cleaning\n- Processed Storage (Data Warehouse)\n- Data Modeling\n- BI / Dashboards\n- Monitoring \u0026 Logging\n- Data Governance \u0026 Quality\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpriyanshubiswas-tech%2Fdata-101","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpriyanshubiswas-tech%2Fdata-101","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpriyanshubiswas-tech%2Fdata-101/lists"}