Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/tuancamtbtx/dataplatform-stack
How to build a complete Data Platform -> Here
- Host: GitHub
- URL: https://github.com/tuancamtbtx/dataplatform-stack
- Owner: tuancamtbtx
- Created: 2023-09-03T15:11:26.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-07-04T04:12:51.000Z (6 months ago)
- Last Synced: 2024-11-09T02:37:59.423Z (about 2 months ago)
- Topics: airflow, cdc, data, data-warehouse, datalake, dataplatform, dbt, flink, k8s, kafka, spark-streaming
- Language: Python
- Homepage:
- Size: 7.57 MB
- Stars: 3
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# Building a Complete Data Platform
A synthesis of knowledge about building a complete data platform.
### Clouds
![Azure](https://img.shields.io/badge/azure-%230072C6.svg?style=for-the-badge&logo=microsoftazure&logoColor=white)
![AWS](https://img.shields.io/badge/AWS-%23FF9900.svg?style=for-the-badge&logo=amazon-aws&logoColor=white)
![Google Cloud](https://img.shields.io/badge/GoogleCloud-%234285F4.svg?style=for-the-badge&logo=google-cloud&logoColor=white)
### On-premises
![Apache Spark](https://img.shields.io/badge/Apache%20Spark-FDEE21?style=flat-square&logo=apachespark&logoColor=black)
![Apache Hadoop](https://img.shields.io/badge/Apache%20Hadoop-66CCFF?style=for-the-badge&logo=apachehadoop&logoColor=black)
## Main Stack
- Data Ingestion
- Data Processing & Transformation
- Data Governance & Data Catalogs
- Data Warehouse & Datalake
- Data Analytics

![alt text](./assets/dataplatform.gif)
## Tools for Big Data Engineer
### Workflow Schedule
1. Airflow
## Data Ingestion
### Batch Ingestion
1. SaaS tools: Fivetran, Hevo Data, etc.
2. Open source tools: Airbyte, Singer, StreamSets
3. Custom ingestion built on an orchestration engine: Python + Airflow, a Java application, etc.
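
As an illustration of the custom option above, here is a minimal sketch of an incremental batch load that an orchestrator such as Airflow could call on a schedule. It is not taken from this repository: the `events` table, its `updated_at` column, and the `ingest_batch` and `make_db` helpers are hypothetical, and in-memory SQLite stands in for the real source and target systems.

```python
import sqlite3


def make_db():
    """Create an in-memory database with the hypothetical `events` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT, updated_at INTEGER)"
    )
    return conn


def ingest_batch(source, target, last_watermark):
    """Copy rows changed after `last_watermark` and return the new watermark."""
    rows = source.execute(
        "SELECT id, payload, updated_at FROM events "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # INSERT OR REPLACE keeps the load idempotent if a batch is retried.
    target.executemany(
        "INSERT OR REPLACE INTO events (id, payload, updated_at) VALUES (?, ?, ?)",
        rows,
    )
    target.commit()
    # If nothing changed, keep the previous watermark.
    return rows[-1][2] if rows else last_watermark
```

Each scheduled run would pass in the watermark returned by the previous run, so only new or updated rows are copied.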
### Streaming Ingestion
1. Apache Spark
2. Apache Flink
### CDC (Change Data Capture)
1. Debezium
![cdc](./assets/cdc_debezium_server.gif)
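
To make the CDC flow more concrete, the sketch below applies change events to an in-memory replica. It assumes events shaped like the Debezium envelope, with an `op` code (`c` create, `r` snapshot read, `u` update, `d` delete) and `before`/`after` row images. The `id` primary key and the `apply_change_event` helper are hypothetical, and a plain dict stands in for the target table.

```python
def apply_change_event(table, event):
    """Apply one change event to `table`, a dict keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # Creates, snapshot reads, and updates all carry the new row in `after`.
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":
        # Deletes carry the removed row in `before`.
        table.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")
    return table


replica = {}
events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "alice"}},
    {"op": "u", "before": {"id": 1, "name": "alice"}, "after": {"id": 1, "name": "bob"}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "carol"}},
    {"op": "d", "before": {"id": 2, "name": "carol"}, "after": None},
]
for event in events:
    apply_change_event(replica, event)
```

In a real pipeline the events would typically be consumed from a Kafka topic and the replica would be a warehouse or lake table rather than a dict.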
## Data Transformation
### Batch
1. dbt (data build tool)
2. Apache Spark
3. Apache Flink
### Streaming
1. Apache Spark
2. Apache Flink
## Data Warehouse & Lake
### Data Warehouse Storage
1. Hadoop
2. BigQuery
3. Redshift
4. Snowflake
### Data Lake Storage
1. Hadoop (on-premises)
2. Google Cloud Storage (GCP)
3. S3 (AWS)
## Data Governance
1. Apache Atlas
2. Microsoft Purview (Azure)
3. Data Catalog (GCP)
4. Unity Catalog
## Data Analysis
1. Metabase
2. Superset
3. PowerBI
4. Data Looker
5. Tableau
## MLOps
1. Kubeflow
2. Minio
## Contact Me
- 😀 LinkedIn: https://www.linkedin.com/tuanbacam
- 🌱 Email: [email protected]
- 🇻🇳 Country: Vietnam