How to build a complete Data Platform -> Here
https://github.com/tuancamtbtx/dataplatform-stack

airflow cdc data data-warehouse datalake dataplatform dbt flink k8s kafka spark-streaming


# Building a Complete Data Platform
A synthesis of knowledge about building a complete data platform.
### Clouds:

![Azure](https://img.shields.io/badge/azure-%230072C6.svg?style=for-the-badge&logo=microsoftazure&logoColor=white)
![AWS](https://img.shields.io/badge/AWS-%23FF9900.svg?style=for-the-badge&logo=amazon-aws&logoColor=white)
![Google Cloud](https://img.shields.io/badge/GoogleCloud-%234285F4.svg?style=for-the-badge&logo=google-cloud&logoColor=white)
### On-premises
![Apache Spark](https://img.shields.io/badge/Apache%20Spark-FDEE21?style=flat-square&logo=apachespark&logoColor=black)
![Apache Hadoop](https://img.shields.io/badge/Apache%20Hadoop-66CCFF?style=for-the-badge&logo=apachehadoop&logoColor=black)

## Main Stack
- Data Ingestion
- Data Processing & Transformation
- Data Governance & Data Catalogs
- Data Warehouse & Data Lake
- Data Analytics

![Data platform overview](./assets/dataplatform.gif)

## Tools for Big Data Engineers
### Workflow Scheduling
1. Airflow
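
A workflow scheduler such as Airflow models a pipeline as a directed acyclic graph (DAG) of tasks and runs each task only after its upstream tasks finish. The sketch below illustrates only that ordering idea with Python's standard library; it does not use Airflow's API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "load_to_lake": {"extract_orders", "extract_customers"},
    "transform": {"load_to_lake"},
    "publish_dashboard": {"transform"},
}


def execution_order(dag):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(pipeline)
print(order)
```

A real scheduler adds what this sketch leaves out: time-based triggers, retries, parallel execution, and run history.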

## Data Ingestion
### Batch Ingestion
1. SaaS tools: Fivetran, Hevo Data, ...
2. Open-source tools: Airbyte, Singer, StreamSets
3. Custom data ingestion built on orchestration engines, for example Python + Airflow, a Java application, or others
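
For the custom option, a common pattern is incremental batch ingestion: each run copies only the rows changed since the last run, tracked by a watermark. The following is a minimal sketch under stated assumptions, not a reference implementation: the `orders` table, its columns, and the in-memory SQLite databases standing in for the source and target systems are all hypothetical.

```python
import sqlite3


def ingest_batch(source, target, watermark):
    """Copy rows with updated_at > watermark into target; return the new watermark."""
    rows = source.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # INSERT OR REPLACE keeps the load idempotent if a batch is retried.
    target.executemany(
        "INSERT OR REPLACE INTO orders (id, amount, updated_at) VALUES (?, ?, ?)",
        rows,
    )
    target.commit()
    return rows[-1][2] if rows else watermark


# Hypothetical source and target systems, both in memory.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
    )
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01"), (2, 25.5, "2024-01-02")],
)
source.commit()

# First run loads everything; the second run picks up only the new row.
watermark = ingest_batch(source, target, "")
source.execute("INSERT INTO orders VALUES (3, 7.0, '2024-01-03')")
source.commit()
watermark = ingest_batch(source, target, watermark)
loaded = target.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

In an orchestrated setup, each call to `ingest_batch` would typically be one scheduled task run, with the watermark stored between runs.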
### Streaming Ingestion
1. Apache Spark
2. Apache Flink
### CDC (Change Data Capture)
1. Debezium
![cdc](./assets/cdc_debezium_server.gif)
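
Debezium emits one change event per row-level change, with an `op` code (`c` create, `u` update, `d` delete, `r` snapshot read) and `before`/`after` row images. The sketch below shows how a consumer might replay such events into a replica. It assumes the events are already deserialized to plain dicts without the surrounding schema envelope, and the `id` key and sample rows are hypothetical.

```python
def apply_change_event(table, event, key="id"):
    """Apply one Debezium-style change event to an in-memory replica table."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # Creates, snapshot reads, and updates all carry the new row in "after".
        row = event["after"]
        table[row[key]] = row
    elif op == "d":
        # Deletes carry the removed row in "before".
        table.pop(event["before"][key], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")
    return table


# Hypothetical events, reduced to the fields used above.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "email": "a@example.com"}},
    {"op": "c", "before": None, "after": {"id": 2, "email": "b@example.com"}},
    {
        "op": "u",
        "before": {"id": 1, "email": "a@example.com"},
        "after": {"id": 1, "email": "a2@example.com"},
    },
    {"op": "d", "before": {"id": 2, "email": "b@example.com"}, "after": None},
]

replica = {}
for event in events:
    apply_change_event(replica, event)
print(replica)
```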
## Data Transformation
### Batch
1. dbt (data build tool)
2. Apache Spark
3. Apache Flink
### Streaming
1. Apache Spark
2. Apache Flink
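
Both engines support windowed aggregations over event streams. The sketch below illustrates the idea of a tumbling (fixed-size, non-overlapping) window count in plain Python. It is not Spark or Flink code: it processes a finite list and ignores late data, watermarks, and state management, and the click events are hypothetical.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) for fixed, non-overlapping windows.

    events is an iterable of (timestamp_seconds, key) pairs.
    """
    counts = defaultdict(int)
    for ts, key in events:
        # Align the timestamp down to the start of its window.
        window_start = ts - (ts % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical click stream: (event time in seconds, page).
clicks = [(0, "home"), (12, "home"), (59, "cart"), (60, "home"), (61, "cart"), (119, "cart")]
result = tumbling_window_counts(clicks, 60)
print(result)
```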

## Data Warehouse & Lake
### Data Warehouse Storage
1. Hadoop
2. BigQuery
3. Redshift
4. Snowflake

### Data Lake Storage
1. Hadoop (on-premises)
2. Google Cloud Storage (GCP)
3. S3 (AWS)
## Data Governance
1. Apache Atlas
2. Microsoft Purview (Azure)
3. Data Catalog (GCP)
4. Unity Catalog
## Data Analysis

1. Metabase
2. Superset
3. Power BI
4. Data Looker
5. Tableau

## MLOps
1. Kubeflow
2. MinIO
## Contact Me
- 😀 LinkedIn: https://www.linkedin.com/tuanbacam
- 🌱 Email: [email protected]
- 🇻🇳 Country: Vietnam