{"id":30426207,"url":"https://github.com/tuanai-vireox/dataplatform-stack","last_synced_at":"2025-08-22T12:24:34.992Z","repository":{"id":194219544,"uuid":"686670726","full_name":"tuanai-vireox/dataplatform-stack","owner":"tuanai-vireox","description":"How to build a complete Data Platform -\u003e Here","archived":false,"fork":false,"pushed_at":"2024-07-04T04:12:51.000Z","size":7938,"stargazers_count":5,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-08-01T08:19:00.363Z","etag":null,"topics":["airflow","cdc","data","data-warehouse","datalake","dataplatform","dbt","flink","k8s","kafka","spark-streaming"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tuanai-vireox.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-09-03T15:11:26.000Z","updated_at":"2024-12-14T12:39:46.000Z","dependencies_parsed_at":"2023-09-22T11:36:31.489Z","dependency_job_id":"5c71dbc0-58f2-4807-89a3-5721370623ba","html_url":"https://github.com/tuanai-vireox/dataplatform-stack","commit_stats":null,"previous_names":["tuancamtbtx/dataplatform-stack","tuanai-vireox/dataplatform-stack"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/tuanai-vireox/dataplatform-stack","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tuanai-vireox%2Fdataplatform-stack","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tuanai-vireox%2Fdataplatform-stack/tags","releases_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tuanai-vireox%2Fdataplatform-stack/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tuanai-vireox%2Fdataplatform-stack/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tuanai-vireox","download_url":"https://codeload.github.com/tuanai-vireox/dataplatform-stack/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tuanai-vireox%2Fdataplatform-stack/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271636052,"owners_count":24794147,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-22T02:00:08.480Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","cdc","data","data-warehouse","datalake","dataplatform","dbt","flink","k8s","kafka","spark-streaming"],"created_at":"2025-08-22T12:24:31.085Z","updated_at":"2025-08-22T12:24:34.939Z","avatar_url":"https://github.com/tuanai-vireox.png","language":"Python","readme":"# Building a Complete Dataplatform\nSynthesize knowledge related to building a complete data platform system\n### 
Clouds:\n\n![Azure](https://img.shields.io/badge/azure-%230072C6.svg?style=for-the-badge\u0026logo=microsoftazure\u0026logoColor=white)\n![AWS](https://img.shields.io/badge/AWS-%23FF9900.svg?style=for-the-badge\u0026logo=amazon-aws\u0026logoColor=white)\n![Google Cloud](https://img.shields.io/badge/GoogleCloud-%234285F4.svg?style=for-the-badge\u0026logo=google-cloud\u0026logoColor=white)\n### On-premises\n![Apache Spark](https://img.shields.io/badge/Apache%20Spark-FDEE21?style=flat-square\u0026logo=apachespark\u0026logoColor=black)\n![Apache Hadoop](https://img.shields.io/badge/Apache%20Hadoop-66CCFF?style=for-the-badge\u0026logo=apachehadoop\u0026logoColor=black)\n\n## Main Stack\n- Data Ingestion\n- Data Processing \u0026 Transformation\n- Data Governance \u0026 Data Catalogs\n- Data Warehouse \u0026 Data Lake\n- Data Analytics\n\n![alt text](./assets/dataplatform.gif)\n\n## Tools for Big Data Engineers\n### Workflow Scheduling\n1. Airflow\n\n## Data Ingestion\n### Batch Ingestion\n1. SaaS tools: Fivetran, Hevo Data, and others\n2. Open-source tools: Airbyte, Singer, StreamSets\n3. Custom ingestion built on orchestration engines, e.g. Python + Airflow or a Java application\n### Streaming Ingestion\n1. Apache Spark\n2. Apache Flink\n### CDC (Change Data Capture)\n1. Debezium\n![cdc](./assets/cdc_debezium_server.gif)\n## Data Transformation\n### Batch\n1. dbt (data build tool)\n2. Apache Spark\n3. Apache Flink\n### Streaming\n1. Apache Spark\n2. Apache Flink\n\n## Data Warehouse \u0026 Lake\n### Data Warehouse Storage\n1. Hadoop\n2. BigQuery\n3. Redshift\n4. Snowflake\n\n### Data Lake Storage\n1. Hadoop (on-premises)\n2. Google Cloud Storage (GCP)\n3. S3 (AWS)\n## Data Governance\n1. Apache Atlas\n2. Microsoft Purview (Azure)\n3. Data Catalog (GCP)\n4. Unity Catalog\n## Data Analysis\n\n1. Metabase\n2. Superset\n3. Power BI\n4. Data Looker\n5. Tableau\n\n## MLOps\n1. Kubeflow\n2. 
MinIO\n## Contact Me\n- 😀 LinkedIn: https://www.linkedin.com/tuanbacam\n- 🌱 Email: nguyenvantuan140497@gmail.com\n- 🇻🇳 Country: Vietnam","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftuanai-vireox%2Fdataplatform-stack","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftuanai-vireox%2Fdataplatform-stack","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftuanai-vireox%2Fdataplatform-stack/lists"}