{"id":20882424,"url":"https://github.com/1ambda/lakehouse","last_synced_at":"2025-10-25T04:21:02.280Z","repository":{"id":190988852,"uuid":"683721965","full_name":"1ambda/lakehouse","owner":"1ambda","description":"Playground for Lakehouse (Iceberg, Hudi, Spark, Flink, Trino, DBT, Airflow, Kafka, Debezium CDC)","archived":false,"fork":false,"pushed_at":"2023-09-23T02:04:12.000Z","size":3443,"stargazers_count":53,"open_issues_count":1,"forks_count":9,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-01T09:03:49.753Z","etag":null,"topics":["airflow","cdc","dbt","debezium","docker","flink","hudi","iceberg","kafka","spark","trino"],"latest_commit_sha":null,"homepage":"","language":"Kotlin","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/1ambda.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-27T13:53:34.000Z","updated_at":"2025-03-24T02:31:15.000Z","dependencies_parsed_at":"2024-11-24T23:04:46.466Z","dependency_job_id":null,"html_url":"https://github.com/1ambda/lakehouse","commit_stats":null,"previous_names":["1ambda/lakehouse-playground"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1ambda%2Flakehouse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1ambda%2Flakehouse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1ambda%2Flakehouse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1ambda%2Flakehouse/manifests","owner_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/owners/1ambda","download_url":"https://codeload.github.com/1ambda/lakehouse/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253797924,"owners_count":21965980,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","cdc","dbt","debezium","docker","flink","hudi","iceberg","kafka","spark","trino"],"created_at":"2024-11-18T07:31:39.456Z","updated_at":"2025-10-25T04:20:57.262Z","avatar_url":"https://github.com/1ambda.png","language":"Kotlin","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Lakehouse Playground\n\n![check](https://github.com/1ambda/lakehouse/actions/workflows/check.yml/badge.svg)\n\nSupported Data Pipeline Components\n\n| Pipeline Component                     | Version | Description              |\n|----------------------------------------|---------|--------------------------|\n| [Trino](https://trino.io/)             | 425+    | Query Engine             |\n| [DBT](https://www.getdbt.com/)         | 1.5+    | Analytics Framework      |\n| [Spark](https://spark.apache.org/)     | 3.3+    | Computing Engine         |\n| [Flink](https://flink.apache.org/)     | 1.16+   | Computing Engine         |\n| [Iceberg](https://iceberg.apache.org/) | 1.3.1+  | Table Format (Lakehouse) |\n| [Hudi](https://hudi.apache.org/)       | 0.13.1+ | Table Format (Lakehouse) |\n| [Airflow](https://airflow.apache.org/) | 2.7+    | Scheduler                |\n| [Jupyterlab](https://jupyter.org/)     | 3+      | Notebook                 |\n| 
[Kafka](https://kafka.apache.org/)     | 3.4+    | Messaging Broker         |\n| [Debezium](https://debezium.io/)       | 2.3+    | CDC Connector            |\n\n\u003cbr/\u003e\n\n## Getting Started\n\nStart the compose containers first.\n\n```bash\n# Use `COMPOSE_PROFILES` to select the profile\nCOMPOSE_PROFILES=trino docker-compose up;\nCOMPOSE_PROFILES=spark docker-compose up;\nCOMPOSE_PROFILES=flink docker-compose up;\nCOMPOSE_PROFILES=airflow docker-compose up;\n\n# Combine multiple profiles\nCOMPOSE_PROFILES=trino,spark docker-compose up;\n\n# For the CDC environment (Kafka, ZK, Debezium)\nmake compose.clean compose.cdc\n\n# For the stream environment (Kafka, ZK, Debezium + Flink)\nmake compose.clean compose.stream\n```\n\nThen access the lakehouse services.\n\n- Kafka UI: http://localhost:8088\n- Kafka Connect UI: http://localhost:8089\n- Trino: http://localhost:8889\n- Airflow (`airflow` / `airflow`): http://localhost:8080\n- Local S3 Minio (`minio` / `minio123`): http://localhost:9000\n- Flink Job Manager UI (Docker): http://localhost:8082\n- Flink Job Manager UI (LocalApplication): http://localhost:8081\n- PySpark Jupyter Notebook (Iceberg): http://localhost:8900\n- PySpark Jupyter Notebook (Hudi): http://localhost:8901\n- Spark SQL (Iceberg): `docker exec -it spark-iceberg spark-sql`\n- Spark SQL (Hudi): `docker exec -it spark-hudi spark-sql`\n- Flink SQL (Iceberg): `docker exec -it flink-jobmanager flink-sql-iceberg`\n- Flink SQL (Hudi): `docker exec -it flink-jobmanager flink-sql-hudi`\n\n\u003cbr/\u003e\n\n## CDC Starter kit\n\n```bash\n# Run CDC-related containers\nmake compose.cdc;\n\n# Register the Debezium MySQL connector using Avro Schema Registry\nmake debezium.register.customers;\n\n# Register the Debezium MySQL connector using JSON format\nmake debezium.register.products;\n```\n\n### Running Flink Applications\n\nFlink supports Java 11, but this project uses Java 8 because of the SQL (Hive) dependency.\nThe Flink SQL Application within this project is written in Kotlin for 
SQL Readability.\n\n\nYou can run it in IDEA with an `Application` run configuration (not a `Kotlin` one).\nThe dependencies that the Flink Application requires are already included in the production Docker image or EMR cluster.\n\nThey are therefore declared with `provided` scope in the Maven project. To run the application locally,\nenable the `Add dependencies with \"provided\" scope to classpath` IDEA option, as shown in the screenshot below.\n\nAfter starting the local Flink Application, you can access the Flink Job Manager UI at http://localhost:8081.\n\n![idea](./docs/images/idea.png)\n\n\n## DBT Starter kit\n\n```bash\n# Run trino-related containers\nmake compose.dbt;\n\n# Prepare Iceberg schemas\nmake trino-cli;\n$ create schema iceberg.staging WITH ( LOCATION = 's3://datalake/staging' );\n$ create schema iceberg.mart WITH ( LOCATION = 's3://datalake/mart' );\n\n# Execute dbt commands locally\ncd dbts;\ndbt deps;\ndbt run;\ndbt test;\ndbt docs generate \u0026\u0026 dbt docs serve --port 8070; # http://localhost:8070\n\n# Select dbt-created tables from trino-cli\nmake trino-cli;\n$ SELECT * FROM iceberg.mart.aggr_location LIMIT 10;\n$ SELECT * FROM iceberg.staging.int_location LIMIT 10;\n$ SELECT * FROM iceberg.staging.stg_nations LIMIT 10;\n$ SELECT * FROM iceberg.staging.stg_regions LIMIT 10;\n\n# Execute airflow dags for dbt\nmake airflow.shell;\nairflow dags backfill dag_dbt --local --reset-dagruns -s 2022-09-02 -e 2022-09-03;\n```\n\n## Screenshots\n\n### Flink Job Manager UI\n![flink](./docs/images/flink.png)\n\n### Kafka UI\n![kafka](./docs/images/kafka.png)\n\n### Minio UI\n![minio](./docs/images/minio.png)\n\n### Running Local Flink Application in IDEA\n![application](./docs/images/application.png)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F1ambda%2Flakehouse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F1ambda%2Flakehouse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F1ambda%2Flakehouse/lists"}