https://github.com/banickn/dagster-iceberg
- Host: GitHub
- URL: https://github.com/banickn/dagster-iceberg
- Owner: banickn
- Created: 2024-11-24T09:07:04.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-11-28T22:33:53.000Z (11 months ago)
- Last Synced: 2025-01-25T19:42:01.392Z (9 months ago)
- Topics: dagster, data-engineering
- Language: Python
- Homepage:
- Size: 104 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Dagster-Iceberg project
This project investigates how to set up a modern data stack with Dagster, Apache Iceberg, Azure, and either DuckDB or Daft.
## Get started
- Create a **.env** file in fab-data/ like this:
```
AZURE_CONNECTION_STRING = ""
AZURE_BRONZE_CONTAINER_NAME = ""
AZURE_SILVER_CONTAINER_NAME = ""
AZURE_GOLD_CONTAINER_NAME = ""
AZURE_STORAGE_ACCOUNT_NAME = ""
AZURE_STORAGE_ACCOUNT_KEY = ""
```
- Install the Python modules.
TODO.
- Start Dagster and run the **setup_silver** and **setup_gold** assets.
These create local SQLite Iceberg catalogs and the namespaces/tables in Azure.
- Run **fake_data.py** to create fake semiconductor manufacturing data.
If the sensor is activated in Dagster, these JSON files are loaded automatically into an Azure container as raw data.
- Run **write_silver_fabdata** and **write_gold_fabreport** to load the data into Iceberg tables and execute some basic aggregations for the gold layer.
- Run **streamlit run fab_report.py** to start a simple Streamlit report dashboard that uses the gold layer.
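
Before starting Dagster, it can help to confirm that every variable from the **.env** template above is actually filled in. The following is a minimal sketch using only the standard library; it is not part of the project, and the helper names `parse_env` and `missing_keys` are hypothetical. It assumes the simple `KEY = "value"` layout shown in the template.

```python
from __future__ import annotations

# Variable names taken from the .env template in this README.
REQUIRED_KEYS = [
    "AZURE_CONNECTION_STRING",
    "AZURE_BRONZE_CONTAINER_NAME",
    "AZURE_SILVER_CONTAINER_NAME",
    "AZURE_GOLD_CONTAINER_NAME",
    "AZURE_STORAGE_ACCOUNT_NAME",
    "AZURE_STORAGE_ACCOUNT_KEY",
]


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY = "value" lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty, in template order."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    from pathlib import Path

    env_file = Path("fab-data/.env")  # location stated in this README
    if env_file.exists():
        print("Missing or empty:", missing_keys(parse_env(env_file.read_text())))
    else:
        print(f"{env_file} not found")
```

An empty list means all six variables are set. The check does not verify that the credentials are valid.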
## Architecture
```mermaid
graph TD
subgraph Data Sources
Batch[Batch Sources]
Stream[Streaming Sources]
end
subgraph Orchestrator
direction TB
Dagster[Dagster]
end
subgraph Visualization
direction TB
Streamlit[Streamlit]
end
subgraph Data Lakehouse
direction LR
Bronze["**Bronze Layer**<br/>Raw data<br/>JSON"]
Silver["**Silver Layer**<br/>Cleaned, Augmented Data<br/>Apache Iceberg"]
Gold["**Gold Layer**<br/>Aggregates<br/>Apache Iceberg"]
end
Batch --> Bronze
Stream --> Bronze
Bronze --> Silver
Silver --> Gold
Streamlit --> Gold
Dagster --> Bronze
Dagster --> Silver
Dagster --> Gold
style Bronze fill:#CE8946,stroke:#333,stroke-width:2px
style Silver fill:#C0C0C0,stroke:#333,stroke-width:2px
style Gold fill:#FFD700,stroke:#333,stroke-width:2px
style Dagster fill:#5eb1ef,stroke:#333,stroke-width:2px
```
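
To make the bronze, silver, and gold flow in the diagram concrete, here is a small plain-Python sketch of the same three stages. It illustrates the layering idea only and is not the project's implementation: the real pipeline runs as Dagster assets writing to Iceberg tables, and the record fields (`wafer_id`, `tool`, `thickness_nm`) and the function names are assumptions made up for this example.

```python
import json
from collections import defaultdict
from statistics import mean

# Hypothetical bronze payloads: raw JSON strings, one per measurement.
BRONZE = [
    '{"wafer_id": "W1", "tool": "etch-01", "thickness_nm": 101.5}',
    '{"wafer_id": "W2", "tool": "etch-01", "thickness_nm": null}',
    '{"wafer_id": "W3", "tool": "etch-02", "thickness_nm": 98.0}',
    '{"wafer_id": "W4", "tool": "etch-01", "thickness_nm": 99.5}',
]


def to_silver(raw_rows):
    """Silver step: parse the JSON and drop rows with a missing measurement."""
    parsed = (json.loads(row) for row in raw_rows)
    return [row for row in parsed if row["thickness_nm"] is not None]


def to_gold(silver_rows):
    """Gold step: aggregate cleaned rows into one mean thickness per tool."""
    by_tool = defaultdict(list)
    for row in silver_rows:
        by_tool[row["tool"]].append(row["thickness_nm"])
    return {tool: mean(values) for tool, values in by_tool.items()}


silver = to_silver(BRONZE)
gold = to_gold(silver)
print(gold)
```

Each layer consumes only the output of the layer before it, so each stage can be rerun or tested in isolation.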
## Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.