https://github.com/lugolbis/data-immo
End-to-end ETL pipeline
data data-engineering dbt dremio duckdb etl-pipeline lakehouse rust
- Host: GitHub
- URL: https://github.com/lugolbis/data-immo
- Owner: LugolBis
- Created: 2025-08-16T15:18:01.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-08-30T15:05:31.000Z (10 months ago)
- Last Synced: 2025-08-30T17:22:06.314Z (10 months ago)
- Topics: data, data-engineering, dbt, dremio, duckdb, etl-pipeline, lakehouse, rust
- Language: Rust
- Homepage:
- Size: 216 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# data-immo
**data-immo** is a high-performance ETL pipeline built in **Rust** to efficiently extract, transform, and load real estate transaction data from the **DVF+ API** into a **Lakehouse (Dremio)**.
The project is designed with a focus on **performance, reliability, and scalability**, leveraging modern data engineering tools and practices.
## Schema of the pipeline
```mermaid
flowchart LR
A[API DVF+] -->|Extraction| B[Rust]
B -->|Transformation| C[Rust]
C -->|Loading| D[(DuckDB)]
D -->|Saving cleaned data| E{dbt}
E -->|Validation & loading| F[(Dremio)]
subgraph Extraction [Extract]
A
end
subgraph TransformationRust [Transform]
B
C
end
subgraph LT [Load & Transform]
D
end
subgraph VD [Validate Data quality]
E
end
subgraph LakeHouse [LakeHouse]
F
end
subgraph Docker [Docker Container]
LakeHouse
end
style A fill:#000091,stroke:#ffffff,color:#ffffff,stroke-width:1px
style B fill:#955a34,stroke:#000000,color:#000000,stroke-width:1px
style C fill:#955a34,stroke:#000000,color:#000000,stroke-width:1px
style D fill:#fef242,stroke:#000000,color:#000000,stroke-width:1px
style E fill:#fc7053,stroke:#000000,color:#000000,stroke-width:1px
style F fill:#31d3db,stroke:#ffffff,color:#ffffff,stroke-width:1px
style Docker fill:#099cec
```
## Features
- **Data Extraction**
  - Fetches real estate transaction data from the **DVF+ API**.
  - Handles API rate limiting, retries, and efficient pagination using Rust's concurrency model.
- **Data Transformation**
  - Uses **DuckDB** to transform optimized **Parquet** data into structured, queryable formats.
  - Uses Rust for additional transformations, data enrichment, and performance-critical operations such as I/O.
- **Data Validation & Loading**
  - Uses **dbt** to validate, test, and model the data.
  - Loads the cleaned and validated data into **Dremio**, enabling a Lakehouse architecture.
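
The retry and pagination behaviour described under Data Extraction could look roughly like the sketch below. It is not taken from the repository: `FetchError`, `Page`, `with_retry`, and `fetch_all` are hypothetical names, the HTTP call is abstracted behind a closure, and the loop is sequential for brevity, whereas the project itself mentions using Rust's concurrency model.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical error type for one API call (not the project's own type).
#[derive(Debug, Clone, PartialEq)]
pub enum FetchError {
    /// The server asked the client to slow down (for example HTTP 429).
    RateLimited,
    /// A transient failure worth retrying (timeout, 5xx).
    Transient(String),
    /// A failure that retrying cannot fix (bad request, bad payload).
    Fatal(String),
}

/// One page of results plus the number of the next page, if any.
pub struct Page<T> {
    pub items: Vec<T>,
    pub next: Option<u32>,
}

/// Calls `op` until it succeeds, fails fatally, or `max_attempts` is reached.
/// The delay doubles after each failed attempt, starting at `base_delay`.
/// `op` always runs at least once, even if `max_attempts` is 0.
pub fn with_retry<T>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut() -> Result<T, FetchError>,
) -> Result<T, FetchError> {
    let mut delay = base_delay;
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Fatal errors are returned immediately, without retrying.
            Err(FetchError::Fatal(msg)) => return Err(FetchError::Fatal(msg)),
            // Out of attempts: surface the last retryable error.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                sleep(delay);
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
        }
    }
}

/// Follows `next` page numbers from page 1 until no further page is reported,
/// retrying each page independently.
pub fn fetch_all<T>(
    max_attempts: u32,
    base_delay: Duration,
    mut fetch_page: impl FnMut(u32) -> Result<Page<T>, FetchError>,
) -> Result<Vec<T>, FetchError> {
    let mut items = Vec::new();
    let mut cursor = Some(1);
    while let Some(page_no) = cursor {
        let page = with_retry(max_attempts, base_delay, || fetch_page(page_no))?;
        items.extend(page.items);
        cursor = page.next;
    }
    Ok(items)
}
```

A caller would pass a closure that performs the real request for a given page number and maps HTTP status codes onto `FetchError`. Keeping the retry policy separate from the paging loop means each page gets its own attempt budget, so one slow page does not consume the retries of the next.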
## Tech Stack
- **Rust**: Core language for API calls, transformations, and performance optimization.
- **DuckDB**: In-process SQL engine for fast transformations of optimized Parquet datasets.
- **dbt**: Data modeling, testing, and validation layer.
- **Dremio**: Lakehouse platform for analytics and querying.
## Pipeline Overview
1. **Extract**: Retrieve raw transaction data from the DVF+ API.
2. **Stage**: Store raw data as Parquet.
3. **Transform**: Apply transformations using DuckDB and Rust.
4. **Validate & Model**: Use dbt to ensure data quality and prepare final schemas.
5. **Load**: Push validated datasets into Dremio for downstream analytics.
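
The five steps above run strictly in order, and a failed validation should prevent the load. A minimal sketch of that control flow follows, assuming each step can be expressed as a fallible call; `Step`, `StepFailure`, and `run_pipeline` are illustrative names and do not come from the repository.

```rust
/// The five steps of the overview, in execution order.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Step {
    Extract,
    Stage,
    Transform,
    ValidateAndModel,
    Load,
}

impl Step {
    /// All steps, listed in the order they must run.
    pub const ALL: [Step; 5] = [
        Step::Extract,
        Step::Stage,
        Step::Transform,
        Step::ValidateAndModel,
        Step::Load,
    ];
}

/// Records which step failed and why.
#[derive(Debug, PartialEq)]
pub struct StepFailure {
    pub step: Step,
    pub reason: String,
}

/// Runs every step in order and stops at the first failure,
/// so `Load` is never reached when validation fails.
pub fn run_pipeline(
    mut run_step: impl FnMut(Step) -> Result<(), String>,
) -> Result<(), StepFailure> {
    for step in Step::ALL {
        run_step(step).map_err(|reason| StepFailure { step, reason })?;
    }
    Ok(())
}
```

In this sketch the closure passed to `run_pipeline` would dispatch on the `Step` value, for example calling the extraction code for `Step::Extract` and invoking dbt for `Step::ValidateAndModel`. Returning the failing step together with the reason makes it clear in logs where a run stopped.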