https://github.com/cmpadden/dagster-databricks-components-demo
- Host: GitHub
- URL: https://github.com/cmpadden/dagster-databricks-components-demo
- Owner: cmpadden
- Created: 2025-07-14T20:21:48.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-07-22T21:14:54.000Z (11 months ago)
- Last Synced: 2025-07-22T21:24:23.631Z (11 months ago)
- Language: Python
- Size: 159 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md

# Dagster Databricks Components Demo
This project demonstrates how to use Dagster Components to interface with Databricks and create a unified view of your data platform. It showcases how components make it easy to orchestrate Databricks jobs while maintaining full visibility and lineage tracking within Dagster's single pane of glass.
## Overview
The demo includes:
- **Custom Databricks Job Component**: A reusable component that wraps Databricks jobs as Dagster assets
- **Asset Specifications**: Declarative asset definitions with proper lineage and metadata
- **Cross-Platform Integration**: Seamless connection between Dagster orchestration and Databricks execution
- **Unified Monitoring**: View all your data assets and their dependencies in one place
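To make the lineage idea concrete, here is a minimal, standard-library sketch of how declarative asset specifications imply a dependency graph. It is not Dagster's `AssetSpec` API: the `AssetSpecSketch` class and `upstream_keys` helper are hypothetical names invented for illustration, and `raw_accounts` is an assumed upstream asset that does not appear in this demo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssetSpecSketch:
    """Hypothetical stand-in for an asset specification (not Dagster's AssetSpec)."""

    key: str
    deps: tuple = ()
    owners: tuple = ()
    kinds: tuple = ()


def upstream_keys(specs, key):
    """Return every asset key that `key` depends on, directly or transitively."""
    by_key = {spec.key: spec for spec in specs}
    seen = set()
    stack = list(by_key[key].deps) if key in by_key else []
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        # Keys without a spec of their own are treated as external sources.
        if current in by_key:
            stack.extend(by_key[current].deps)
    return seen


specs = [
    AssetSpecSketch(
        key="account_performance",
        deps=("prepared_accounts", "prepared_customers"),
        owners=("alice@acme.com",),
        kinds=("parquet",),
    ),
    # Assumed for illustration only: an extra hop upstream of prepared_accounts.
    AssetSpecSketch(key="prepared_accounts", deps=("raw_accounts",)),
]

lineage = upstream_keys(specs, "account_performance")
print(sorted(lineage))
```

In the real project, Dagster derives this graph itself from the `deps` declared on each asset; the sketch only shows why declaring dependencies is enough to recover full lineage.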
## Architecture
```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Dagster UI    │    │    Databricks    │    │   Data Assets   │
│                 │    │    Workspace     │    │                 │
│ • Asset Lineage │◄──►│ • Job Execution  │◄──►│ • S3 Buckets    │
│ • Monitoring    │    │ • Compute        │    │ • Tables        │
│ • Scheduling    │    │ • Notebooks      │    │ • Reports       │
└─────────────────┘    └──────────────────┘    └─────────────────┘
```
## Features
- **Databricks Job Integration**: Execute Databricks jobs directly from Dagster with full parameter passing
- **Asset Lineage**: Track data dependencies across your entire pipeline
- **Metadata Enrichment**: Automatically capture job run information, timing, and parameters
- **Environment Configuration**: Secure credential management using environment variables
- **Declarative Components**: Define your data pipelines using YAML configuration
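As a rough illustration of what "executing a Databricks job with parameter passing" involves, the sketch below builds (but does not send) a request to the Databricks Jobs REST API `run-now` endpoint using only the standard library. This is an assumption about the underlying mechanics, not the demo's implementation: the component in this repository may well use the Databricks SDK instead, and `build_run_now_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_run_now_request(host, token, job_id, job_parameters):
    """Build (but do not send) a Jobs API 2.1 run-now request."""
    url = host.rstrip("/") + "/api/2.1/jobs/run-now"
    payload = {"job_id": job_id, "job_parameters": job_parameters}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values copied from this README; replace them with your own.
request = build_run_now_request(
    host="https://your-workspace.cloud.databricks.com",
    token="your-databricks-token",
    job_id=1000180891217799,
    job_parameters={
        "source_file_prefix": "s3://acme-analytics/raw",
        "destination_file_prefix": "s3://acme-analytics/reports",
    },
)
print(request.get_method(), request.full_url)
```

Actually triggering the job would mean passing `request` to `urllib.request.urlopen`, which is deliberately left out so the example runs without a workspace.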
## Project Structure
```
dagster-databricks-components-demo/
├── src/
│   └── dagster_databricks_components_demo/
│       ├── components/
│       │   └── databricks_job_component.py  # Custom Databricks component
│       ├── defs/
│       │   └── databricks_job/
│       │       └── defs.yaml                # Component configuration
│       └── definitions.py                   # Main Dagster definitions
├── pyproject.toml                           # Project dependencies
└── README.md                                # This file
```
## Quick Start
### Prerequisites
- Python 3.9-3.13.3
- uv package manager
- Databricks workspace access
- Databricks job ID and credentials
### Installation
1. **Clone and navigate to the project:**
```bash
git clone https://github.com/cmpadden/dagster-databricks-components-demo
cd dagster-databricks-components-demo
```
2. **Create and activate a virtual environment:**
```bash
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
```
3. **Install dependencies:**
```bash
uv sync
```
### Configuration
4. **Set up environment variables:**
```bash
export DATABRICKS_HOST="https://your-workspace.cloud.databricks.com"
export DATABRICKS_TOKEN="your-databricks-token"
```
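Before starting the dev server, it can help to confirm that both variables are actually set in the current shell. The following is a small optional sketch, not part of the demo; `missing_databricks_vars` is a hypothetical helper name.

```python
import os

REQUIRED_VARS = ("DATABRICKS_HOST", "DATABRICKS_TOKEN")


def missing_databricks_vars(environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Check the real environment of the current process.
missing = missing_databricks_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("Databricks environment variables are set.")

# The function also accepts an explicit mapping, which is handy in tests.
print(missing_databricks_vars({"DATABRICKS_HOST": "https://your-workspace.cloud.databricks.com"}))
```

Empty strings are treated as missing, since an exported but blank token would fail authentication anyway.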
### Running the Demo
5. **Start the Dagster development server:**
```bash
dg dev
```
6. **Access the Dagster UI:**
Open your browser to `http://localhost:3000` to explore the asset lineage, trigger materializations, and monitor your Databricks jobs.
## Component Configuration
The demo uses a YAML-based configuration in `src/dagster_databricks_components_demo/defs/databricks_job/defs.yaml`:
```yaml
type: dagster_databricks_components_demo.components.databricks_job_component.DatabricksJobComponent
attributes:
  job_id: 1000180891217799  # Your Databricks job ID
  job_parameters:
    source_file_prefix: "s3://acme-analytics/raw"
    destination_file_prefix: "s3://acme-analytics/reports"
  workspace_config:
    host: "{{ env.DATABRICKS_HOST }}"
    token: "{{ env.DATABRICKS_TOKEN }}"
  assets:
    - key: account_performance
      owners: ["alice@acme.com"]
      deps: [prepared_accounts, prepared_customers]
      kinds: [parquet]
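The `{{ env.DATABRICKS_HOST }}` and `{{ env.DATABRICKS_TOKEN }}` placeholders keep credentials out of the YAML file; they are filled in from the environment when the component is loaded. The sketch below imitates that substitution with a regular expression so the behavior is easy to see. It is a simplified stand-in, not Dagster's actual template resolver, and `resolve_env_templates` is a hypothetical name.

```python
import re

# Matches placeholders such as "{{ env.DATABRICKS_HOST }}".
_ENV_PATTERN = re.compile(r"\{\{\s*env\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def resolve_env_templates(value, environ):
    """Replace each {{ env.NAME }} placeholder in `value` with environ[NAME]."""

    def substitute(match):
        name = match.group(1)
        if name not in environ:
            raise KeyError(f"environment variable {name} is not set")
        return environ[name]

    return _ENV_PATTERN.sub(substitute, value)


workspace_config = {
    "host": "{{ env.DATABRICKS_HOST }}",
    "token": "{{ env.DATABRICKS_TOKEN }}",
}

# A plain dict stands in for os.environ so the example is reproducible.
example_env = {
    "DATABRICKS_HOST": "https://your-workspace.cloud.databricks.com",
    "DATABRICKS_TOKEN": "your-databricks-token",
}

resolved = {
    field: resolve_env_templates(template, example_env)
    for field, template in workspace_config.items()
}
print(resolved["host"])
```

Raising on an unset variable, rather than substituting an empty string, surfaces a missing credential at load time instead of as an authentication failure later.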
## Key Benefits
- **Unified Orchestration**: Manage both Dagster and Databricks workloads from a single interface
- **Complete Lineage**: Track data flow from raw sources through Databricks transformations to final outputs
- **Operational Excellence**: Monitor job health, performance, and data quality in one place
- **Developer Experience**: Write infrastructure as code with type-safe, declarative components
- **Scalability**: Leverage Databricks' compute power while maintaining Dagster's orchestration capabilities
## Next Steps
- Customize the `job_id` and `job_parameters` in the YAML configuration for your Databricks jobs
- Add additional asset specifications to match your data pipeline
- Explore scheduling and sensor capabilities for automated pipeline execution
- Integrate with your existing CI/CD workflows
For more information about Dagster Components, visit the [official documentation](https://docs.dagster.io/).