# Databricks Lakeflow Jobs with StackQL-Deploy

A complete end-to-end demonstration of deploying and managing **Databricks Lakeflow jobs** using **StackQL-Deploy** for infrastructure provisioning and **Databricks Asset Bundles (DABs)** for data pipeline management.

[![Databricks Asset Bundle CI/CD](https://github.com/stackql/databricks-lakeflow-jobs-example/actions/workflows/databricks-dab.yml/badge.svg)](https://github.com/stackql/databricks-lakeflow-jobs-example/actions/workflows/databricks-dab.yml)

## 🎯 Project Overview

This repository demonstrates modern DataOps practices by combining:

- **🏗️ Infrastructure as Code**: Using [StackQL](https://stackql.io) and [stackql-deploy](https://stackql-deploy.io) for SQL-based infrastructure management
- **📊 Data Pipeline Management**: Using [Databricks Asset Bundles](https://docs.databricks.com/en/dev-tools/bundles/index.html) for job orchestration and deployment
- **🚀 GitOps CI/CD**: Automated infrastructure provisioning and data pipeline deployment via GitHub Actions

### What This Project Does

1. **Provisions Databricks Infrastructure** using StackQL-Deploy:
   - AWS IAM roles and cross-account permissions
   - S3 buckets for workspace storage
   - Databricks workspace with Unity Catalog
   - Storage credentials and external locations

2. **Deploys a Retail Data Pipeline** using Databricks Asset Bundles:
   - Multi-stage data processing (Bronze → Silver → Gold)
   - Parallel task execution with dependency management
   - State-based conditional processing
   - For-each loops for parallel state processing

3. **Automates Everything** with GitHub Actions:
   - Infrastructure provisioning on push to main
   - DAB validation and deployment
   - Multi-environment support (dev/prod)

## 🏛️ Architecture

```mermaid
graph TB
    subgraph "GitHub Repository"
        A[infrastructure/] --> B[StackQL-Deploy]
        C[retail-job/] --> D[Databricks Asset Bundle]
    end

    subgraph "AWS Cloud"
        B --> E[IAM Roles]
        B --> F[S3 Buckets]
        B --> G[VPC/Security Groups]
    end

    subgraph "Databricks Platform"
        B --> H[Workspace]
        D --> I[Lakeflow Jobs]
        H --> I
        I --> J[Bronze Tables]
        I --> K[Silver Tables]
        I --> L[Gold Tables]
    end

    subgraph "CI/CD Pipeline"
        M[GitHub Actions] --> B
        M --> D
        M --> N[Multi-Environment Deployment]
    end
```

## 📁 Repository Structure

```
databricks-lakeflow-jobs-example/
├── infrastructure/                  # StackQL infrastructure templates
│   ├── README.md                    # Infrastructure setup guide
│   ├── stackql_manifest.yml         # StackQL deployment configuration
│   └── resources/                   # Cloud resource templates
│       ├── aws/                     # AWS resources (IAM, S3)
│       ├── databricks_account/      # Account-level Databricks resources
│       └── databricks_workspace/    # Workspace configurations
├── retail-job/                      # Databricks Asset Bundle
│   ├── databricks.yml               # DAB configuration
│   └── Task Files/                  # Data pipeline notebooks
│       ├── 01_data_ingestion/       # Bronze layer data ingestion
│       ├── 02_data_loading/         # Customer data loading
│       ├── 03_data_processing/      # Silver layer transformations
│       ├── 04_data_transformation/  # Gold layer clean data
│       └── 05_state_processing/     # State-specific processing
└── .github/workflows/               # CI/CD automation
    └── databricks-dab.yml           # GitHub Actions workflow
```

## 🚀 Quick Start

### Prerequisites

- AWS account with administrative permissions
- Databricks account (see [infrastructure setup guide](./infrastructure/README.md))
- Python 3.8+ and Git

### 1. Clone Repository

```bash
git clone https://github.com/stackql/databricks-lakeflow-jobs-example.git
cd databricks-lakeflow-jobs-example
```

### 2. Set Up Infrastructure

Follow the comprehensive [Infrastructure Setup Guide](./infrastructure/README.md) to:
- Configure AWS and Databricks accounts
- Set up service principals and permissions
- Deploy infrastructure using StackQL-Deploy
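
For orientation, infrastructure in this project is declared in `infrastructure/stackql_manifest.yml` and provisioned by stackql-deploy. The fragment below is a simplified, illustrative sketch of what a manifest resource entry can look like; the resource and property names shown here are assumptions, not the actual contents of the manifest.

```yaml
# Illustrative sketch only -- see infrastructure/stackql_manifest.yml for the real manifest
version: 1
name: databricks-lakeflow-demo
providers:
  - aws
  - databricks_account
globals:
  - name: region
    value: "{{ AWS_REGION }}"
resources:
  - name: workspace_root_bucket        # hypothetical resource name
    props:
      - name: bucket_name
        value: "lakeflow-{{ stack_env }}-root"
    exports:
      - bucket_name
```

Each named resource maps to a query template under `resources/` that stackql-deploy uses to create, verify, and (on teardown) delete that resource.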

### 3. Deploy Data Pipeline

Once infrastructure is provisioned:

```bash
cd retail-job

# Validate the bundle
databricks bundle validate --target dev

# Deploy the data pipeline
databricks bundle deploy --target dev

# Run the complete pipeline
databricks bundle run retail_data_processing_job --target dev
```

## 📊 Data Pipeline Deep Dive

The retail data pipeline demonstrates a complete **medallion architecture** (Bronze → Silver → Gold):

### Pipeline Stages

1. **🥉 Bronze Layer - Data Ingestion**
   - **Orders Ingestion**: Loads raw sales orders data
   - **Sales Ingestion**: Loads raw sales transaction data
   - Tables: `orders_bronze`, `sales_bronze`

2. **🥈 Silver Layer - Data Processing**
   - **Customer Loading**: Loads customer master data
   - **Data Joining**: Joins customers with sales and orders
   - **Duplicate Removal**: Conditional deduplication based on data quality
   - Tables: `customers_bronze`, `customer_sales_silver`, `customer_orders_silver`

3. **🥇 Gold Layer - Data Transformation**
   - **Clean & Transform**: Business-ready, curated datasets
   - **State Processing**: Parallel processing for each US state using for-each loops
   - Tables: `retail_gold`, `state_summary_gold`
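
In `retail-job/databricks.yml`, this layering maps onto task dependencies within the job. The fragment below is a hedged sketch of that shape only; the task keys and notebook paths are illustrative, not the exact definitions in the bundle.

```yaml
# Simplified sketch of the medallion task chain (task keys and paths are illustrative)
resources:
  jobs:
    retail_data_processing_job:
      name: retail_data_processing_job
      tasks:
        - task_key: ingest_orders_bronze
          notebook_task:
            notebook_path: ./Task Files/01_data_ingestion/orders_ingestion
        - task_key: build_customer_sales_silver
          depends_on:
            - task_key: ingest_orders_bronze
          notebook_task:
            notebook_path: ./Task Files/03_data_processing/join_customer_sales
        - task_key: build_retail_gold
          depends_on:
            - task_key: build_customer_sales_silver
          notebook_task:
            notebook_path: ./Task Files/04_data_transformation/clean_and_transform
```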

### Advanced DAB Features Demonstrated

- **🔄 Parallel Execution**: Multiple tasks run concurrently where dependencies allow
- **🎯 Conditional Tasks**: Deduplication only runs if duplicates are detected
- **🔁 For-Each Loops**: State processing runs in parallel for multiple states
- **📧 Notifications**: Email alerts on job success/failure
- **⏱️ Timeouts & Limits**: Job execution controls and concurrent run limits
- **🎛️ Parameters**: Dynamic state-based processing with base parameters
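
To make the conditional and for-each patterns concrete, here is a hedged sketch of how such tasks can be expressed in a Databricks job definition; the task keys, value references, and notebook paths below are illustrative rather than the bundle's actual configuration.

```yaml
# Illustrative sketch of a condition task and a for-each task (names and values are assumptions)
tasks:
  - task_key: check_duplicates
    notebook_task:
      notebook_path: ./Task Files/03_data_processing/check_duplicates
  - task_key: duplicates_found
    depends_on:
      - task_key: check_duplicates
    condition_task:
      op: EQUAL_TO
      left: "{{tasks.check_duplicates.values.has_duplicates}}"
      right: "true"
  - task_key: process_states
    for_each_task:
      inputs: '["CA", "NY", "TX", "WA"]'
      concurrency: 4
      task:
        task_key: process_states_iteration
        notebook_task:
          notebook_path: ./Task Files/05_state_processing/process_state
          base_parameters:
            state: "{{input}}"
```

A downstream deduplication task can then depend on `duplicates_found` with an `outcome` of `"true"`, so it only runs when duplicates are detected.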

## 🔄 CI/CD Pipeline

The GitHub Actions workflow ([`.github/workflows/databricks-dab.yml`](./.github/workflows/databricks-dab.yml)) provides complete automation:

### Workflow Triggers

- **Pull Requests**: Validates changes against dev environment
- **Main Branch Push**: Deploys to production environment
- **Path-Based**: Only triggers on infrastructure or job configuration changes
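
A hedged sketch of what this trigger configuration typically looks like (the path filters shown are illustrative; the workflow file linked above is authoritative):

```yaml
# Illustrative trigger block -- see .github/workflows/databricks-dab.yml for the real configuration
on:
  pull_request:
    branches: [main]
    paths:
      - 'infrastructure/**'
      - 'retail-job/**'
  push:
    branches: [main]
    paths:
      - 'infrastructure/**'
      - 'retail-job/**'
```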

### Deployment Steps

1. **🏗️ Infrastructure Provisioning**

   ```yaml
   - name: Deploy Infrastructure with StackQL
     uses: stackql/stackql-deploy-action@v1.0.2
     with:
       command: 'build'
       stack_dir: 'infrastructure'
       stack_env: ${{ env.ENVIRONMENT }}
   ```

2. **📊 Workspace Configuration**
   - Extracts workspace details from StackQL deployment
   - Configures Databricks CLI with workspace credentials
   - Sets up environment-specific configurations
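
   One way such a step might look, as a sketch only (the step id and output name are hypothetical; `DATABRICKS_HOST` and `DATABRICKS_TOKEN` are standard environment variables read by the Databricks CLI):

   ```yaml
   # Hypothetical configuration step -- the output name below is an assumption,
   # not the actual output of the stackql-deploy action
   - name: Configure Databricks CLI
     env:
       DATABRICKS_HOST: ${{ steps.stackql_deploy.outputs.workspace_url }}
       DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
     run: databricks current-user me  # smoke-test the workspace connection
   ```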

3. **✅ DAB Validation & Deployment**

   ```yaml
   - name: Validate Databricks Asset Bundle
     run: databricks bundle validate --target ${{ env.ENVIRONMENT }}

   - name: Deploy Databricks Jobs
     run: databricks bundle deploy --target ${{ env.ENVIRONMENT }}
   ```

4. **🧪 Pipeline Testing**
   - Runs the complete data pipeline
   - Validates job execution and data quality
   - Reports results and generates summaries

### Environment Management

The workflow supports multiple environments with automatic detection:
- **Dev Environment**: For pull requests and feature development
- **Production Environment**: For main branch deployments

Environment-specific configurations are managed through:
- StackQL environment variables and stack environments
- Databricks Asset Bundle targets (`dev`, `prd`)
- GitHub repository secrets for credentials
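
On the DAB side, here is a hedged sketch of how the `dev` and `prd` targets can be declared in `databricks.yml` (the workspace hosts are placeholders, not this project's workspaces):

```yaml
# Illustrative target definitions (workspace hosts are placeholders)
targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://dbc-00000000-dev.cloud.databricks.com
  prd:
    mode: production
    workspace:
      host: https://dbc-00000000-prd.cloud.databricks.com
```

In `development` mode, deployed resources are prefixed per user and job schedules are paused by default, which keeps dev iterations isolated from production runs.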

## 🛠️ Key Technologies

### StackQL & stackql-deploy
- **SQL-based Infrastructure**: Manage cloud resources using familiar SQL syntax
- **State-free Operations**: No state files - query infrastructure directly from APIs
- **Multi-cloud Support**: Consistent interface across AWS, Azure, GCP, and SaaS providers
- **GitOps Ready**: Native CI/CD integration with GitHub Actions

### Databricks Asset Bundles
- **Environment Consistency**: Deploy the same code across dev/staging/prod
- **Version Control**: Infrastructure and code in sync with Git workflows
- **Advanced Orchestration**: Complex dependencies, conditions, and parallel execution
- **Resource Management**: Automated cluster provisioning and job scheduling

### Modern DataOps Practices
- **Infrastructure as Code**: Everything versioned and reproducible
- **GitOps Workflows**: Pull request-based infrastructure changes
- **Environment Parity**: Identical configurations across environments
- **Automated Testing**: Pipeline validation and data quality checks

## 📚 Learn More

- **[Infrastructure Setup Guide](./infrastructure/README.md)**: Complete StackQL-Deploy setup and usage
- **[StackQL Documentation](https://stackql.io/docs)**: Learn SQL-based infrastructure management
- **[Databricks Asset Bundles](https://docs.databricks.com/en/dev-tools/bundles/)**: DAB concepts and advanced patterns
- **[stackql-deploy GitHub Action](https://github.com/stackql/stackql-deploy-action)**: CI/CD integration guide

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## ⚠️ Important Notes

- **Cost Management**: This project provisions billable cloud resources. Always run teardown commands after testing.
- **Cleanup Required**: Cancel the Databricks subscription after completing the exercise to avoid ongoing charges.
- **Security**: Never commit credentials to version control. Use environment variables and CI/CD secrets.
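
To act on the cost-management note above, infrastructure can be torn down with the same tooling used to build it. The step below is a hedged sketch, assuming the action forwards stackql-deploy's `teardown` command the same way it forwards `build` in the workflow above:

```yaml
# Illustrative teardown step -- run after testing to avoid ongoing charges
- name: Teardown Infrastructure with StackQL
  uses: stackql/stackql-deploy-action@v1.0.2
  with:
    command: 'teardown'
    stack_dir: 'infrastructure'
    stack_env: ${{ env.ENVIRONMENT }}
```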

---

*Demonstrating the future of DataOps with SQL-based infrastructure management and modern data pipeline orchestration.*