{"id":32565861,"url":"https://github.com/stackql/databricks-lakeflow-jobs-example","last_synced_at":"2025-10-29T04:54:01.899Z","repository":{"id":318754697,"uuid":"1075673167","full_name":"stackql/databricks-lakeflow-jobs-example","owner":"stackql","description":"Demo of Databricks Lakeflow Jobs Automation with StackQL and Databricks Asset Bundles","archived":false,"fork":false,"pushed_at":"2025-10-14T22:14:46.000Z","size":71,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-15T02:19:32.028Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/stackql.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-13T20:33:41.000Z","updated_at":"2025-10-14T22:14:49.000Z","dependencies_parsed_at":"2025-10-15T04:48:41.084Z","dependency_job_id":"9a637f20-cc38-46b8-a3cb-97a79219d8dd","html_url":"https://github.com/stackql/databricks-lakeflow-jobs-example","commit_stats":null,"previous_names":["stackql/databricks-lakeflow-jobs-example"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/stackql/databricks-lakeflow-jobs-example","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stackql%2Fdatabricks-lakeflow-jobs-example","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stackql%2Fdatabricks-lakeflow-jobs-example/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stackql%2Fdatabricks-lakeflow-jobs-example/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stackql%2Fdatabricks-lakeflow-jobs-example/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/stackql","download_url":"https://codeload.github.com/stackql/databricks-lakeflow-jobs-example/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stackql%2Fdatabricks-lakeflow-jobs-example/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":281563794,"owners_count":26522704,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-29T02:00:06.901Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-10-29T04:54:00.741Z","updated_at":"2025-10-29T04:54:01.891Z","avatar_url":"https://github.com/stackql.png","language":"Python","readme":"# Databricks Lakeflow Jobs with StackQL-Deploy\r\n\r\nA complete end-to-end demonstration of deploying and managing **Databricks Lakeflow jobs** using **StackQL-Deploy** for infrastructure provisioning and **Databricks Asset Bundles (DABs)** for data pipeline management.\r\n\r\n[![Databricks Asset Bundle CI/CD](https://github.com/stackql/databricks-lakeflow-jobs-example/actions/workflows/databricks-dab.yml/badge.svg)](https://github.com/stackql/databricks-lakeflow-jobs-example/actions/workflows/databricks-dab.yml)\r\n\r\n## 🎯 Project Overview\r\n\r\nThis repository demonstrates modern DataOps practices by combining:\r\n\r\n- **🏗️ Infrastructure as Code**: Using [StackQL](https://stackql.io) and [stackql-deploy](https://stackql-deploy.io) for SQL-based infrastructure management\r\n- **📊 Data Pipeline Management**: Using [Databricks Asset Bundles](https://docs.databricks.com/en/dev-tools/bundles/index.html) for job orchestration and deployment\r\n- **🚀 GitOps CI/CD**: Automated infrastructure provisioning and data pipeline deployment via GitHub Actions\r\n\r\n### What This Project Does\r\n\r\n1. **Provisions Databricks Infrastructure** using StackQL-Deploy:\r\n   - AWS IAM roles and cross-account permissions\r\n   - S3 buckets for workspace storage\r\n   - Databricks workspace with Unity Catalog\r\n   - Storage credentials and external locations\r\n\r\n2. **Deploys a Retail Data Pipeline** using Databricks Asset Bundles:\r\n   - Multi-stage data processing (Bronze → Silver → Gold)\r\n   - Parallel task execution with dependency management\r\n   - State-based conditional processing\r\n   - For-each loops for parallel state processing\r\n\r\n3. **Automates Everything** with GitHub Actions:\r\n   - Infrastructure provisioning on push to main\r\n   - DAB validation and deployment\r\n   - Multi-environment support (dev/prod)\r\n\r\n## 🏛️ Architecture\r\n\r\n```mermaid\r\ngraph TB\r\n    subgraph \"GitHub Repository\"\r\n        A[infrastructure/] --\u003e B[StackQL-Deploy]\r\n        C[retail-job/] --\u003e D[Databricks Asset Bundle]\r\n    end\r\n    \r\n    subgraph \"AWS Cloud\"\r\n        B --\u003e E[IAM Roles]\r\n        B --\u003e F[S3 Buckets]\r\n        B --\u003e G[VPC/Security Groups]\r\n    end\r\n    \r\n    subgraph \"Databricks Platform\"\r\n        B --\u003e H[Workspace]\r\n        D --\u003e I[Lakeflow Jobs]\r\n        H --\u003e I\r\n        I --\u003e J[Bronze Tables]\r\n        I --\u003e K[Silver Tables]\r\n        I --\u003e L[Gold Tables]\r\n    end\r\n    \r\n    subgraph \"CI/CD Pipeline\"\r\n        M[GitHub Actions] --\u003e B\r\n        M --\u003e D\r\n        M --\u003e N[Multi-Environment Deployment]\r\n    end\r\n```\r\n\r\n## 📁 Repository Structure\r\n\r\n```\r\ndatabricks-lakeflow-jobs-example/\r\n├── infrastructure/                    # StackQL infrastructure templates\r\n│   ├── README.md                     # Infrastructure setup guide\r\n│   ├── stackql_manifest.yml         # StackQL deployment configuration\r\n│   └── resources/                    # Cloud resource templates\r\n│       ├── aws/                      # AWS resources (IAM, S3)\r\n│       ├── databricks_account/       # Account-level Databricks resources\r\n│       └── databricks_workspace/     # Workspace configurations\r\n├── retail-job/                       # Databricks Asset Bundle\r\n│   ├── databricks.yml               # DAB configuration\r\n│   └── Task Files/                   # Data pipeline notebooks\r\n│       ├── 01_data_ingestion/        # Bronze layer data ingestion\r\n│       ├── 02_data_loading/          # Customer data loading\r\n│       ├── 03_data_processing/       # Silver layer transformations\r\n│       ├── 04_data_transformation/   # Gold layer clean data\r\n│       └── 05_state_processing/      # State-specific processing\r\n└── .github/workflows/                # CI/CD automation\r\n    └── databricks-dab.yml           # GitHub Actions workflow\r\n```\r\n\r\n## 🚀 Quick Start\r\n\r\n### Prerequisites\r\n\r\n- AWS account with administrative permissions\r\n- Databricks account (see [infrastructure setup guide](./infrastructure/README.md))\r\n- Python 3.8+ and Git\r\n\r\n### 1. Clone Repository\r\n\r\n```bash\r\ngit clone https://github.com/stackql/databricks-lakeflow-jobs-example.git\r\ncd databricks-lakeflow-jobs-example\r\n```\r\n\r\n### 2. Set Up Infrastructure\r\n\r\nFollow the comprehensive [Infrastructure Setup Guide](./infrastructure/README.md) to:\r\n- Configure AWS and Databricks accounts\r\n- Set up service principals and permissions\r\n- Deploy infrastructure using StackQL-Deploy\r\n\r\n### 3. Deploy Data Pipeline\r\n\r\nOnce infrastructure is provisioned:\r\n\r\n```bash\r\ncd retail-job\r\n\r\n# Validate the bundle\r\ndatabricks bundle validate --target dev\r\n\r\n# Deploy the data pipeline\r\ndatabricks bundle deploy --target dev\r\n\r\n# Run the complete pipeline\r\ndatabricks bundle run retail_data_processing_job --target dev\r\n```\r\n\r\n## 📊 Data Pipeline Deep Dive\r\n\r\nThe retail data pipeline demonstrates a complete **medallion architecture** (Bronze → Silver → Gold):\r\n\r\n### Pipeline Stages\r\n\r\n1. **🥉 Bronze Layer - Data Ingestion**\r\n   - **Orders Ingestion**: Loads raw sales orders data\r\n   - **Sales Ingestion**: Loads raw sales transaction data\r\n   - Tables: `orders_bronze`, `sales_bronze`\r\n\r\n2. **🥈 Silver Layer - Data Processing**\r\n   - **Customer Loading**: Loads customer master data\r\n   - **Data Joining**: Joins customers with sales and orders\r\n   - **Duplicate Removal**: Conditional deduplication based on data quality\r\n   - Tables: `customers_bronze`, `customer_sales_silver`, `customer_orders_silver`\r\n\r\n3. **🥇 Gold Layer - Data Transformation**\r\n   - **Clean \u0026 Transform**: Business-ready, curated datasets\r\n   - **State Processing**: Parallel processing for each US state using for-each loops\r\n   - Tables: `retail_gold`, `state_summary_gold`\r\n\r\n### Advanced DAB Features Demonstrated\r\n\r\n- **🔄 Parallel Execution**: Multiple tasks run concurrently where dependencies allow\r\n- **🎯 Conditional Tasks**: Deduplication only runs if duplicates are detected\r\n- **🔁 For-Each Loops**: State processing runs in parallel for multiple states\r\n- **📧 Notifications**: Email alerts on job success/failure\r\n- **⏱️ Timeouts \u0026 Limits**: Job execution controls and concurrent run limits\r\n- **🎛️ Parameters**: Dynamic state-based processing with base parameters\r\n\r\n## 🔄 CI/CD Pipeline\r\n\r\nThe GitHub Actions workflow ([`.github/workflows/databricks-dab.yml`](./.github/workflows/databricks-dab.yml)) provides complete automation:\r\n\r\n### Workflow Triggers\r\n\r\n- **Pull Requests**: Validates changes against dev environment\r\n- **Main Branch Push**: Deploys to production environment\r\n- **Path-Based**: Only triggers on infrastructure or job configuration changes\r\n\r\n### Deployment Steps\r\n\r\n1. **🏗️ Infrastructure Provisioning**\r\n   ```yaml\r\n   - name: Deploy Infrastructure with StackQL\r\n     uses: stackql/stackql-deploy-action@v1.0.2\r\n     with:\r\n       command: 'build'\r\n       stack_dir: 'infrastructure'\r\n       stack_env: ${{ env.ENVIRONMENT }}\r\n   ```\r\n\r\n2. **📊 Workspace Configuration**\r\n   - Extracts workspace details from StackQL deployment\r\n   - Configures Databricks CLI with workspace credentials\r\n   - Sets up environment-specific configurations\r\n\r\n3. **✅ DAB Validation \u0026 Deployment**\r\n   ```yaml\r\n   - name: Validate Databricks Asset Bundle\r\n     run: databricks bundle validate --target ${{ env.ENVIRONMENT }}\r\n   \r\n   - name: Deploy Databricks Jobs\r\n     run: databricks bundle deploy --target ${{ env.ENVIRONMENT }}\r\n   ```\r\n\r\n4. **🧪 Pipeline Testing**\r\n   - Runs the complete data pipeline\r\n   - Validates job execution and data quality\r\n   - Reports results and generates summaries\r\n\r\n### Environment Management\r\n\r\nThe workflow supports multiple environments with automatic detection:\r\n- **Dev Environment**: For pull requests and feature development\r\n- **Production Environment**: For main branch deployments\r\n\r\nEnvironment-specific configurations are managed through:\r\n- StackQL environment variables and stack environments\r\n- Databricks Asset Bundle targets (`dev`, `prd`)\r\n- GitHub repository secrets for credentials\r\n\r\n## 🛠️ Key Technologies\r\n\r\n### StackQL \u0026 stackql-deploy\r\n- **SQL-based Infrastructure**: Manage cloud resources using familiar SQL syntax\r\n- **State-free Operations**: No state files - query infrastructure directly from APIs\r\n- **Multi-cloud Support**: Consistent interface across AWS, Azure, GCP, and SaaS providers\r\n- **GitOps Ready**: Native CI/CD integration with GitHub Actions\r\n\r\n### Databricks Asset Bundles\r\n- **Environment Consistency**: Deploy the same code across dev/staging/prod\r\n- **Version Control**: Infrastructure and code in sync with Git workflows\r\n- **Advanced Orchestration**: Complex dependencies, conditions, and parallel execution\r\n- **Resource Management**: Automated cluster provisioning and job scheduling\r\n\r\n### Modern DataOps Practices\r\n- **Infrastructure as Code**: Everything versioned and reproducible\r\n- **GitOps Workflows**: Pull request-based infrastructure changes\r\n- **Environment Parity**: Identical configurations across environments\r\n- **Automated Testing**: Pipeline validation and data quality checks\r\n\r\n## 📚 Learn More\r\n\r\n- **[Infrastructure Setup Guide](./infrastructure/README.md)**: Complete StackQL-Deploy setup and usage\r\n- **[StackQL Documentation](https://stackql.io/docs)**: Learn SQL-based infrastructure management\r\n- **[Databricks Asset Bundles](https://docs.databricks.com/en/dev-tools/bundles/)**: DAB concepts and advanced patterns\r\n- **[stackql-deploy GitHub Action](https://github.com/stackql/stackql-deploy-action)**: CI/CD integration guide\r\n\r\n## 🤝 Contributing\r\n\r\n1. Fork the repository\r\n2. Create a feature branch (`git checkout -b feature/amazing-feature`)\r\n3. Commit your changes (`git commit -m 'Add amazing feature'`)\r\n4. Push to the branch (`git push origin feature/amazing-feature`)\r\n5. Open a Pull Request\r\n\r\n## 📄 License\r\n\r\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\r\n\r\n## ⚠️ Important Notes\r\n\r\n- **Cost Management**: This project provisions billable cloud resources. Always run teardown commands after testing.\r\n- **Cleanup Required**: Cancel Databricks subscription after completing the exercise to avoid ongoing charges.\r\n- **Security**: Never commit credentials to version control. Use environment variables and CI/CD secrets.\r\n\r\n---\r\n\r\n*Demonstrating the future of DataOps with SQL-based infrastructure management and modern data pipeline orchestration.*","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstackql%2Fdatabricks-lakeflow-jobs-example","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstackql%2Fdatabricks-lakeflow-jobs-example","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstackql%2Fdatabricks-lakeflow-jobs-example/lists"}