{"id":49869586,"url":"https://github.com/dmatrix/spark-declarative-pipelines","last_synced_at":"2026-05-15T05:02:55.829Z","repository":{"id":314524142,"uuid":"1055848693","full_name":"dmatrix/spark-declarative-pipelines","owner":"dmatrix","description":"A collection of modern Spark Declarative Pipeline  examples and implementations demonstrating different data processing paradigms and frameworks. ","archived":false,"fork":false,"pushed_at":"2025-11-23T20:03:50.000Z","size":4556,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-11-23T22:06:27.449Z","etag":null,"topics":["pyspark","spark","spark-declarative-pipelines"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dmatrix.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-12T22:53:54.000Z","updated_at":"2025-11-23T20:03:53.000Z","dependencies_parsed_at":"2025-09-13T01:23:26.925Z","dependency_job_id":"488f9613-f3e3-49fe-8c4e-7ff5d29d1578","html_url":"https://github.com/dmatrix/spark-declarative-pipelines","commit_stats":null,"previous_names":["dmatrix/etl-pipelines","dmatrix/spark-declarative-pipelines"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/dmatrix/spark-declarative-pipelines","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmatrix%2Fspark-declarative-pipelines","tags_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmatrix%2Fspark-declarative-pipelines/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmatrix%2Fspark-declarative-pipelines/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmatrix%2Fspark-declarative-pipelines/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dmatrix","download_url":"https://codeload.github.com/dmatrix/spark-declarative-pipelines/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmatrix%2Fspark-declarative-pipelines/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33054454,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-13T13:14:54.681Z","status":"online","status_checked_at":"2026-05-15T02:00:06.351Z","response_time":103,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pyspark","spark","spark-declarative-pipelines"],"created_at":"2026-05-15T05:02:50.260Z","updated_at":"2026-05-15T05:02:51.500Z","avatar_url":"https://github.com/dmatrix.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Spark Declarative Pipelines - Comprehensive Data Engineering 
Examples\n\n[![Python](https://img.shields.io/badge/Python-3.11+-blue.svg?style=flat\u0026logo=python\u0026logoColor=white)](https://www.python.org/)\n[![PySpark](https://img.shields.io/badge/PySpark-4.1.0.preview2-orange.svg?style=flat\u0026logo=apache-spark\u0026logoColor=white)](https://spark.apache.org/)\n[![SDP](https://img.shields.io/badge/Spark%20Declarative%20Pipelines-SDP-brightgreen.svg?style=flat\u0026logo=apache-spark\u0026logoColor=white)](https://spark.apache.org/)\n[![Databricks](https://img.shields.io/badge/Databricks-SDP2DBX-red.svg?style=flat\u0026logo=databricks\u0026logoColor=white)](https://databricks.com/)\n[![UV](https://img.shields.io/badge/UV-Package%20Manager-purple.svg?style=flat)](https://github.com/astral-sh/uv)\n[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg?style=flat)](LICENSE)\n[![Code Style](https://img.shields.io/badge/Code%20Style-black-black.svg?style=flat)](https://github.com/psf/black)\n\n[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg?style=flat)](https://github.com/username/etl-pipelines/graphs/commit-activity)\n[![Made with Love](https://img.shields.io/badge/Made%20with-❤️-red.svg?style=flat)](https://github.com/username/etl-pipelines)\n[![Open Source](https://badges.frapsoft.com/os/v1/open-source.svg?v=103\u0026style=flat)](https://opensource.org/)\n\n---\n\nA collection of modern Spark Declarative Pipeline (SDP) implementations demonstrating different data processing paradigms. 
This repository showcases both open-source Spark Declarative Pipelines (SDP) for analytics workloads and [Spark Declarative Pipelines on Databricks](https://docs.databricks.com/aws/en/dlt/), formerly called LDP, for data processing.\n\n## 📁 Project Structure\n\n```\netl-pipelines/\n├── README.md                          # This overview document\n├── CLAUDE.md                          # Claude Code configuration\n└── src/py/\n    ├── sdp/                          # OSS Spark Declarative Pipelines examples\n    │   ├── README.md                 # Comprehensive SDP documentation\n    │   ├── daily_orders/             # E-commerce analytics pipeline\n    │   ├── oil_rigs/                 # Industrial IoT monitoring pipeline\n    │   └── utils/                    # Shared data generation utilities\n    ├── lsdp/                         # Lakeflow Spark Declarative Pipelines (on Databricks)\n    │   └── music_analytics/          # Million Song Dataset analytics pipeline\n    │       ├── README.md             # Music analytics documentation\n    │       ├── images/               # Pipeline visualization assets\n    │       └── transformations/      # SDP transformation definitions\n    └── generators/                   # Cross-framework data generators\n```\n\n## 🚀 Getting Started\n\n### SDP - Spark Declarative Pipelines\nPerfect for batch analytics and data science workloads using PySpark.\n\n```bash\n# Navigate to SDP examples\ncd src/py/sdp\n\n# Install dependencies with UV\nuv sync\n\n# Run Daily Orders e-commerce pipeline\npython main.py daily-orders\n\n# Run Oil Rigs sensor monitoring pipeline\npython main.py oil-rigs\n```\n\n### LSDP - Lakeflow Spark Declarative Pipelines (on Databricks)\nIdeal for streaming data processing with medallion architecture and data quality validation on the Databricks Platform.\n\n```bash\n# Navigate to Music Analytics SDP example\ncd src/py/lsdp/music_analytics\n\n# Deploy pipeline to Databricks workspace\n# Pipeline processes Million Song 
Dataset with medallion architecture\n# See README.md for detailed implementation overview\n```\n\n## 📊 Use Cases Demonstrated\n\n### 1. **Daily Orders E-commerce Analytics** (SDP)\n- **Framework**: Spark Declarative Pipelines\n- **Data**: Synthetic e-commerce orders with 20+ product categories\n- **Features**: Order lifecycle management, sales tax calculations, business analytics\n- **Storage**: Local Spark warehouse with Parquet files\n- **Scale**: Development/testing workloads\n\n### 2. **Oil Rigs Industrial Monitoring** (SDP) \n- **Framework**: Spark Declarative Pipelines\n- **Data**: IoT sensor data from Texas oil fields (temperature, pressure, water level)\n- **Features**: Multi-location monitoring, statistical analysis, interactive visualizations\n- **Storage**: Local Spark warehouse with time-series data\n- **Scale**: Sensor analytics and operational monitoring\n\n### 3. **Music Analytics - Million Song Dataset** (on Databricks)\n- **Framework**: Spark Declarative Pipelines\n- **Data**: Million Song Dataset with 20 fields of artist, song, and audio features\n- **Architecture**: Medallion pattern with specialized silver tables and comprehensive gold analytics\n- **Silver Layer**: Domain-focused tables (`songs_metadata_silver`, `songs_audio_features_silver`) with comprehensive data quality validation\n- **Gold Layer**: 9 advanced analytics tables across temporal, artist, and musical analysis dimensions\n- **Analytics**: Artist discography analysis, temporal trends, musical characteristics, tempo/time signature patterns, comprehensive artist profiles\n- **Storage**: Delta tables with Unity Catalog integration and automatic data lineage\n- **Scale**: Production-ready streaming data processing with Auto Loader and extensive data quality rules\n\n## 🛠️ Technologies \u0026 Frameworks\n\n### Core Technologies\n- **PySpark 4.1.0.preview2**: Latest Spark features with Python API\n- **Spark Declarative Pipelines**: Building data processing pipelines with materialized 
views and streaming tables\n- **UV Package Manager**: Modern Python dependency management\n- **Faker**: Realistic synthetic data generation\n- **Plotly**: Interactive data visualizations\n\n### Architecture Patterns\n- **Declarative Pipelines**: SDP framework with Python `@dp.table` decorators\n- **Medallion Architecture**: Bronze/Silver/Gold data layers with specialized silver tables for domain separation\n- **Materialized Views**: Efficient data transformation caching and automatic dependency resolution\n- **Data Quality Framework**: Comprehensive validation rules with `@dp.expect` decorators for tempo, duration, and metadata validation\n- **Advanced Analytics**: Multi-dimensional gold layer tables combining temporal, artist, and musical analysis\n- **Shared Utilities**: Reusable data generation components across frameworks\n\n## 📋 Quick Reference\n\n### SDP Commands\n```bash\n# Environment setup\ncd src/py/sdp \u0026\u0026 uv sync\n\n# Run pipelines\npython main.py daily-orders # E-commerce analytics\npython main.py oil-rigs     # IoT sensor monitoring\n\n# Test utilities\nuv run sdp-test-orders      # Test order generation\nuv run sdp-test-oil-sensors # Test sensor data generation\n\n# Development commands\nuv run pytest              # Run tests\nuv run black .             # Format code\nuv run flake8 .            # Lint code\n```\n\n### SDP for Databricks \n```bash\n# Navigate to Music Analytics pipeline\ncd src/py/lsdp/music_analytics\n\n# View pipeline documentation and architecture\ncat README.md\n\n# Deploy to Databricks workspace (requires Databricks environment)\n# See transformations/sdp_musical_pipeline.py for implementation\n```\n\n## 🎯 Learning Objectives\n\nThis repository demonstrates:\n\n1. **Framework Comparison**: SDP vs LSDP for different use cases and data processing paradigms\n2. **Data Generation**: Realistic synthetic data creation patterns with the Faker library\n3. 
**Pipeline Architecture**: Declarative transformations, medallion architecture, and specialized table design\n4. **Quality Engineering**: Comprehensive data validation with `@dp.expect` rules and monitoring strategies\n5. **Advanced Analytics**: Multi-dimensional analysis combining temporal trends, artist profiling, and musical characteristics\n6. **Modern Tooling**: UV package management, Unity Catalog, Auto Loader, and the latest Spark features\n7. **Production Patterns**: Streaming ingestion, environment management, and scalable deployment workflows\n\n## 📚 Documentation\n\n- **[SDP README.md](src/py/sdp/README.md)**: Comprehensive Spark Declarative Pipelines guide\n- **[Music Analytics SDP README](src/py/lsdp/music_analytics/README.md)**: Million Song Dataset Spark Declarative Pipelines implementation\n- **[CLAUDE.md](CLAUDE.md)**: Claude Code configuration for repository navigation\n\n## 🔧 Development Setup\n\n### Prerequisites\n- **Python 3.11+**: Required for all frameworks\n- **UV Package Manager**: Modern dependency management\n- **Java 17+**: Required by PySpark (handled automatically)\n- **Databricks Workspace**: Required for SDP pipelines on Databricks\n\n### Installation\n```bash\n# Install UV package manager\ncurl -LsSf https://astral.sh/uv/install.sh | sh\n\n# Clone repository\ngit clone \u003crepository-url\u003e\ncd etl-pipelines\n\n# Setup SDP environment\ncd src/py/sdp \u0026\u0026 uv sync\n\n# Verify installation\nuv run python -c \"import pyspark; print('PySpark version:', pyspark.__version__)\"\n```\n\n## 💡 Best Practices Demonstrated\n\n### Code Organization\n- **Centralized Utilities**: Shared data generation functions\n- **Clear Separation**: Framework-specific implementations\n- **Configuration Management**: Environment-specific settings\n- **Comprehensive Testing**: Unit tests and validation scripts\n\n### Data Engineering\n- **Quality First**: Built-in data validation and monitoring\n- **Scalable Patterns**: From development to production\n- 
**Modern Tooling**: Latest framework features and best practices\n- **Documentation**: Comprehensive guides and examples\n\n### Pipeline Design\n- **Modularity**: Reusable components and transformations  \n- **Observability**: Metrics, logging, and monitoring\n- **Flexibility**: Support for both batch and streaming workloads\n- **Maintainability**: Clear structure and comprehensive documentation\n\n---\n\n*This repository provides practical examples of modern data engineering patterns, suitable for learning and development. Each framework demonstrates different strengths and use cases in the data processing ecosystem.*\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdmatrix%2Fspark-declarative-pipelines","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdmatrix%2Fspark-declarative-pipelines","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdmatrix%2Fspark-declarative-pipelines/lists"}