https://github.com/dmatrix/spark-declarative-pipelines
A collection of modern Spark Declarative Pipeline examples and implementations demonstrating different data processing paradigms and frameworks.
https://github.com/dmatrix/spark-declarative-pipelines
pyspark spark spark-declarative-pipelines
Last synced: about 1 month ago
JSON representation
A collection of modern Spark Declarative Pipeline examples and implementations demonstrating different data processing paradigms and frameworks.
- Host: GitHub
- URL: https://github.com/dmatrix/spark-declarative-pipelines
- Owner: dmatrix
- Created: 2025-09-12T22:53:54.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-11-23T20:03:50.000Z (7 months ago)
- Last Synced: 2025-11-23T22:06:27.449Z (7 months ago)
- Topics: pyspark, spark, spark-declarative-pipelines
- Language: Python
- Homepage:
- Size: 4.34 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Spark Declarative Pipelines - Comprehensive Data Engineering Examples
[](https://www.python.org/)
[](https://spark.apache.org/)
[](https://spark.apache.org/)
[](https://databricks.com/)
[](https://github.com/astral-sh/uv)
[](LICENSE)
[](https://github.com/psf/black)
[](https://github.com/username/etl-pipelines/graphs/commit-activity)
[](https://github.com/username/etl-pipelines)
[](https://opensource.org/)
---
A collection of modern Spark Declarative Pipeline (SDP) implementations demonstrating different data processing paradigms. This repository showcases both open source Spark Declarative Pipelines (SDP) for analytics workloads and [Spark Declarative Pipelines on Databricks](https://docs.databricks.com/aws/en/dlt/), formely called LDP, for data processing.
## 📁 Project Structure
```
etl-pipelines/
├── README.md # This overview document
├── CLAUDE.md # Claude Code configuration
└── src/py/
├── sdp/ # OSS Spark Declarative Pipelines examples
│ ├── README.md # Comprehensive SDP documentation
│ ├── daily_orders/ # E-commerce analytics pipeline
│ ├── oil_rigs/ # Industrial IoT monitoring pipeline
│ └── utils/ # Shared data generation utilities
├── lsdp/ # Lakeflow Spark Declarative Pipelines (on Databricks)
│ └── music_analytics/ # Million Song Dataset analytics pipeline
│ ├── README.md # Music analytics documentation
│ ├── images/ # Pipeline visualization assets
│ └── transformations/ # SDP transformation definitions
└── generators/ # Cross-framework data generators
```
## 🚀 Getting Started
### SDP - Spark Declarative Pipelines
Perfect for batch analytics and data science workloads using PySpark.
```bash
# Navigate to SDP examples
cd src/py/sdp
# Install dependencies with UV
uv sync
# Run Daily Orders e-commerce pipeline
python main.py daily-orders
# Run Oil Rigs sensor monitoring pipeline
python main.py oil-rigs
```
### LSDP - Lakeflow Spark Declarative Pipelines (on Databricks)
Ideal for streaming data processing with medallion architecture and data quality validation on Databricks Platform.
```bash
# Navigate to Music Analytics SDP example
cd src/py/lsdp/music_analytics
# Deploy pipeline to Databricks workspace
# Pipeline processes Million Song Dataset with medallion architecture
# See README.md for detailed implementation overview
```
## 📊 Use Cases Demonstrated
### 1. **Daily Orders E-commerce Analytics** (SDP)
- **Framework**: Spark Declarative Pipelines
- **Data**: Synthetic e-commerce orders with 20+ product categories
- **Features**: Order lifecycle management, sales tax calculations, business analytics
- **Storage**: Local Spark warehouse with Parquet files
- **Scale**: Development/testing workloads
### 2. **Oil Rigs Industrial Monitoring** (SDP)
- **Framework**: Spark Declarative Pipelines
- **Data**: IoT sensor data from Texas oil fields (temperature, pressure, water level)
- **Features**: Multi-location monitoring, statistical analysis, interactive visualizations
- **Storage**: Local Spark warehouse with time-series data
- **Scale**: Sensor analytics and operational monitoring
### 3. **Music Analytics - Million Song Dataset** (on Databricks)
- **Framework**: Spark Declarative Pipelines
- **Data**: Million Song Dataset with 20 fields of artist, song, and audio features
- **Architecture**: Medallion pattern with specialized silver tables and comprehensive gold analytics
- **Silver Layer**: Domain-focused tables (`songs_metadata_silver`, `songs_audio_features_silver`) with comprehensive data quality validation
- **Gold Layer**: 9 advanced analytics tables across temporal, artist, and musical analysis dimensions
- **Analytics**: Artist discography analysis, temporal trends, musical characteristics, tempo/time signature patterns, comprehensive artist profiles
- **Storage**: Delta tables with Unity Catalog integration and automatic data lineage
- **Scale**: Production-ready streaming data processing with Auto Loader and extensive data quality rules
## 🛠️ Technologies & Frameworks
### Core Technologies
- **PySpark 4.1.0.preview2**: Latest Spark features with Python API
- **Spark Declarative Pipelines**: Building data processing pipelines with materialized views and streaming tables
- **UV Package Manager**: Modern Python dependency management
- **Faker**: Realistic synthetic data generation
- **Plotly**: Interactive data visualizations
### Architecture Patterns
- **Declarative Pipelines**: SDP framework with Python decorators with `@dp.table` decorators
- **Medallion Architecture**: Bronze/Silver/Gold data layers with specialized silver tables for domain separation
- **Materialized Views**: Efficient data transformation caching and automatic dependency resolution
- **Data Quality Framework**: Comprehensive validation rules with `@dp.expect` decorators for tempo, duration, and metadata validation
- **Advanced Analytics**: Multi-dimensional gold layer tables combining temporal, artist, and musical analysis
- **Shared Utilities**: Reusable data generation components across frameworks
## 📋 Quick Reference
### SDP Commands
```bash
# Environment setup
cd src/py/sdp && uv sync
# Run pipelines
python main.py daily-orders # E-commerce analytics
python main.py oil-rigs # IoT sensor monitoring
# Test utilities
uv run sdp-test-orders # Test order generation
uv run sdp-test-oil-sensors # Test sensor data generation
# Development commands
uv run pytest # Run tests
uv run black . # Format code
uv run flake8 . # Lint code
```
### SDP for Databricks
```bash
# Navigate to Music Analytics pipeline
cd src/py/sdp2dbx/music_analytics
# View pipeline documentation and architecture
cat README.md
# Deploy to Databricks workspace (requires Databricks environment)
# See transformations/sdp_musical_pipeline.py for implementation
```
## 🎯 Learning Objectives
This repository demonstrates:
1. **Framework Comparison**: SDP vs LSPD for different use cases and data processing paradigms
2. **Data Generation**: Realistic synthetic data creation patterns with Faker library
3. **Pipeline Architecture**: Declarative transformations, medallion architecture, and specialized table design
4. **Quality Engineering**: Comprehensive data validation with `@dp.expect` rules and monitoring strategies
5. **Advanced Analytics**: Multi-dimensional analysis combining temporal trends, artist profiling, and musical characteristics
6. **Modern Tooling**: UV package management, Unity Catalog, Auto Loader, and latest Spark features
7. **Production Patterns**: Streaming ingestion, environment management, and scalable deployment workflows
## 📚 Documentation
- **[SDP README.md](src/py/sdp/README.md)**: Comprehensive Spark Declarative Pipelines guide
- **[Music Analytics SDP README](src/py/lsdp/music_analytics/README.md)**: Million Song Dataset Spark Declarative Pipelines implementation
- **[CLAUDE.md](CLAUDE.md)**: Claude Code configuration for repository navigation
## 🔧 Development Setup
### Prerequisites
- **Python 3.11+**: Required for all frameworks
- **UV Package Manager**: Modern dependency management
- **Java 11+**: Required by PySpark (handled automatically)
- **Databricks Workspace**: Required for SPD pipelines on Databricks
### Installation
```bash
# Install UV package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone repository
git clone
cd etl-pipelines
# Setup SDP environment
cd src/py/sdp && uv sync
# Verify installation
uv run python -c "import pyspark; print('PySpark version:', pyspark.__version__)"
```
## 💡 Best Practices Demonstrated
### Code Organization
- **Centralized Utilities**: Shared data generation functions
- **Clear Separation**: Framework-specific implementations
- **Configuration Management**: Environment-specific settings
- **Comprehensive Testing**: Unit tests and validation scripts
### Data Engineering
- **Quality First**: Built-in data validation and monitoring
- **Scalable Patterns**: From development to production
- **Modern Tooling**: Latest framework features and best practices
- **Documentation**: Comprehensive guides and examples
### Pipeline Design
- **Modularity**: Reusable components and transformations
- **Observability**: Metrics, logging, and monitoring
- **Flexibility**: Support for both batch and streaming workloads
- **Maintainability**: Clear structure and comprehensive documentation
---
*This repository provides practical examples of modern data engineering patterns, suitable for learning and development. Each framework demonstrates different strengths and use cases in the data processing ecosystem.*