https://github.com/amitnema/spark-etl-framework
A generic, multimodule framework for building scalable ETL (Extract, Transform, Load) processes using Apache Spark and Java. This framework is designed to be cloud-agnostic, maintainable, testable, and near-production-ready.
- Host: GitHub
- URL: https://github.com/amitnema/spark-etl-framework
- Owner: amitnema
- License: apache-2.0
- Created: 2025-09-02T18:36:59.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2025-09-11T05:31:05.000Z (about 1 month ago)
- Last Synced: 2025-09-11T08:53:05.874Z (about 1 month ago)
- Topics: etl, etl-framework, etl-pipeline, general-purpose, generalization, java-11, maven, spark-3x, spark-submit
- Language: Java
- Homepage:
- Size: 208 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
[Build](https://github.com/amitnema/spark-etl-framework/actions/workflows/maven.yml) [Coverage](https://codecov.io/gh/amitnema/spark-etl-framework) [License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)

# Spark ETL Framework
**Author:** Amit Prakash Nema
A generic, multimodule framework for building scalable ETL (Extract, Transform, Load) processes using Apache Spark and Java. This framework is designed to be cloud-agnostic, maintainable, testable, and production-ready.

## Architecture Overview
The framework follows enterprise design patterns and consists of the following modules:
- **etl-core**: Core framework components including configuration, validation, transformation interfaces, and I/O operations
- **etl-jobs**: Concrete job implementations and custom transformers

## Architecture Diagram
```mermaid
flowchart TD
A[Job Config] --> B[ETLEngine]
B --> C[Reader]
B --> D[Transformer]
B --> E[Writer]
B --> F[Validation]
B --> G[Monitoring]
B --> H[Cloud Integration]
C -->|Reads Data| D
D -->|Transforms Data| E
E -->|Writes Data| F
F -->|Validates| G
G -->|Monitors| H
```
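For orientation, the sketch below shows the read → transform → validate → write sequence from the diagram as plain Java. It is a minimal illustration only; the type and method names are assumptions, not the framework's actual `ETLEngine` API.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Minimal illustration of the flow in the diagram above; not the framework's actual API.
public final class EtlFlowSketch {

    public static void run(Supplier<Dataset<Row>> reader,
                           UnaryOperator<Dataset<Row>> transformer,
                           UnaryOperator<Dataset<Row>> validator,
                           Consumer<Dataset<Row>> writer) {
        Dataset<Row> extracted = reader.get();                    // Reader: load from the configured source
        Dataset<Row> transformed = transformer.apply(extracted);  // Transformer: apply business logic
        Dataset<Row> validated = validator.apply(transformed);    // Validation: enforce data quality rules
        writer.accept(validated);                                 // Writer: persist to the configured sink
    }
}
```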
## Key Features

- ✅ **Cloud Agnostic**: Supports AWS S3, Google Cloud Storage, Azure Blob Storage
- ✅ **Modular Design**: Separation of concerns with pluggable components
- ✅ **Configuration-Driven**: YAML-based job configuration with parameter substitution
- ✅ **Data Validation**: Built-in validation rules (NOT_NULL, UNIQUE, RANGE, REGEX); see the sketch after this list
- ✅ **Multiple Data Sources**: File systems, databases (JDBC), streaming sources
- ✅ **Format Support**: Parquet, JSON, CSV, Avro, ORC
- ✅ **Transformation Framework**: Abstract transformer pattern with custom implementations
- ✅ **Error Handling**: Comprehensive exception handling and logging
- ✅ **Production Ready**: Docker containerization, monitoring, and deployment scripts
- ✅ **Testing Support**: Unit and integration testing capabilities
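To make the validation rules concrete, here is a minimal sketch of how a NOT_NULL and a RANGE check can be expressed with Spark SQL. The rule names mirror the list above, but the code is illustrative and not the framework's built-in validator.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import static org.apache.spark.sql.functions.col;

// Illustrative NOT_NULL and RANGE checks; not the framework's built-in validation implementation.
public final class ValidationSketch {

    // NOT_NULL: count rows where the column is null; zero means the rule passes.
    public static long notNullViolations(Dataset<Row> data, String column) {
        return data.filter(col(column).isNull()).count();
    }

    // RANGE: count rows outside [min, max]; zero means the rule passes.
    public static long rangeViolations(Dataset<Row> data, String column, double min, double max) {
        return data.filter(col(column).lt(min).or(col(column).gt(max))).count();
    }
}
```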
## Project Structure

```
spark-etl-framework/
├── etl-core/                       # Core framework module
│   └── src/main/java/com/etl/core/
│       ├── config/                 # Configuration management
│       ├── model/                  # Data models and POJOs
│       ├── io/                     # Data readers and writers
│       ├── validation/             # Data validation components
│       ├── transformation/         # Transformation interfaces
│       ├── factory/                # Factory pattern implementations
│       ├── exception/              # Custom exceptions
│       └── utils/                  # Utility classes
├── etl-jobs/                       # Job implementations
│   └── src/main/java/com/etl/jobs/
│       └── sample/                 # Sample job implementation
├── dist/                           # Distribution packages
│   └── sample-job/                 # Ready-to-deploy job package
├── docker/                         # Docker configuration
├── scripts/                        # Build and deployment scripts
└── README.md                       # This file
```

## Design Patterns Used
1. **Factory Pattern**: For creating readers, writers, and transformers (see the sketch after this list)
2. **Abstract Template Pattern**: For transformation pipeline
3. **Strategy Pattern**: For different I/O implementations
4. **Builder Pattern**: For configuration objects
5. **Singleton Pattern**: For configuration management
6. **Command Pattern**: For job execution
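As a rough illustration of the factory idea in item 1, the sketch below selects a reader by format. The class name and the supported formats are hypothetical; the framework's actual factories live in the `factory/` package shown above.

```java
import java.util.Locale;
import java.util.function.Supplier;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical factory sketch; the framework's actual factory classes may look different.
public final class ReaderFactorySketch {

    // Returns a supplier that loads a Dataset for the given format and path.
    public static Supplier<Dataset<Row>> forFormat(SparkSession spark, String format, String path) {
        switch (format.toLowerCase(Locale.ROOT)) {
            case "parquet":
                return () -> spark.read().parquet(path);
            case "csv":
                return () -> spark.read().option("header", "true").csv(path);
            case "json":
                return () -> spark.read().json(path);
            default:
                throw new IllegalArgumentException("Unsupported format: " + format);
        }
    }
}
```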
## Getting Started

### Prerequisites
- Java 11 (or higher)
- Maven 3.6+
- Docker (for containerized deployment)

### Build & Test
```sh
mvn clean verify
```

This runs all unit/integration tests, code quality checks, coverage, and security scans.
### Configuration
All configuration values can be overridden via environment variables. Example:
```sh
export DATABASE_URL="jdbc:postgresql://prod-db:5432/etl_db"
export AWS_ACCESS_KEY="your-access-key"
export AWS_SECRET_KEY="your-secret-key"
```

See `src/main/resources/application.properties` and `etl-jobs/src/main/resources/jobs/sample-job/job-config.yaml` for all available variables.
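As a rough sketch of how `${VAR}`-style placeholders in configuration files can be resolved from environment variables (the framework's actual substitution mechanism may differ):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative ${VAR} substitution from environment variables;
// the framework's actual parameter-substitution logic may differ.
public final class EnvSubstitutionSketch {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([A-Za-z0-9_]+)}");

    // Replaces every ${NAME} in the value with the matching environment variable (empty if unset).
    public static String resolve(String value) {
        Matcher matcher = PLACEHOLDER.matcher(value);
        StringBuilder result = new StringBuilder();
        while (matcher.find()) {
            String replacement = System.getenv().getOrDefault(matcher.group(1), "");
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(result);
        return result.toString();
    }
}
```

For example, `resolve("jdbc:postgresql://${DB_HOST}:5432/etl_db")` would pick up a hypothetical `DB_HOST` variable from the environment.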
### Running Locally

```sh
java -jar etl-jobs/target/etl-jobs-*.jar
```

### Running with Docker
Build and run the container:
```sh
docker build -t etl-framework:latest -f docker/Dockerfile .
docker run --rm -e DATABASE_URL="jdbc:postgresql://prod-db:5432/etl_db" etl-framework:latest
```

### CI/CD Workflows
- **Build, Test, Quality, Security**: See `.github/workflows/maven.yml`
- **Docker Build & Publish**: See `.github/workflows/docker.yml`

### Cloud Agnostic Deployment
- Supports AWS, GCP, Azure, and local deployments via environment variables.
- Containerized for Kubernetes, Docker Compose, or serverless platforms.

## Development Guide
### Creating Custom Transformers
Extend the `AbstractDataTransformer` class:
```java
import java.util.Map;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

public class MyCustomTransformer extends AbstractDataTransformer {
    @Override
    protected Dataset<Row> doTransform(Dataset<Row> input, Map<String, String> parameters) {
        // Your transformation logic here, e.g. keep only active records
        return input.filter(functions.col("status").equalTo("active"));
    }
}
```

### Adding New Data Sources
Implement the `DataReader` interface:
```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class MyCustomReader implements DataReader {
    @Override
    public Dataset<Row> read(InputConfig config) {
        // Your data reading logic here: build a Dataset<Row> from the configured source
        Dataset<Row> dataset = null; // replace with your source-specific read logic
        return dataset;
    }
}
```

### Configuration Management
The framework uses a hierarchical configuration approach (a resolution sketch follows the list):
1. `application.properties` - Application-wide settings
2. `job-config.yaml` - Job-specific configuration
3. Command-line parameters - Runtime overrides
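A minimal sketch of the precedence this hierarchy implies, where later layers override earlier ones; the method is illustrative and not the framework's actual configuration API:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative precedence: application.properties < job-config.yaml < command-line parameters.
// Not the framework's actual configuration API.
public final class ConfigPrecedenceSketch {

    @SafeVarargs
    public static Map<String, String> merge(Map<String, String>... layersInIncreasingPriority) {
        Map<String, String> merged = new HashMap<>();
        for (Map<String, String> layer : layersInIncreasingPriority) {
            merged.putAll(layer); // higher-priority layers overwrite lower-priority keys
        }
        return merged;
    }
}
```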
## Cloud Deployment

### AWS Configuration
```properties
cloud.provider=aws
aws.access.key=YOUR_ACCESS_KEY
aws.secret.key=YOUR_SECRET_KEY
aws.region=us-east-1
```

### GCP Configuration
```properties
cloud.provider=gcp
gcp.project.id=your-project-id
gcp.service.account.key=/path/to/service-account.json
```

### Azure Configuration
```properties
cloud.provider=azure
azure.storage.account=your-storage-account
azure.storage.access.key=your-access-key
```
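As a sketch of how such provider settings can be pushed onto the Spark/Hadoop configuration: the keys shown are the commonly documented Hadoop connector properties (exact names depend on the connector versions in use), and the mapping itself is illustrative rather than the framework's actual cloud-integration code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.sql.SparkSession;

// Illustrative mapping from provider settings to common Hadoop connector keys;
// exact property names depend on connector versions, and the framework's code may differ.
public final class CloudConfigSketch {

    public static void apply(SparkSession spark, String provider, String account, String secret) {
        Configuration hadoopConf = spark.sparkContext().hadoopConfiguration();
        switch (provider) {
            case "aws":
                hadoopConf.set("fs.s3a.access.key", account);
                hadoopConf.set("fs.s3a.secret.key", secret);
                break;
            case "gcp":
                hadoopConf.set("google.cloud.auth.service.account.enable", "true");
                hadoopConf.set("google.cloud.auth.service.account.json.keyfile", secret);
                break;
            case "azure":
                hadoopConf.set("fs.azure.account.key." + account + ".blob.core.windows.net", secret);
                break;
            default:
                throw new IllegalArgumentException("Unknown cloud provider: " + provider);
        }
    }
}
```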
## Docker Deployment

```bash
# Build Docker image
cd docker/
docker-compose build

# Run with Docker Compose
docker-compose up -d

# Execute ETL job in container
docker-compose exec etl-framework ./sample-job/spark-submit.sh
```

## Monitoring and Logging
The framework includes comprehensive logging using Logback:
- **Console Output**: Real-time job progress
- **File Logging**: Detailed logs with rotation
- **Structured Logging**: JSON format for log aggregation
- **Metrics**: Job execution metrics and validation results (see the sketch after this list)
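A small sketch of how job metrics could be attached to the structured (JSON) log output via SLF4J's MDC; the field names are examples, not the framework's actual logging contract:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// Illustrative sketch: puts job context into the MDC so a JSON Logback encoder
// can emit it as structured fields. The field names are examples only.
public final class JobLoggingSketch {

    private static final Logger LOG = LoggerFactory.getLogger(JobLoggingSketch.class);

    public static void logJobResult(String jobName, long rowsWritten, long validationFailures) {
        MDC.put("jobName", jobName);
        MDC.put("rowsWritten", Long.toString(rowsWritten));
        MDC.put("validationFailures", Long.toString(validationFailures));
        try {
            LOG.info("ETL job completed");
        } finally {
            MDC.clear(); // avoid leaking job context into unrelated log lines
        }
    }
}
```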
## Testing

```bash
# Run unit tests
mvn test

# Run integration tests
mvn verify -P integration-tests
```

## Performance Tuning
### Spark Configuration
Key configuration parameters in `application.properties`:
```properties
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true
spark.sql.adaptive.skewJoin.enabled=true
spark.serializer=org.apache.spark.serializer.KryoSerializer
```
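The same tuning options can also be applied programmatically when the session is created, for example (a generic `SparkSession` sketch, not the framework's bootstrap code):

```java
import org.apache.spark.sql.SparkSession;

// Generic SparkSession sketch applying the tuning options above in code;
// not the framework's actual bootstrap logic.
public final class SparkSessionSketch {

    public static SparkSession build(String appName) {
        return SparkSession.builder()
                .appName(appName)
                .config("spark.sql.adaptive.enabled", "true")
                .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
                .config("spark.sql.adaptive.skewJoin.enabled", "true")
                .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .getOrCreate();
    }
}
```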
### Memory Settings

Adjust memory settings in the spark-submit script:
```bash
--driver-memory 4g
--executor-memory 8g
--executor-cores 4
```

## Contributing
- Fork the repo and create a feature branch.
- Run `mvn clean verify` before submitting a PR.
- Ensure new code is covered by tests and passes code quality checks.

## License
Apache 2.0
## Contact
For questions or support, open an issue or contact the maintainer.
## Roadmap
- [ ] Kafka streaming support
- [ ] Delta Lake integration
- [ ] Advanced data profiling
- [ ] ML pipeline integration
- [ ] REST API for job management
- [ ] Web UI for monitoring

---
Built with ❤️ using Apache Spark and Java