Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
ETL With Apache Spark Deployed on K8s
https://github.com/tuancamtbtx/etl-spark-k8s
Topics: apache, k8s, spark, spark-sql, spark-streaming

JSON representation
- Host: GitHub
- URL: https://github.com/tuancamtbtx/etl-spark-k8s
- Owner: tuancamtbtx
- License: mit
- Created: 2024-05-26T13:02:28.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-07-28T14:27:51.000Z (7 months ago)
- Last Synced: 2024-11-09T02:37:54.497Z (3 months ago)
- Topics: apache, k8s, spark, spark-sql, spark-streaming
- Language: TypeScript
- Homepage:
- Size: 5.54 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
README
# ETL With Spark On K8S
This project manages ETL jobs with Spark and includes:
- **Frontend**: monitor Spark ETL jobs, visualize the data lineage chart, and customize the YAML (Spark Generator YAML) configuration
- **Backend**: manage data jobs and logs
- **Spark Generator**: a tool that generates a Spark ETL job from a YAML configuration

[![Docker Pulls](https://badgen.net/docker/pulls/vantuan12345/spark-generator?icon=docker&label=pulls)](https://hub.docker.com/r/vantuan12345/spark-generator/)
[![Docker Stars](https://badgen.net/docker/stars/vantuan12345/spark-generator?icon=docker&label=stars)](https://hub.docker.com/r/vantuan12345/spark-generator/)
[![Docker Image Size](https://badgen.net/docker/size/vantuan12345/spark-generator?icon=docker&label=image%20size)](https://hub.docker.com/r/vantuan12345/spark-generator/)
![Github stars](https://badgen.net/github/stars/tuancamtbtx/etl-spark-k8s?icon=github&label=stars)
![Github forks](https://badgen.net/github/forks/tuancamtbtx/etl-spark-k8s?icon=github&label=forks)
![Github issues](https://img.shields.io/github/issues/tuancamtbtx/etl-spark-k8s)
![Github last-commit](https://img.shields.io/github/last-commit/tuancamtbtx/etl-spark-k8s)

## Backend
TODO

## FrontEnd
### View List Spark Job
![]()
### Backfill Spark Job
![]()
### Data Lineage Visualize
![]()
### Submit Spark Job
![]()
# Spark Generator
### Configuration
Configuration for the Spark pipeline is loaded from a YAML file. Ensure you have a `spark-pipeline-config.yaml` file in the resources directory with the appropriate settings.
**Spark Job Pipeline Config**
Example `spark-pipeline-config.yaml`:
```yaml
apiVersion: "v1"
kind: SparkBatchPipeline
spec:
jobName: "ExampleSparkJob"
master: "local[*]"
appName: "Spark Ingest Transform Sink Job"
javaClass: "com.tc.bigdata.tool.app.Processor"
dependencies:
- "path/to/your/jarfile.jar"
configurations:
spark.executor.memory: "2g"
spark.driver.memory: "1g"
spark.executor.cores: "2"
steps:
- name: "Ingest"
type: "source"
format: "csv"
options:
path: "/path/data/input.csv"
header: "true"
inferSchema: "true"
delimiter: ","
encoding: "UTF-8"
- name: "Transform"
type: "transformation"
operations:
- operation: "filter"
condition: "age > 40"
- operation: "select"
columns: ["name", "age", "address"]
- operation: "withColumn"
column: "newColumn"
expression: "columnA + columnB"
- name: "Sink"
type: "write"
format: "parquet"
options:
path: "/output"
mode: "overwrite"
```
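To make the step semantics concrete, the config above corresponds roughly to the following DataFrame operations. This is an illustrative PySpark sketch only, not the project's implementation (the generated job runs the configured Java class `com.tc.bigdata.tool.app.Processor`); paths, options, and expressions are taken verbatim from the example config.

```python
# Illustrative PySpark equivalent of the Ingest -> Transform -> Sink steps above.
# Sketch for clarity only; the project builds the job in Java from the YAML config.
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("Spark Ingest Transform Sink Job")
    .config("spark.executor.memory", "2g")
    .config("spark.driver.memory", "1g")
    .config("spark.executor.cores", "2")
    .getOrCreate()
)

# Ingest: type=source, format=csv, options from the config
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("delimiter", ",")
    .option("encoding", "UTF-8")
    .load("/path/data/input.csv")
)

# Transform: operations are applied in the order listed in the config
df = (
    df.filter("age > 40")
      .select("name", "age", "address")
      .withColumn("newColumn", expr("columnA + columnB"))
)

# Sink: type=write, format=parquet, overwrite mode
df.write.format("parquet").mode("overwrite").save("/output")
```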
**K8S Spark Job Config**
Example `k8s_spark_job.yaml`:
```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
name: spark-etl
namespace: default
spec:
type: Java
mode: cluster
image: "ghcr.io/tuancamtbtx/spark-build-tool:main"
env:
- name: SPARK_JOB_CONF_PATH
value: "your_spark_pipeline_job_path_conf"
imagePullPolicy: Always
mainClass: com.tc.bigdata.tool.app.Processor
mainApplicationFile: "local:///opt/spark/spark-build-tool.jar"
sparkVersion: "3.5.1"
sparkUIOptions:
serviceLabels:
test-label/v1: 'true'
restartPolicy:
type: Never
volumes:
- name: "test-volume"
hostPath:
path: "/tmp"
type: Directory
driver:
cores: 1
coreLimit: "1200m"
memory: "512m"
labels:
version: 3.5.1
serviceAccount: spark-operator-spark
volumeMounts:
- name: "test-volume"
mountPath: "/tmp"
executor:
cores: 1
instances: 1
memory: "512m"
labels:
version: 3.5.0
volumeMounts:
- name: "test-volume"
mountPath: "/tmp"
```
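Since the manifest above is a `SparkApplication` custom resource, it is deployed to a cluster running the Kubernetes Operator for Apache Spark (spark-operator). The sketch below shows one way to submit it programmatically; it assumes the `kubernetes` and `pyyaml` Python packages and that the spark-operator CRDs are already installed. `kubectl apply -f k8s_spark_job.yaml` achieves the same result.

```python
# Illustrative sketch: submit the SparkApplication manifest above to a cluster
# where spark-operator is installed. Assumes the `kubernetes` and `pyyaml`
# packages; equivalent to `kubectl apply -f k8s_spark_job.yaml`.
import yaml
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

with open("k8s_spark_job.yaml") as f:
    spark_app = yaml.safe_load(f)

api = client.CustomObjectsApi()
api.create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="default",
    plural="sparkapplications",
    body=spark_app,
)

# Check the application status (same as `kubectl get sparkapplications spark-etl`)
status = api.get_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="default",
    plural="sparkapplications",
    name="spark-etl",
)
print(status.get("status", {}))
```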
## Contributing

The project has a separate contribution guide. Please follow the steps listed in [CONTRIBUTING.md](./CONTRIBUTING.md).

## License
[![License](https://img.shields.io/github/license/tuancamtbtx/etl-spark-k8s?style=for-the-badge)](./LICENSE)