https://github.com/msamij/zig-flow

Data Engineering pipeline.
https://github.com/msamij/zig-flow

apache-spark dataprocessing distributed-computing

Last synced: about 1 month ago
JSON representation

Data Engineering pipeline.

Host: GitHub
URL: https://github.com/msamij/zig-flow
Owner: msamij
Created: 2024-11-16T05:48:14.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-05-08T20:43:30.000Z (about 1 year ago)
Last Synced: 2025-06-18T23:41:19.356Z (12 months ago)
Topics: apache-spark, dataprocessing, distributed-computing
Language: Java
Homepage:
Size: 559 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# Zigflow: A pipeline for datasets processing and for analytics

## This project is a demonstration of building an end-to-end data processing pipeline using modern tools

### Dataset: [The Netflix prize](https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data)

#### Project has five components

1. Data Ingestion
2. Processing of data sets using Apache Spark
3. Analysis
4. Storing datasets
5. Visualization

#### 1. Datasets are ingested from the file system using Spark

#### 2. Datasets are then processed which includes cleaning, sanitizing and transforming it into structured format

#### 3. When transformation is finised it then performs some basic analysis such as calculating averages, distributions and various descriptive stats

#### 4. Datasets are then stored on file system in the (output) folder

#### 5. Matplotlib is used for visualizing processed data and streamlit for building a web-dashboard

**_Note_**: I've tested sparkpipeline on an Intel 4th Gen core i5 (4) cores processor having 8gb ram, to run the pipeline smoothly system must have atleat 8gb of ram you can run it on much slower system however it will take much more time for pipeline to finish processing.

**_Spark Configuration_**: Can be changed in [scripts/run.py](scripts/run.py) file. By modifying DRIVER_MEMORY and SPARK_MASTER constants,
I am currently running this locally with 3 threads on same JVM process with 6gb of driver memory.
use local[*] if your system have enough processing power for faster processing.

## Running the application

### First install python packages from requirements.txt file by creating a virtual environment in project root

```shell
# 1. Creates an environment.
python3 -m venv .venv

# 2. Activate the environment.
source .venv/bin/activate

# 3. Install packages.
pip install -r requirements.txt
```

### Download the dataset and extract it to project root/datasets

### 1. To run the standalone Spark Java pipeline

#### Requirements

1. Java 17 or above.
2. Apache spark 3.5.1 or above.
3. Python 3.8 or above.

#### Then run the following when in project root

```shell
python scripts/run.py
```

### 2. To run the pipeline with the scheduler

When in project root run the following python file.
**I have scheduled the pipeline to run every 20 minutes.**

```shell
python scheduler/scheduler.py
```

### 3. To manage all the dependecies of the sparkpipeline I've created a Dockerfile in project root to run the pipeline via docker

#### Run the following when project root to build the image and run the container

```shell
sudo docker build -t sparkpipeline:latest .

sudo docker run -it --name sparkpipeline-container sparkpipeline

# When inside the docker container run the following to run spark pipeline.
cd sparkpipeline

mvn clean & mvn install

# Adjust number of threads and driver memory based on system config.
spark-submit --master local[3] --driver-memory 6g --class com.msamiaj.zigflow.Main /app/sparkpipeline/target/sparkpipeline-1.0-SNAPSHOT.jar
```

### 4. To use plotter

#### Run the following when project root to plot the datasets

```shell
python plot/plot.py
```

### 5. To use visualize datasets on web using streamlit

#### Run the following when project root to plot the datasets on the web

```shell
streamlit run web/visualization.py
```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/msamij/zig-flow

Awesome Lists containing this project

README