{"id":21721066,"url":"https://github.com/msamij/zig-flow","last_synced_at":"2026-05-07T06:33:24.854Z","repository":{"id":263101971,"uuid":"889341387","full_name":"msamij/zig-flow","owner":"msamij","description":"Data Engineering pipeline.","archived":false,"fork":false,"pushed_at":"2025-05-08T20:43:30.000Z","size":572,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-18T23:41:19.356Z","etag":null,"topics":["apache-spark","dataprocessing","distributed-computing"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/msamij.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-11-16T05:48:14.000Z","updated_at":"2025-05-08T20:43:33.000Z","dependencies_parsed_at":"2025-01-01T11:22:59.421Z","dependency_job_id":"d896d015-e445-4e5d-9bc7-b93e834ac317","html_url":"https://github.com/msamij/zig-flow","commit_stats":null,"previous_names":["msamij/zig-flow"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/msamij/zig-flow","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msamij%2Fzig-flow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msamij%2Fzig-flow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msamij%2Fzig-flow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msamij%2Fzig-flow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/
msamij","download_url":"https://codeload.github.com/msamij/zig-flow/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msamij%2Fzig-flow/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32726005,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-07T02:14:30.463Z","status":"ssl_error","status_checked_at":"2026-05-07T02:14:29.405Z","response_time":62,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","dataprocessing","distributed-computing"],"created_at":"2024-11-26T02:13:33.041Z","updated_at":"2026-05-07T06:33:24.849Z","avatar_url":"https://github.com/msamij.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Zigflow: A pipeline for datasets processing and for analytics\n\n## This project is a demonstration of building an end-to-end data processing pipeline using modern tools\n\n### Dataset: [The Netflix prize](https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data)\n\n#### Project has five components\n\n1. Data Ingestion\n2. Processing of data sets using Apache Spark\n3. Analysis\n4. Storing datasets\n5. Visualization\n\n#### 1. Datasets are ingested from the file system using Spark\n\n#### 2. 
Datasets are then processed, which includes cleaning, sanitizing, and transforming them into a structured format\n\n#### 3. When transformation is finished, the pipeline performs some basic analysis, such as calculating averages, distributions, and various descriptive stats\n\n#### 4. Datasets are then stored on the file system in the (output) folder\n\n#### 5. Matplotlib is used for visualizing processed data and Streamlit for building a web dashboard\n\n**_Note_**: I've tested sparkpipeline on a 4th Gen Intel Core i5 (4 cores) processor with 8 GB of RAM. To run the pipeline smoothly, the system should have at least 8 GB of RAM. You can run it on a much slower system, but the pipeline will take much longer to finish processing.\n\n**_Spark Configuration_**: Can be changed in the [scripts/run.py](scripts/run.py) file by modifying the DRIVER_MEMORY and SPARK_MASTER constants.\nI am currently running this locally with 3 threads in the same JVM process with 6 GB of driver memory.\nUse local[*] if your system has enough processing power for faster processing.\n\n## Running the application\n\n### First, create a virtual environment in the project root and install the Python packages from requirements.txt\n\n```shell\n# 1. Create an environment.\npython3 -m venv .venv\n\n# 2. Activate the environment.\nsource .venv/bin/activate\n\n# 3. Install packages.\npip install -r requirements.txt\n```\n\n### Download the dataset and extract it to project root/datasets\n\n### 1. To run the standalone Spark Java pipeline\n\n#### Requirements\n\n1. Java 17 or above.\n2. Apache Spark 3.5.1 or above.\n3. Python 3.8 or above.\n\n#### Then run the following from the project root\n\n```shell\npython scripts/run.py\n```\n\n### 2. To run the pipeline with the scheduler\n\nFrom the project root, run the following Python file.\n**I have scheduled the pipeline to run every 20 minutes.**\n\n```shell\npython scheduler/scheduler.py\n```\n\n### 3. 
To manage all the dependencies of the sparkpipeline, I've created a Dockerfile in the project root to run the pipeline via Docker\n\n#### Run the following from the project root to build the image and run the container\n\n```shell\nsudo docker build -t sparkpipeline:latest .\n\nsudo docker run -it --name sparkpipeline-container sparkpipeline\n\n# Once inside the Docker container, run the following to run the Spark pipeline.\ncd sparkpipeline\n\nmvn clean \u0026\u0026 mvn install\n\n# Adjust number of threads and driver memory based on system config.\nspark-submit --master local[3] --driver-memory 6g --class com.msamiaj.zigflow.Main /app/sparkpipeline/target/sparkpipeline-1.0-SNAPSHOT.jar\n```\n\n### 4. To use the plotter\n\n#### Run the following from the project root to plot the datasets\n\n```shell\npython plot/plot.py\n```\n\n### 5. To visualize datasets on the web using Streamlit\n\n#### Run the following from the project root to plot the datasets on the web\n\n```shell\nstreamlit run web/visualization.py\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmsamij%2Fzig-flow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmsamij%2Fzig-flow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmsamij%2Fzig-flow/lists"}