Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
![](https://explore-feed.github.com/topics/spark/spark.png)
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-02-15 00:25:38 UTC
https://github.com/turnipdo/spark-standalone-cluster-setup
To facilitate the initial setup of Apache Spark, this repository provides a beginner-friendly, step-by-step guide on setting up a master node and two worker nodes.
Last synced: 24 Dec 2024
https://github.com/abdoufermat5/twitter-analysis
Twitter data analysis using PySpark
data-analysis pyspark spark twitter twitter-api
Last synced: 10 Jan 2025
https://github.com/flynn3103/loadhouse-toolkit
Loading data into the Lakehouse using JSON configuration and utilities for ETL tasks.
Last synced: 14 Feb 2025
https://github.com/pierrekieffer/hbaseconnector
Scala Spark HBase connector with filtered queries
Last synced: 07 Feb 2025
https://github.com/pierrekieffer/dataframesparkstreaming
Easy implementation of Apache Spark Streaming for dataframes
dataframe spark spark-streaming
Last synced: 07 Feb 2025
https://github.com/piero24/big-data_hw_23-24
Exercises in Java and Spark for the Big Data Computing course at unipd
big-data clustering fft java mapreduce sampling spark streaming
Last synced: 08 Jan 2025
https://github.com/zsomborjoel/pyspark-basics
Teaching and learning the functionality of the Spark Python API on dataframes
Last synced: 11 Feb 2025
https://github.com/anras5/nyc-yellow-taxi
Processing data streams with Kafka + Spark
docker google-cloud kafka postgresql spark spark-streaming
Last synced: 21 Jan 2025
https://github.com/aelesbao/yelp-dataset-challenge
apache-spark cassandra spark yelp-dataset
Last synced: 05 Jan 2025
https://github.com/alexcombessie/ensae_scala-spark
Several mini-projects in Scala/Spark for the "Computer Science for the analysis of Big Data" course at ENSAE ParisTech
Last synced: 24 Dec 2024
https://github.com/banknatchapol/us-immigration-data-pipeline
Create a data pipeline for US immigration data using Spark.
Last synced: 27 Jan 2025
https://github.com/yc1999/scalasparkinaction-peopleyoumightknow
Second-degree friends ("people you might know") programming experiment
cloudcomputing peopleyoumightknow scala spark
Last synced: 21 Jan 2025
https://github.com/justinjjlee/simulation-discrete
Employing data transformations and simulations to answer random questions
analytics data data-science julia python simulation spark
Last synced: 28 Jan 2025
https://github.com/sumanthvrao/ipl-spark-analysis
Predict outcomes of IPL cricket matches for the year 2018 using the Spark MLlib framework.
decision-tree kmeans-clustering pyspark spark spark-mllib-library
Last synced: 08 Jan 2025
https://github.com/msampathkumar/scalaprojects
Scala Projects - From basic Scala learning tutorials to Big Data (Apache Spark) projects
Last synced: 01 Jan 2025
https://github.com/bryanbill/tracker
Wildlife animal tracking application
animals handlebars java postgresql spark
Last synced: 26 Dec 2024
https://github.com/sensu-plugins/sensu-plugins-spark
apache-spark metrics monitoring sensu-plugins spark
Last synced: 14 Feb 2025
https://github.com/omr5221/pyspark
data-engineer jupyter-notebook python spark
Last synced: 27 Jan 2025
https://github.com/chucheng92/structuredstreaming
Structured Streaming Demo
Last synced: 01 Feb 2025
https://github.com/ccao-data/service-spark-iasworld
Service for extracting tables from the CCAO system-of-record and uploading them to the Data Department's data warehouse
Last synced: 14 Feb 2025
https://github.com/hvignolo87/spark-examples
Spark examples
apache-spark pyspark python spark
Last synced: 14 Feb 2025
https://github.com/hussein-awala/stream-applications
A repository containing examples of stream processing applications using Spark Structured Streaming, Kafka Streams, and other tools such as Apache Hudi...
hudi kafka kafka-connect kafka-streams spark spark-streaming
Last synced: 01 Feb 2025
https://github.com/vitalibo/aws-glue-java
Simple PoC that demonstrates the use of Java in AWS Glue ETL pipelines.
Last synced: 27 Dec 2024
https://github.com/casschow98/spotify_insights_project
Welcome to the Spotify Insights Data Pipeline Project, where I analyze data from my Spotify listening history.
airflow big-query data-analytics data-engineering docker etl pandas pyspark python song-analysis spark spotify-api terraform
Last synced: 14 Feb 2025
https://github.com/aravind2060/hr_data_analysis_with_spark_structured_api
This assignment helps students learn how to use filtering, grouping, and aggregation operations.
apache-spark docker docker-compose filtering spark
Last synced: 21 Jan 2025
https://github.com/samiksha-khare/crypto-real-time-analysis-using-kafka
This project showcases the process of streaming real-time cryptocurrency data using Kafka, storing the data in a MongoDB database, and visualizing the price trends over time with Python libraries like Matplotlib.
api cryptocurrency docker kafka kafka-consumer kafka-producer kafka-topic matplotlib mongodb nosql python real-time spark streamlit visualization zookeeper
Last synced: 21 Jan 2025
https://github.com/edwin-huber/sparkandlivyonaksspot
Repo to support POC Documentation for the use of Livy, AKS and Azure Spot Instances
Last synced: 14 Jan 2025
https://github.com/vitalibo/distributed-alarm-system
Simple distributed alarm system on top of Apache Spark
Last synced: 27 Dec 2024
https://github.com/riolaf05/spark-elasticsearch-recommendation
Recommendation system using Alternating Least Squares (ALS) and cosine similarity on PySpark and Elasticsearch
collaborative-filtering docker elasticsearch machine-learning pyspark recommendation-system spark
Last synced: 21 Jan 2025
https://github.com/feliciamarlove/streaming-with-scala-and-spark
Related to the "Handling Fast Data with Apache Spark SQL and Streaming" course on Pluralsight: https://app.pluralsight.com/library/courses/apache-spark-sql-fast-data-handling-streaming/exercise-files
data-engineering hive parquet scala spark streaming
Last synced: 11 Feb 2025
https://github.com/shink/spark-ml-algorithm-docker
Spark ML algorithms on docker
Last synced: 01 Feb 2025
https://github.com/mauroslucios/pysparkwithpython
https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession
docker jupyter-notebook linux notebook pyspark python spark visual-studio-code
Last synced: 14 Feb 2025
https://github.com/thanhvie/de_udacity_capstone
Project: This is the capstone project of my nanodegree data engineering program at Udacity
Last synced: 14 Jan 2025
https://github.com/holgerbrandl/spark_image_labeling
Image labeling experiments using apache spark
Last synced: 21 Jan 2025
https://github.com/jakipatryk/spark-persistent-homology
(WIP, not fast enough for any production usage yet) Library for persistent homology computations in Apache Spark.
persistent-homology spark tda topological-data-analysis
Last synced: 30 Dec 2024
https://github.com/yuhexiong/oracle-data-pipeline-spark-python
apache-doris apache-spark doris oracle pipeline spark
Last synced: 21 Jan 2025
https://github.com/ahmedennaifer/iot-streaming-platform
WIP
docker docker-compose iot kafka postgresql python scala spark streaming
Last synced: 09 Feb 2025
https://github.com/n370/spark-quickstart
A barebones seed repository for starting Spark Java micro framework projects.
java micro-framework spark sparkjava sparkjava-framework
Last synced: 21 Jan 2025
https://github.com/bakansm/vithsd
Vietnamese hate speech detection with real-time data from streaming platforms such as YouTube, Facebook, and TikTok.
kafka machine-learning nlp real-time-data spark streaming-data
Last synced: 14 Jan 2025
https://github.com/plave0/pp
Programming paradigms course material.
functional-programming haskell prolog python scala spark
Last synced: 07 Jan 2025
https://github.com/karimosman89/iot-predictive-maintenance
This repository will simulate an IoT-based predictive maintenance system designed to monitor industrial equipment through sensors. It will include data ingestion, processing, and machine learning components to predict potential failures, optimizing maintenance schedules and reducing downtime.
api cloud-platform dashboard data-collection data-processing deployment iot-platform predictive-analytics pressure-sensor real-time-sensor sensors spark temperature-sensor vibration
Last synced: 05 Jan 2025
https://github.com/manoharvit/ecommerce-dive-deep-sales-analysis
In this project, we developed an ETL pipeline using Apache Airflow to process delivery data and track delayed shipments. The pipeline downloads data from an AWS S3 bucket, cleans it using Spark/Spark SQL to identify missing delivery deadlines, and uploads the cleaned dataset back to S3. This ensures efficient delivery performance tracking.
airflow airflow-dags ecommerce elt pyspark s3 s3-bucket spark sql
Last synced: 14 Feb 2025
https://github.com/phelipe-sempreboni/programming
Repository for programming languages of various types.
css html java javascript python spark vba
Last synced: 14 Feb 2025
https://github.com/joekakone/get-started-with-pyspark
PySpark Tutorials for Beginners
Last synced: 14 Jan 2025
https://github.com/abdelmajidlh/spark_practices
Practical use cases for Apache Spark with Scala.
apach-spark apache machine-learning scala spark
Last synced: 27 Jan 2025
https://github.com/oguzhanfatihkucuk/data-analytics-project-kafka-spark
The data in this project was collected in a database using Apache Kafka and processed with Apache Spark Streaming. The project aims to create a forecasting model and analyze sales forecasts per customer.
big-data data data-visualization hadoop kafka ml mlpipeline plt pyhton spark
Last synced: 25 Dec 2024
https://github.com/asolimando/map-spark
Playing around with Map datatype in Spark
Last synced: 01 Feb 2025
https://github.com/marianna-konstantopoulou/analyticsportfolio
Projects focused on data analysis, visualization, and data-driven problem-solving
airflow bigdata classification clustering data-science dataanalytics dataengineering datavisualization kafka logistic-regression machinelearning mongodb python r redis spark sql ssis statistics
Last synced: 25 Dec 2024
https://github.com/leo-the-nardo/combopurifier
Data Pipeline from AWS SQS/S3 to Kubernetes w/ Spark using Airflow, EKS & Data Lakehouse
airflow argocd aws-glue-catalog aws-lake-formation aws-s3 aws-sqs data-lake delta-lake eks minio spark terraform
Last synced: 05 Jan 2025
https://github.com/bousettayounes/real-time-processing-of-users-data
Developing a data pipeline to stream user data from a user generator API, apply necessary transformations, and seamlessly insert the processed data into a storage system
airflow cassandra dataengineering datastreaming docker kafka postgresql spark streaming
Last synced: 05 Jan 2025
https://github.com/aelesbao/twitter-analyser
Analyses streams of tweets using Kafka and Apache Spark
kafka scala spark spark-streaming
Last synced: 05 Jan 2025
https://github.com/clivern/monk
🔥Easy To Use Chef Cookbooks To Automate Boring Stuff
apache automation chef chef-cookbook chef-recipes consul django haproxy infrastructure-automation java laravel linux mysql nginx php python spark symfony
Last synced: 17 Jan 2025
https://github.com/ranimbenmbarek/airplane-crash-data-streaming
The Airplane Crash Data Analysis & Visualization project uses Kafka and Spark for streaming analysis of airplane crashes from 1908 to 2009, enabling real-time insights into trends and patterns in aviation safety.
hbase hdfs kafka kaggle powerbi spark
Last synced: 03 Feb 2025
https://github.com/sevmardi/spark-experiments
Playground for Spark
apache-spark machine-learning spark
Last synced: 25 Dec 2024
https://github.com/ranimbenmbarek/airplane-crash-batch-analysis
The Airplane Crash Data Analysis & Visualization project utilizes Spark for batch processing of historical airplane crashes from 1908 to 2009. This approach enables the analysis of trends in aviation safety by examining patterns in crash occurrences and fatalities within the dataset.
Last synced: 03 Feb 2025
https://github.com/flynn3103/is405.m21-bigdata
Implement Streaming on Big Data using Kafka, HDFS, Spark, ..
Last synced: 21 Jan 2025
https://github.com/ac-gomes/spark-iceberg-hive
apache-iceberg apache-spark hive-metastore iceberg minio spark trino
Last synced: 14 Dec 2024
https://github.com/29dch/myscalacodesaboutbigdata
My Scala learning code with bigdata
actor akka kafka scala spark spark-sql spark-streaming
Last synced: 10 Jan 2025
https://github.com/varunu28/chicago-crime-data-analysis
An analysis of Chicago crime data using Apache Spark
Last synced: 01 Jan 2025
https://github.com/yannbolliger/g5-parallel-sgd-spark
Group 5 project for Systems for Data Science course @ EPFL, 2019.
Last synced: 01 Jan 2025
https://github.com/shrikantnaidu/apache-spark
Spark and Spark ML notebooks
spark spark-ml spark-sql udacity
Last synced: 10 Jan 2025
https://github.com/ilieschibane/projet-iot-cloud-bigdata
Implementation of a pipeline for predicting Parkinson's disease using IoT, Cloud, and Big Data tools
big-data cassandra cloud flask hadoop-hdfs iot kafka machine-learning mongodb mqtt python rest-api sickit-learn spark
Last synced: 21 Jan 2025
https://github.com/diiblo/us_election_project
This project analyzes data from the 2024 US elections with Spark, PostgreSQL, and Jupyter, in a master-slave architecture deployed via Docker.
docker jupiter-notebook postgresql-database pyspark spark
Last synced: 02 Feb 2025
https://github.com/twseptian/apache-pyspark-programming
Big Data Python programming using Apache Spark and PySpark
apache pyspark pyspark-mllib pyspark-notebook pyspark-tutorial spark
Last synced: 26 Dec 2024
https://github.com/amoghkori/working-with-apache-spark-mllib
Uses Apache Spark MLlib to analyze a large car dataset, predict car selling prices, and gain insights into the car market.
amazon-web-services data-analysis data-visualization exploratory-data-analysis linear-regression machine-learning model-selection pyspark python random-forest sagemaker spark
Last synced: 23 Jan 2025
https://github.com/sabaudian/amd_market_basket_analysis
Algorithms for Massive Datasets (AMD) -- Market-baskets analysis project
frequent-itemsets mapreduce market-basket-analysis massive-datasets pyspark python python-3 spark
Last synced: 01 Jan 2025
https://github.com/sandrain/spider2-snapshot-anon
A spark script for processing (large-scale) file system snapshot data.
filesystem magpie spark workload-analysis
Last synced: 10 Jan 2025
https://github.com/zuston/note
the problems and notes
distributed-systems java multithreading python scala spark
Last synced: 01 Feb 2025
https://github.com/nfo94/infraestrutura-cassandra-pd
cassandra jupyter python spark
Last synced: 26 Dec 2024
https://github.com/giordano-lucas/movie-recommender
Movie Recommendation System (KNN) in Scala/Spark
recommander-system scala spark
Last synced: 02 Jan 2025
https://github.com/zemuldo/spark-tutorial
Jupyter notebooks for my Spark tutorials.
Last synced: 27 Jan 2025
https://github.com/amirhnajafiz-university/s7cc03
Third project of Cloud Computing course.
big-data hadoop hadoop-hdfs mapreduce python python3 spark
Last synced: 26 Dec 2024
https://github.com/plandes/docker-spark-service
Extends docker-spark, which creates a Spark cluster in a Docker image, by making it a service.
docker docker-spark dockerfile spark
Last synced: 02 Jan 2025
https://github.com/naramsim/dynamic-twitter-geographical-categorization
A map-reduce implementation for the categorization of Twitter tweets within dynamic geographical boundaries.
Last synced: 26 Dec 2024
https://github.com/akhich551995/data-streaming-project-airflow-kafka-spark-t-cassandra-docker
Building a real-time data streaming pipeline, covering each phase from data ingestion to processing and finally storage. It uses a stack of tools and technologies including Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra, all containerized using Docker.
airflow airflow-dags cassandra docker kafka postgresql python spark zookeeper
Last synced: 21 Jan 2025
https://github.com/saadkh1/real-time_sales_data_pipeline_kafa_spark_cassandra_redash
This repository implements a real-time sales data pipeline leveraging Apache Kafka, Apache Spark, Apache Cassandra, and Redash. It facilitates the efficient ingestion, processing, storage, and visualization of sales data streams.
cassandra fastapi kafka redash spark
Last synced: 21 Jan 2025
https://github.com/yandex-cloud-examples/yc-data-proc-spark-pyspark
Running and managing Spark and PySpark applications in the Yandex Data Proc service.
data-proc pyspark spark yandex-cloud yandexcloud
Last synced: 29 Dec 2024
https://github.com/sowrabh-m/data_processing_using_spark_flink
This project demonstrates data cleaning and processing with Apache Spark and Apache Flink, both locally and on AWS EMR.
aws aws-emr aws-s3 emr-cluster flink flink-stream-processing spark spark-flink spark-streaming
Last synced: 14 Feb 2025
https://github.com/jcguidry/flight-ml-preprocess-gcp
Continuous flight event data processing using Spark Streaming and Delta Lake storage, deployed on a GCP Dataproc cluster.
dataproc deltalake gcp spark spark-streaming
Last synced: 28 Dec 2024
https://github.com/tolgakmbl/sparkwithscalapractices
Spark with Scala to read from MongoDB and print documents to the console
Last synced: 08 Feb 2025
https://github.com/rmodi6/theory-of-database-systems
Homework files for CSE532 - Theory of Database Systems
database-queries hadoop ibm-db2 jdbc map-reduce spark spatial-database sql xpath xquery
Last synced: 11 Jan 2025
https://github.com/ac-gomes/systemctl_spark_jupyter-notebook
systemctl for Spark and Jupyter-notebook
jupyter-notebook spark systemctl systemd
Last synced: 02 Jan 2025
https://github.com/cwienberg/spark-async-map
Helper library for running blocking IO operations in Spark jobs more efficiently
Last synced: 02 Feb 2025
https://github.com/oscarfmdc/spark-flight-delay
A model capable of predicting the arrival delay time of a commercial flight, given a set of parameters known at time of take-off.
Last synced: 01 Feb 2025
https://github.com/elaaatif/jpeg-and-jpeg2000-compression-on-multi-node-cluster-using-hadoop-and-spark
Big Data technologies can be leveraged for efficient, distributed image compression using JPEG2000 (Spark) and JPEG (MapReduce).
cluster hadoop hadoop-cluster hadoop-hdfs hadoop-mapreduce image-compression spark
Last synced: 08 Feb 2025
https://github.com/pandabear-neil/microsoft_fabric_gems
Code snippets, designs, and thoughts on the Microsoft Fabric platform
data-engineering data-factory data-science data-warehouse microsoft-fabric spark
Last synced: 08 Feb 2025
https://github.com/asolimando/trap2017spark
Analysis of TRAP2017 dataset using Spark
artificial-intelligence graph-analysis machine-learning spark
Last synced: 01 Feb 2025
https://github.com/euiyounghwang/spark_job_interface_service
spark_job_interface_service
fastapi spark spark-cluster spark-jobs
Last synced: 17 Jan 2025