Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
![](https://explore-feed.github.com/topics/spark/spark.png)
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-02-14 00:24:01 UTC
https://github.com/ltossian/bike-sales-data-metrics
Processing, storage, analysis, and visualization of a large CSV file and real-time bike sales data.
fastapi grafana hadoop kafka postgresql python spark
Last synced: 11 Feb 2025
https://github.com/iversonson/spark-lite-document-translator
This project aims to provide a fast and efficient document translation solution using Spark Lite's machine learning APIs
Last synced: 17 Jan 2025
https://github.com/crazybber/go-jupyter
Exploring big data with Spark in JupyterLab
bigdata jupyter-notebook jupyterlab rdd spark
Last synced: 28 Jan 2025
https://github.com/azlinrusnan/movielens_data_analysis_with_mongodb_and_cassandra
This project presents an analysis of the MovieLens 100k dataset using Apache Spark integrated with MongoDB and Cassandra. The dataset includes user information, movie ratings, and movie details, providing a comprehensive basis for exploring user preferences and movie popularity.
cassandra ml-100k mongodb python spark
Last synced: 17 Jan 2025
https://github.com/divithraju/divith-raju-data-mining
This project focuses on customer segmentation using data mining techniques, specifically K-Means clustering, to classify customers into distinct groups based on their purchasing behaviors. The goal is to analyze customer data and segment them into clusters for targeted marketing strategies and better customer relationship management.
algorthims analytics apache business client connector data dataarchitecture database dataengineering datamining datascience hadoop k-means-clustering mysql project project-repository pyspark python3 spark
Last synced: 17 Jan 2025
https://github.com/inf0rmatiker/model-service
A service providing federated model training for spatially-segregated data.
Last synced: 08 Jan 2025
https://github.com/sebastianruizm/pyspark-graphframes
Data analysis with GraphFrames and PySpark
Last synced: 08 Jan 2025
https://github.com/zoltan-nz/learning-spark
Playing with Apache Spark
apache-spark java map-reduce spark
Last synced: 22 Jan 2025
https://github.com/cleberzumba/data-analysis-with-apache-spark-and-databricks
San Francisco Fire Calls. A Spark application built on Databricks using PySpark and SQL, covering common data analytics patterns and operations on a San Francisco Fire Department Calls dataset.
Last synced: 16 Nov 2024
https://github.com/chukwuemekaaham/uber-gcp-etl-project
Data Engineering Zoomcamp Final Project
bigquery cloud-storage csv docker-compose gcp jupyter-notebook looker-studio mageai python spark spreadsheets terraform
Last synced: 10 Jan 2025
https://github.com/mohnoor94/learningspark
My journey to learn Spark using Scala <3
learning learning-by-doing scala spark sparkscala
Last synced: 22 Jan 2025
https://github.com/ejw-data/google-colab-etl-amazon-reviews
Using Spark and Amazon RDS to clean and summarize Amazon reviews to determine the usefulness of product feedback
Last synced: 22 Jan 2025
https://github.com/sandeepkundalwal/network-load-analysis-using-apache-spark
[CS561: MapReduce & BigData] Streaming Service using Apache Spark
big-data css html java javascript mapreduce spark
Last synced: 02 Feb 2025
https://github.com/tupol/spark-utils-demos
Demos for the tupol/spark-utils project together with a storyline
configuration demo framework scala spark
Last synced: 17 Jan 2025
https://github.com/tupol/spark-apps.seed.g8
Create Spark applications projects based on the spark-utils library.
application scala spark template
Last synced: 17 Jan 2025
https://github.com/amthorn/qutex
A basic Queue Management System, interactable via several mediums, that resembles a mutex.
ava bot bots cisco cisco-spark cisco-spark-bot mutex queue queuebot queues qutex spark thorn webex webex-teams
Last synced: 13 Nov 2024
https://github.com/samuele-lolli/steam-recommendation-system
A basic recommendation system built with Scala and Spark
Last synced: 04 Feb 2025
https://github.com/pprattis/road-safety-database-with-jdbc-and-spark-rdd
A JDBC application that runs queries in pgAdmin to simulate the functionality of the UK Ministry of Transport's database using Apache Spark RDD for query implementation.
computer-science index java jdbc jdbc-database partitions pgadmin postgresql program query spark spark-sql sparkjava sql student
Last synced: 04 Feb 2025
https://github.com/pprattis/insurance-company-database-with-jdbc-and-spark-rdd
A JDBC application that runs queries in pgAdmin to simulate the functionality of an insurance company's database using Apache Spark RDD for query implementation.
computer-science java jdbc jdbc-database partitioning partitions postgresql program query spark spark-sql sparkjava sql student
Last synced: 04 Feb 2025
https://github.com/rockfordwei/anagram
Anagram Solution Servers in Different Languages/Frameworks
anagram hdfs java javascript php python server spark swift
Last synced: 12 Jan 2025
https://github.com/alimarzouk/paris-aq
ELTL pipeline to monitor air quality in the Paris Île-de-France area
airflow airquality big-data bigquery dataengineering gcs spark
Last synced: 22 Jan 2025
https://github.com/abtinz/cloud-computing
cassandra cassandra-driver cloud-computing docker elasticsearch hadoop hdfs kubernetes redis spark
Last synced: 21 Jan 2025
https://github.com/m-molaei/twitter-sentiment-analysis-using-apache-spark-
Sentiment analysis using deep learning models and FastText embedding on Apache Spark
apache-cassandra apache-spark big-data fasttext fasttext-embeddings mongodb pyspark rdd sentiment-analysis sentiment140-dataset spark
Last synced: 21 Jan 2025
https://github.com/angeligareta/spark-hadoop-hbase-overview
First lab for Data-Intensive Computing course at KTH where we are introduced to Apache Spark MLlib and Spark SQL, Hadoop, and HBase.
apache-spark data-intensive hadoop hbase hbase-table id2221 kth scala spark spark-mllib spark-sql
Last synced: 22 Jan 2025
https://github.com/angeligareta/spark-kafka-cassandra-overview
Second lab for Data-Intensive Computing course at KTH where we use Apache Kafka, Spark, and Cassandra to practice stream processing.
apache-kafka apache-spark cassandra cassandra-server data-intensive id2221 kafka kafka-topic kth scala spark stream-processing
Last synced: 22 Jan 2025
https://github.com/angeligareta/machine-learning-spark
Assignment for Scalable Machine Learning which aims to study the basics of regression and classification in Spark.
apache-spark machine-learning scala spark spark-classification spark-ml spark-mllib spark-regression spark-scala
Last synced: 22 Jan 2025
https://github.com/tomwhite/single-cell-spark-demo
Experiments on Single Cell data from 10x Genomics using Apache Spark.
bioinformatics genomics single-cell spark
Last synced: 17 Jan 2025
https://github.com/georgegkonis/spark-decentralized-query-processing
Project for the academic course "Decentralized Data Technologies"
big-data decentralized-data jupyter python query-optimization spark
Last synced: 12 Feb 2025
https://github.com/coreyauger/ashley-madison-spark
Spark data analysis for the Ashley Madison dataset.
Last synced: 16 Jan 2025
https://github.com/wadiebenabdouh/socialmedia-usage-pipeline
Data from Kaggle, containing a wide range of users of different ages, genders, and interests.
apache-spark data-visualization jupyter-notebook kaggle pyspark python spark
Last synced: 16 Jan 2025
https://github.com/azlinrusnan/iris_pyspark_analysis
Iris Classification using PySpark
apache pyspark-mllib python r spark
Last synced: 31 Dec 2024
https://github.com/ralgond/bigdata-example
Examples, details, and caveats for Hadoop, Hive, and Spark
bigdata hadoop hdfs hive map-reduce spark
Last synced: 09 Jan 2025
https://github.com/firefly55lm/superconductors_critical_temperature_analysis
Academic project for Big Data Laboratory
chemistry docker machine-learning physics pyspark spark
Last synced: 25 Jan 2025
https://github.com/kampi/particle-mqtt
MQTT client implementation for TCP-capable devices (i.e. Argon, Photon) from Particle IoT.
cpp mqtt particle-argon particle-iot particle-swarm-optimization spark
Last synced: 21 Jan 2025
https://github.com/fiware/tutorials.big-data-spark
FIWARE 306: Real-time Processing of Context Data using Apache Spark
apache-spark big-data-analytics fiware fiware-cosmos orion-spark-connector spark tutorial
Last synced: 17 Nov 2024
https://github.com/aamend/spark-archetype
A Maven archetype that offers a convenient way to create fully fledged Spark libraries at minimal cost
Last synced: 29 Jan 2025
https://github.com/harborzeng/gangsutils
A pack of useful tools for Scala Spark projects
Last synced: 29 Jan 2025
https://github.com/snexus/streaming-playground
Exploring streaming design patterns with Kafka and Spark Structured Streaming
kafka kafka-producer python spark spark-streaming
Last synced: 23 Jan 2025
https://github.com/brooksian/twittersentimentsparkcorenlp
Twitter Sentiment Analysis Using Spark CoreNLP
nlp-machine-learning spark sparksql zeppelin-notebook
Last synced: 18 Nov 2024
https://github.com/ngone51/spark-read
A personal record of reading through the Spark (v2.4) source code.
Last synced: 18 Nov 2024
https://github.com/darenr/spark-pca
Dimensionality reduction; scatter, hexbin, and KDE plots
Last synced: 05 Feb 2025
https://github.com/evegen55/car_number_recognizer
computer-vision neural-networks spark
Last synced: 22 Jan 2025
https://github.com/dunnkers/pyspark-bucketmap
Easily group pyspark data into buckets and map them to different values.
bucketizer categorizer pyspark pyspark-mllib python python3 spark
Last synced: 29 Jan 2025
https://github.com/izeigerman/twinkle
The collection of helpers and utils for Apache Spark
Last synced: 08 Feb 2025
https://github.com/giuliosmall/twitter-trending-topics-pipeline
This project demonstrates trending topic detection using Apache Spark and MinIO. It processes Twitter JSON data with PySpark, leveraging distributed data processing and cloud storage. The entire project is containerized with Docker for easy deployment across architectures.
docker minio nlp pyspark pytest spacy spark streamlit
Last synced: 05 Feb 2025
https://github.com/annettaqi/spam-detection
Using stochastic gradient descent to classify emails as spam or ham
spark stochastic-gradient-descent
Last synced: 13 Feb 2025
https://github.com/binwenwu/oge-computation-ogc
A computing project corresponding to an OGC style API
Last synced: 13 Feb 2025
https://github.com/abdellatif-laghjaj/big-data-project
Big data and image processing project
big-data facedetection image-preprocessing image-processing pyspark realtime-detection spark
Last synced: 17 Jan 2025
https://github.com/mahi97/internship-elk-loganalysis
Report on the development and deployment of an ELK Stack for MCI BI software and servers to perform real-time log analysis
elasticsearch kafka kibana latex logstash mesos redis spark
Last synced: 05 Feb 2025
https://github.com/danimonsalve/scala_spark
A Scala application that uses Apache Spark to classify job postings according to the programming languages they mention. The goal is to demonstrate different classification and data processing techniques on large volumes of data.
Last synced: 17 Jan 2025
https://github.com/vubacktracking/stream-data-processing
Streaming data processing pipeline using Spark, PostgreSQL, Debezium, Kafka, MinIO, Delta Lake, Trino, and DBeaver
dbeaver debezium delta-lake kafka spark spark-streaming stream-processing trino
Last synced: 17 Jan 2025
https://github.com/chukwuemekaaham/data-engineering-zoomcamp
Datatalks Club Free Data Engineering Zoomcamp Project
bigquery dbt docker-compose duckdb gcp gcp-cloud-storage github-actions jupyter-notebook kafka linux looker-studio mageai pandas postgresql prefect python redpanda risingwave spark terraform
Last synced: 17 Jan 2025
https://github.com/kingyiusuen/udacity-data-engineering-nanodegree
Projects for Udacity's Data Engineering Nanodegree
airflow aws aws-athena aws-glue aws-redshift aws-s3 cassandra data-engineering spark
Last synced: 23 Jan 2025
https://github.com/antonio-f/big-data-analysis-with-scala-and-spark
Coding assignments from the course "Big Data Analysis with Scala and Spark" (Coursera).
big-data bigdata coursera data-analysis scala spark
Last synced: 06 Feb 2025
https://github.com/talmago/pyspark-loglikelihood
PySpark Loglikelihood Similarity Examples
mahout pyspark recommendation-engine spark
Last synced: 03 Feb 2025
https://github.com/tuancamtbtx/spark-build-tool
A tool for generating Spark jobs
Last synced: 13 Feb 2025
https://github.com/renardeinside/databricks-jobs-jsonnet
Example project with Databricks jobs and configuration management via jsonnet
Last synced: 06 Feb 2025
https://github.com/shayartt/streaming-orders
Project to stream real-time orders and apply ETL pipelines and analytics using Databricks, Kafka, and AWS
databricks etl kafka python spark spark-streaming
Last synced: 13 Feb 2025
https://github.com/benitomartin/de-hotel-reviews
Data Engineering Hotel Reviews
cicd data-engineering dbt gcp jupyter-notebook looker prefect python spark sql terraform
Last synced: 31 Dec 2024
https://github.com/sunsided/spark-atlas
Spark vs. MongoDB Atlas
data-processing docker jupyter-notebook mongodb mongodb-atlas pyspark python spark
Last synced: 13 Feb 2025
https://github.com/imvision12/real-time-tracking
Real time bus tracking using MTA bus API
flask hadoop javascript leaflet python spark
Last synced: 08 Feb 2025
https://github.com/cn-docker/spark-master
Spark Master Docker Image
docker-image spark spark-master
Last synced: 27 Jan 2025
https://github.com/msmenegol/datapark
Datapark: a self-hosted data platform
airflow data data-engineering data-science jupyter-notebook machine-learning minio mlflow postgresql spark
Last synced: 06 Feb 2025
https://github.com/nwtgck/spark-wikipedia-dump-loader
Wikipedia Dump Loader for Spark
Last synced: 06 Feb 2025
https://github.com/kayvansol/sparkonkubernetes
Spark On Kubernetes via helm chart
apache apache-spark bitnami docker-compose helm-charts java kubernetes pyspark python scala spark
Last synced: 17 Jan 2025
https://github.com/stabrise/scaledp
ScaleDP is an Open-Source extension of Apache Spark for Document Processing
doctrocr easyocr huggingface-models machine-learning nlp nlp-machine-learning ocr ocr-python ocr-recognition pdf pdf-document-processor spark suryaocr
Last synced: 03 Dec 2024
https://github.com/neshkeev/spark-graphs-demo
Distributed Graphs Processing with Apache Spark
apache-spark distributed-computing graph graph-algorithms graphlab graphs pregel spark
Last synced: 31 Jan 2025
https://github.com/nhsdigital/mps_diagnostics
Interpretable metadata for the results of NHS England record linkage
data-linkage data-science nhs-digital nhs-england pyspark record-linkage spark
Last synced: 23 Dec 2024
https://github.com/starhe/balm
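A big-data PaaS component adapter built on the Spring Boot stack. It adapts to DolphinScheduler, Hadoop, Spark, Hive, Impala, HBase, Kafka, StarRocks, ClickHouse, and Neo4j with one click, and is operated through a standard REST interface. Simple to use and convenient for secondary development and integration.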
clickhouse dolphinscheduler hadoop hbase hive impala kafka neo4j spark spring starrocks
Last synced: 13 Feb 2025
https://github.com/pomadchin/geotiff-layer
GeoTrellis GeoTiff layer demo
aws-lambda cog geotiff-layer geotrellis geotrellis-tile-server gis spark tiff
Last synced: 17 Jan 2025
https://github.com/zkan/machine-learning-with-spark-and-zeppelin
Machine Learning with Apache Spark & Zeppelin
Last synced: 12 Feb 2025
https://github.com/wgierke/distributed_data_analytics
Solutions for the hands-on sessions of the course "Distributed Data Analytics" at Hasso-Plattner-Institute using Akka and Spark.
akka data-analytics distributed inclusion-dependency spark
Last synced: 09 Feb 2025
https://github.com/jimthompson5802/datascience_containers
Personal docker images for various data science software stacks
data-science docker h2oai jupyter-notebook kubernetes python rstudio-servers spark
Last synced: 13 Feb 2025
https://github.com/darule0/yarndiff
A rudimentary command line utility for contrasting Apache YARN container logs.
diff difference diffing hadoop hadoop-mapreduce hive log4j mapreduce pig spark yarn yarn2
Last synced: 23 Dec 2024
https://github.com/nashtech-labs/spark-on-mesos
deployment mesos spark word-count
Last synced: 23 Dec 2024
https://github.com/casassg/thesis
Undergraduate final thesis: Big Data Analytics on Container Orchestrated Systems
casassg-thesis cassandra docker kubernetes latex spark thesis zeppelin
Last synced: 17 Dec 2024
https://github.com/darule0/sparkdiff
A rudimentary command line utility for contrasting Apache Spark event logs.
apache-spark compare-files diff difference diffing spark spark-sql spark-streaming sparksql
Last synced: 06 Feb 2025
https://github.com/opt-nc/opt-temps-attente-agences-camel
Pull data from opt-temps-attente-agences-api and store it in various systems
camel datascience dataviz glia innovation kafka opensearch relation-client spark
Last synced: 12 Dec 2024
https://github.com/luisfalva/ophelia
Ophelian On Mars! More than a simple framework.
dask dataframe ophelia ophelia-spark rdd spark spark-ml spark-mllib spark-streaming
Last synced: 17 Dec 2024
https://github.com/bomada/sparkify
This project is the final Capstone project of the Udacity Data Scientist Nanodegree program. The aim is to learn how to manipulate realistic datasets with Spark to engineer relevant features for predicting churn. Input data comes from the fictitious music streaming service Sparkify (similar to Spotify and Pandora).
churn ml music portfolio python spark streaming
Last synced: 09 Feb 2025
https://github.com/mxagar/spark_big_data_guide
This repository contains my personal guide on Spark and topics related to Big Data.
big-data hadoop machine-learning spark
Last synced: 23 Dec 2024
https://github.com/neo4j-field/end-to-end-fraud-demo
An example of how to load the data backing Zach's awesome Fraud Demo
Last synced: 23 Dec 2024
https://github.com/adelin-info/tp_datacloud
Architecture and development of large-scale distributed systems
hadoop java map-reduce scala spark yarn zookeeper
Last synced: 30 Jan 2025
https://github.com/melezhik/sparrowdo-spark
Quick Spark Installer for CentOS and Docker
Last synced: 23 Dec 2024
https://github.com/ev2900/emr_studio_stock_price_demo
Demo EMR Studio notebook using PySpark to explore Stock Price Data
Last synced: 23 Dec 2024
https://github.com/ev2900/glue_spark_history_server
Host a Docker container for the Spark history server / Spark UI of AWS Glue jobs
aws glue spark spark-history-server spark-ui
Last synced: 23 Dec 2024
https://github.com/chrispyl/learning-latent-representations-for-nitrogen-response-rate-prediction
Implementation for the paper 'Learning latent representations for operational nitrogen response rate prediction'
Last synced: 17 Jan 2025
https://github.com/vicnesterenko/apache-spark-labs
Base programs with datasets
apache-spark kpi-fict kpi-ua spark
Last synced: 10 Jan 2025
https://github.com/pierrekieffer/sparkstreaming_kafkaconsumer
Kafka consumer example based on Spark Streaming, with messages formatted into a Spark DataFrame
kafka kafka-consumer scala spark spark-streaming
Last synced: 07 Feb 2025