Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
![](https://explore-feed.github.com/topics/spark/spark.png)
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-02-13 00:28:14 UTC
https://github.com/colemurray/movie-rec-tutorial
cloud-infrastructure google-dataproc machine-learning spark tutorial
Last synced: 13 Dec 2024
https://github.com/anicolaspp/mapr-data-gen
Data generator for MapR Data Platform
data mapr mapr-db mapr-es mapr-streams maprdb parquet scala spark
Last synced: 16 Jan 2025
https://github.com/guiferviz/tuberia
Data engineering meets software engineering
data data-engineering expectations pipeline python spark
Last synced: 13 Feb 2025
https://github.com/julienpeloton/mini_spark_broker
Design and proof-of-concept for a Broker for astronomy using Apache Spark
docker kafka python spark spark-structured-streaming
Last synced: 11 Oct 2024
https://github.com/conema/spark-terraform
This project creates a Hadoop and Spark cluster on Amazon AWS with Terraform
aws cluster hadoop hadoop-cluster hcl spark spark-clusters terraform
Last synced: 20 Nov 2024
https://github.com/majobasgall/smote-mr
SMOTE-MR: a distributed Synthetic Minority Oversampling Technique (SMOTE) for Big Data that applies a MapReduce-based approach. SMOTE-MR is categorized as an `approximated/non-exact` solution. The author has also written an `exact` solution called SMOTE-BD (see https://github.com/majobasgall/smote-bd)
big-data imbalanced-data machile-learning scala smote spark
Last synced: 08 Jan 2025
https://github.com/adrigrillo/nycsparktaxi
Apache Spark application to get the top ten frequent routes and profitable areas
big-data nyc parquet-files python spark taxi
Last synced: 10 Feb 2025
https://github.com/yandex-cloud/yc-delta
Delta Lake for Yandex Data Processing
delta delta-lake deltalake spark yandex-cloud
Last synced: 11 Nov 2024
https://github.com/gabfr/truck-data-wrangler
ELT (Extract, Load, Transform) process of accelerometer/gyroscope events with Apache Spark (w/ Structured Streaming) and TimescaleDB
data-classification spark stream timescaledb
Last synced: 07 Dec 2024
https://github.com/lgautier/pragmatic-polyglot-data-analysis
Docker container for off-the-shelf jupyter notebook + Python + R + Spark/pyspark + LLVM
docker-container jupyter-notebook python r spark
Last synced: 10 Nov 2024
https://github.com/anant/example-cassandra-spark-elasticsearch
cassandra datastax docker elasticsearch scala spark spark-sql
Last synced: 19 Jan 2025
https://github.com/easonlai/databricks_odbc_connection_to_azure_sql_db_with_azure_ad_user_access_token
Making ODBC connection from Databricks (Azure Databricks) to Azure SQL Database with Azure AD User Access Token.
azure azuread azuredatabricks azuresql azuresqldb bigdata data-analysis dataanalysis dataanalytics databricks databricks-notebooks datascience microsoft microsoft-azure microsoftazure odbc odbc-driver pandas pyodbc spark
Last synced: 08 Jan 2025
https://github.com/radeity/spark-proxy
push-based calculation for spark application
distributed-computing spark volunteer-computing
Last synced: 01 Feb 2025
https://github.com/sebastianruizm/spark-kafka-cassandra
Demo Spark Structured Streaming + Apache Kafka + Apache Cassandra
cassandra docker kafka spark structured-streaming
Last synced: 11 Nov 2024
https://github.com/adovasoft-rnd/ci-recharge
composer test
ci4 cli codeigniter codeigniter4 commandline controller db library make migration mode mysql php seeds spark sql
Last synced: 14 Oct 2024
https://github.com/jacopodl/spark
Low level network library :satellite: :zap:
c low-level network network-programming networking raw raw-data raw-sockets spark
Last synced: 31 Jan 2025
https://github.com/anicolaspp/maprdbconnector
An independent MapR-DB Connector for Apache Spark that fully utilizes MapR-DB secondary indexes
database-connector mapr mapr-db maprdb-spark ojai scala spark
Last synced: 16 Nov 2024
https://github.com/longshilin/spark-wordcount
spark wordcount example | built in Eclipse+Maven+Scala Project+Spark
helloworld maven scala scala-programming spark wordcount
Last synced: 10 Nov 2024
https://github.com/hdfgroup/hdf5-spark-connector
HDF5 Connector for Apache Spark
Last synced: 19 Dec 2024
https://github.com/ahmetfurkandemir/trendyol-data-engineering-technical-case-study
Trendyol Data Engineering Technical Case Study.
apache-spark case-study data-engineering debian docker maven scala spark trendyol trendyoltech ubuntu
Last synced: 17 Jan 2025
https://github.com/mrcolorr/supreme-pancake
Big Data Management project: The collection of data from a network of sensors was simulated (kafka), which then had to be processed (spark) and stored (cassandraDB) in a distributed and efficient way.
big-data bigdata cassandra cassandra-cluster cassandra-database cloud cloud-computing distributed-computing distributed-database distributed-storage distributed-systems hdfs kafka maven maven-pom spark zerotier zerotier-network zerotier-one
Last synced: 13 Nov 2024
https://github.com/omarhimada/floyo-ml-scala
Distributed ML for eCommerce platforms (recommendations, churn prediction, segmentation) written in Scala, using Spark MLlib, Elasticsearch and AWS SDK
Last synced: 09 Feb 2025
https://github.com/ging/fiware-cosmos
The Cosmos Generic Enabler enables an easier BigData analysis over context integrated with some of the most popular BigData platforms.
analysis big-data fiware fiware-cosmos flink processing real-time-analytics spark streaming-engine
Last synced: 01 Nov 2024
https://github.com/chen0040/spark-opt-moea
Distributed Multi-Objective Evolutionary Computation Framework for Spark
moea multi-objective-optimization nsga-ii spark
Last synced: 09 Feb 2025
https://github.com/ashton-sidhu/sysmon-extract
Extract logs based off events from sysmon. Comes as a package, cli and ui.
data-science dataengineering infosec spark streamlit sysmon threat-intelligence threathunting
Last synced: 09 Nov 2024
https://github.com/exacaster/markdown_frames
Markdown tables parsing to pySpark/Pandas DataFrames
Last synced: 11 Nov 2024
https://github.com/michabirklbauer/mahout_docker
Running Apache Mahout in Docker.
apache docker dockerfile hadoop mahout maven spark
Last synced: 04 Jan 2025
https://github.com/denuvosoftwaresolutions/fighting-bots-at-scale
Fighting Bots at Scale: Identifying Bottlenecks & Best Practice
Last synced: 25 Dec 2024
https://github.com/curycu/sparkstudy
example codes for spark sql data wrangling
Last synced: 05 Nov 2024
https://github.com/puharesource/simplemavenrepository
A simple self hosted maven repository solution, written in Kotlin using the SparkJava framework.
kotlin maven repository spark sparkjava
Last synced: 01 Jan 2025
https://github.com/abdelmajidlh/spark-functionality-repo
This GitHub repository contains a detailed document on the basics of the Scala language.
apache apachespark databricks databricks-notebooks pyspark python3 scala spark
Last synced: 05 Feb 2025
https://github.com/mobiletelesystems/spark-dialect-extension
Package extending the default dialect capabilities for Spark.
etl etl-components plugin-system spark
Last synced: 11 Oct 2024
https://github.com/chaokunyang/athena
A task scheduler for spark, flink, mapreduce, java, python, bash
flink hadoop mapreduce spark task-manager task-scheduler
Last synced: 19 Nov 2024
https://github.com/s8sg/spark-py-submit
A Python library to submit Spark jobs to a YARN cluster on different distributions (currently CDH, HDP)
cdh hdfs hdp python-library spark spark-clusters spark-job
Last synced: 01 Feb 2025
https://github.com/vanessaaleung/ad-ctr-prediction
Ads Click-Through-Rate Prediction
ctr deep-learning prediction python scikit-learn spark
Last synced: 08 Jan 2025
https://github.com/smsraj2001/stream-batch-processing-kafka-spark
A project that simulates real-time queries with Kafka and performs stream and batch processing of the simulated queries with Spark. It follows the lambda architecture, with Kafka as the publisher and Spark as the subscriber
batch-processing kafka kafka-topics lambda-architecture mysql-database no-api pub-sub pyspark python3 realtime spark streaming ubuntu2204 zookeeper
Last synced: 01 Jan 2025
https://github.com/jlgarridol/tfm-fis-if
Big Data Architecture of queues for real time video processing
big-data docker kafka parkinsons-disease spark streaming streaming-video
Last synced: 13 Jan 2025
https://github.com/abronte/pysparkgateway
Connect to remote Spark clusters seamlessly.
apache-spark bigdata pyspark python spark
Last synced: 28 Oct 2024
https://github.com/dllllb/ml-pipelines-tutorial
SciKit-Learn vs Apache Spark pipelines
machine-learning scikit-learn spark
Last synced: 19 Jan 2025
https://github.com/sircamp/spark-pspectrum
P-spectrum embedding and sequence relaxation for NLP in Spark
big-data machine-learning nlp nlp-machine-learning sequence-relaxation spark spark-ml spectrum
Last synced: 20 Jan 2025
https://github.com/multivacplatform/multivac-kaggle-titanic
Simple example of Titanic competition by Spark 2.2
kaggle-competition machine-learning scala spark
Last synced: 12 Jan 2025
https://github.com/harshoza36/movielens_pyspark
MovieLens Dataset analysis using Hadoop and Pyspark
big-data-analytics hadoop movielens movielens-data-analysis pyspark spark spark-sql
Last synced: 10 Jan 2025
https://github.com/spycsh/runspec
an android streaming running app with backend based on kafka+spark+mongodb
android iot kafka leafletjs mongodb restlet spark stompwebsocket ubiquitous-computing
Last synced: 05 Dec 2024
https://github.com/tayeva/satellite-kafka-spark-delta-lake-pipeline-example
Demo App - Satellite Produce Consumer App
cpp17 delta-lake docker docker-compose flatbuffers java kafka parquet scala spark spark-streaming
Last synced: 12 Feb 2025
https://github.com/bonigarcia/spark-examples
Collection of Spark examples using Python
cassandra influxdb kafka python spark spark-streaming
Last synced: 08 Feb 2025
https://github.com/tomwhite/disq-original
A library for manipulating bioinformatics sequencing formats in Apache Spark.
bioinformatics genomics ngs sequencing spark
Last synced: 10 Feb 2025
https://github.com/imlegend19/vidspark
VidSpark is a prototype video CMS backend system powered by spark and elasticsearch
celery elasticsearch python redis scala spark
Last synced: 14 Jan 2025
https://github.com/wittline/livyc
Apache Spark as a Service with Apache Livy Client
apache-livy apache-spark big-data data-engineering dataengineering docker livy-client livy-docker pyhton spark
Last synced: 14 Oct 2024
https://github.com/dustin-decker/lognom
Simple script for processing streaming data from Redis using Apache Spark
elasticsearch kafka redis spark
Last synced: 29 Jan 2025
https://github.com/aromoh/keras-distributed-streaming
Distributed Keras model for making predictions of sentiment from Spanish sentences in stream context using Spark Streaming and Apache Kafka
cnn-keras kafka keras keras-tensorflow pyspark-notebook sentiment-analysis spark spark-streaming
Last synced: 03 Feb 2025
https://github.com/felipekunzler/spark-twitter-analysis
Analyse a Twitter dataset with Spark and visualize the results on a React dashboard.
Last synced: 30 Oct 2024
https://github.com/wlongxiang/pyspark_docker
Run pyspark cluster with docker on your local laptop
docker docker-compose pyspark pyspark-docker pyspark-tutorial spark
Last synced: 17 Dec 2024
https://github.com/neo4j-field/bigquery-connector
Bi-directional connectivity between Google BigQuery and Neo4j AuraDS
arrow-flight bigquery neo4j protobuf python spark
Last synced: 23 Dec 2024
https://github.com/highoncarbs/lumberjack
:pick: Search and analyse your logs efficiently with Lumberjack
analysis flask log logging python spark web-dashboard
Last synced: 14 Oct 2024
https://github.com/timvisee/hhs-p7-movie-recommendation-engine
:movie_camera: Big data project for college (HHS) period 7
algorithm hadoop recommendation-engine spark
Last synced: 15 Jan 2025
https://github.com/vivek-bombatkar/dataworkssummit2018_spark_ml
Hands-on introduction to basic Machine Learning techniques with Apache Spark ML using the cloud.
apache-spark linear-regression machine-learning spark workshop
Last synced: 08 Nov 2024
https://github.com/ahmetfurkandemir/hepsiburada-data-engineering-project
Hepsiburada Data Engineering Project
Last synced: 17 Jan 2025
https://github.com/enoy19/keyboard-light-composer-mc-connector
Minecraft Forge Mod to access stats in Minecraft within the Keyboard Light Composer (https://github.com/enoy19/keyboard-light-composer)
composer forge g910 keyboard light logitech minecraft mod orion rgb spark spectrum
Last synced: 17 Jan 2025
https://github.com/fiqryq/sparkar-pekerjaan-impian
Instagram Filter Using Spark AR
augmented-reality-applications facebook instagram spark
Last synced: 26 Jan 2025
https://github.com/ichowdhury01/match
A social networking platform that allows users to find friends with similar interests in their area.
geolocation-api jdbc maven mysql pbkdf2 spark
Last synced: 06 Feb 2025
https://github.com/ashishgopalhattimare/parallel-concurrent-and-distributed-programming-in-java
Parallel, Concurrent, and Distributed Programming in Java | Coursera
block-isolation boruvka-algorithm concurrent-programming critical-section distributed-programming java-8 kafka locks mapreduce-java mpi parallel-programming rice-university spark synchronization threads
Last synced: 21 Jan 2025
https://github.com/aveek-saha/cricket-score-predictor
A Big data application to predict the outcome of a T20 cricket match.
big-data big-data-analytics clustering pyspark spark spark-mllib
Last synced: 24 Dec 2024
https://github.com/ugurcanerdogan/machine-learning-with-spark
BBM469*ASG3 - Machine Learning with Spark
apache-spark data-science machine-learning spark
Last synced: 12 Feb 2025
https://github.com/cwienberg/spark-sorting-helpers
Helper library for using secondary sorting in Spark RDD and Dataset operations
Last synced: 23 Jan 2025
https://github.com/stefen-taime/investissement
Jenkins Delta pipeline
delta-lake jenkins-pipeline minio spark
Last synced: 23 Jan 2025
https://github.com/j-sephb-lt-n/useful-code-snippets
A searchable collection of useful little pieces of code
aws bash cloud compute-engine docker dockerfile ec2 gcp graph pyspark python r-language shell spark virtual-machine
Last synced: 28 Dec 2024
https://github.com/bluejoe2008/hippo-rpc
Hippo Transport Library enhances spark-commons with easy stream management & handling
Last synced: 10 Feb 2025
https://github.com/lmouhib/auto-register-spark-ui-k8s
A lightweight operator to automatically expose the Spark UI and manage its ingress when running Spark on Kubernetes
spark spark-kubernetes spark-sql spark-streaming spark-ui
Last synced: 10 Feb 2025
https://github.com/pedropark99/introd-pyspark
An open and introductory book for the Python API of Apache Spark (pyspark)
Last synced: 14 Oct 2024
https://github.com/alexioannides/py-readme-snippets
This repository contains snippets of writing (in Markdown) on topics relating to various flavours of Python development projects.
Last synced: 17 Jan 2025
https://github.com/hb-chen/spark-elasticsearch-recommender
Zeppelin v0.8.0 notebook demonstrating how to build a recommender system with Spark v2.3.2 + Elasticsearch v6.3.2
elasticsearch recommender spark zeppelin
Last synced: 08 Jan 2025
https://github.com/dimajix/docker-spark
Repository for building Docker containers for Spark
Last synced: 05 Jan 2025
https://github.com/angelcervera/poc-drivingdistance
Proof of concept to implement a service to calculate the driving distance using osm network
akka openstreetmap osm osm4scala scala spark
Last synced: 10 Feb 2025
https://github.com/thanaraklee/dataflow-with-gcp
This project demonstrates the workflow of a Data Engineer. It utilizes the Google Cloud Platform and Google Colab as the main tools.
airflow apache-spark data-engineering etl pandas spark
Last synced: 25 Dec 2024
https://github.com/gaelfoppolo/self-service-data-analytics
Data analysis made for business users
aws big-data data-analytics hadoop spark
Last synced: 03 Feb 2025
https://github.com/tranthe170/nyc-taxi-pipeline
Builds a data lakehouse with open source technology. Supports an end-to-end data pipeline, from source data on AWS S3 to the lakehouse and visualization.
airflow delta-lake hive lakehouse presto python s3 spark superset
Last synced: 17 Jan 2025
https://github.com/badoo/hadoop-xargs
Util to run heterogeneous applications on Hadoop synchronously
Last synced: 12 Nov 2024
https://github.com/vasnake/spark.ml.spatialjointransformer
spark.ml.transformer: join two datasets using spatial relations
geospatial join ml-pipeline python scala spark spark-ml spatial transformer
Last synced: 03 Jan 2025
https://github.com/wtanaka/ansible-role-apache-spark
Ansible role to install Apache Spark
ansible ansible-galaxy ansible-role ansible-roles apache-spark batch galaxy mapreduce spark streaming
Last synced: 23 Jan 2025
https://github.com/pankajsingh09/data_engineering_using_aws
This Repository contains the contents related to Data Engineering Using AWS
aws data-ingestion dataengineering event-bridge lambda-functions pipeline pycharm-ide pyspark python s3 spark
Last synced: 12 Feb 2025
https://github.com/earthquakesan/twittertrends
Twitter Trends is a Spark Streaming example application
Last synced: 17 Jan 2025
https://github.com/kmohamedalie/big-data-hadoop-spark-lab
Big Data🛢️ with Hadoop🐘 and Spark⭐ lab🧪🥼
big-data coursera data-engineering docker hadoop ibm kubernetes spark
Last synced: 02 Jan 2025
https://github.com/spratiher9/valido
PySpark ⚡ dataframe workflow ⚒ validator
apache apache-spark bigdata databricks decorators pyspark python3 spark testing
Last synced: 01 Feb 2025
https://github.com/akarce/udacity-data-pipeline-with-airflow
Udacity Data Engineering Nanodegree Program, Data Pipeline with Airflow project using MinIO and Postgresql.
airflow minio postgresql pyspark spark
Last synced: 12 Oct 2024
https://github.com/burhanahmed1/big-data-analytics
Practice tasks in Python programming language using Hadoop, MRJob, PySpark for Big Data Analytics.
apache-spark hadoop hadoop-mapreduce jupyter-notebook mrjob pyspark python spark spark-sql sparksql
Last synced: 11 Oct 2024
https://github.com/fpopic/gg-interview-challenge
(Interview) GG Interview Challenge in Scala/Spark
apache-spark json logstash parsing regex scala spark sparksql
Last synced: 10 Jan 2025
https://github.com/alvarogarcia7/bank-kata-kotlin
Bank pet project, in kotlin. See interests as topics
api-first api-standard bank-kata blackbox-testing etude finite-state-machine gradle gradlew hateoas junit junit5 kata kotlin multimodule paypal-rest-api practice spark sparkjava trikitrok with-client
Last synced: 10 Jan 2025
https://github.com/jldbc/big-data
Coursework from Big Data (CS3390) -- Machine Learning tasks performed using Hadoop, MapReduce, and Spark
big-data hadoop pagerank recommender-system spark
Last synced: 04 Jan 2025
https://github.com/hussaintaj-w/spark_submit_project
An easy to use script that automatically adds files to the spark-submit command.
Last synced: 23 Jan 2025