Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
![](https://explore-feed.github.com/topics/spark/spark.png)
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-02-13 00:28:14 UTC
https://github.com/yandex-cloud/yc-delta
Delta Lake for Yandex Data Processing
delta delta-lake deltalake spark yandex-cloud
Last synced: 11 Nov 2024
https://github.com/sircamp/spark-pspectrum
P-spectrum embedding and sequence relaxation for NLP in Spark
big-data machine-learning nlp nlp-machine-learning sequence-relaxation spark spark-ml spectrum
Last synced: 20 Jan 2025
https://github.com/exacaster/markdown_frames
Parses Markdown tables into PySpark/pandas DataFrames
Last synced: 11 Nov 2024
https://github.com/colemurray/movie-rec-tutorial
cloud-infrastructure google-dataproc machine-learning spark tutorial
Last synced: 13 Dec 2024
https://github.com/michabirklbauer/mahout_docker
Running Apache Mahout in Docker.
apache docker dockerfile hadoop mahout maven spark
Last synced: 04 Jan 2025
https://github.com/tayeva/satellite-kafka-spark-delta-lake-pipeline-example
Demo App - Satellite Produce Consumer App
cpp17 delta-lake docker docker-compose flatbuffers java kafka parquet scala spark spark-streaming
Last synced: 12 Feb 2025
https://github.com/ging/fiware-cosmos
The Cosmos Generic Enabler enables an easier BigData analysis over context integrated with some of the most popular BigData platforms.
analysis big-data fiware fiware-cosmos flink processing real-time-analytics spark streaming-engine
Last synced: 01 Nov 2024
https://github.com/lgautier/pragmatic-polyglot-data-analysis
Docker container for off-the-shelf jupyter notebook + Python + R + Spark/pyspark + LLVM
docker-container jupyter-notebook python r spark
Last synced: 10 Nov 2024
https://github.com/vanessaaleung/ad-ctr-prediction
Ads Click-Through-Rate Prediction
ctr deep-learning prediction python scikit-learn spark
Last synced: 08 Jan 2025
https://github.com/wlongxiang/pyspark_docker
Run a PySpark cluster with Docker on your local laptop
docker docker-compose pyspark pyspark-docker pyspark-tutorial spark
Last synced: 17 Dec 2024
https://github.com/conema/spark-terraform
This project creates a Hadoop and Spark cluster on AWS with Terraform
aws cluster hadoop hadoop-cluster hcl spark spark-clusters terraform
Last synced: 20 Nov 2024
https://github.com/mobiletelesystems/spark-dialect-extension
Package extending the default dialect capabilities for Spark.
etl etl-components plugin-system spark
Last synced: 11 Oct 2024
https://github.com/easonlai/databricks_odbc_connection_to_azure_sql_db_with_azure_ad_user_access_token
Making ODBC connection from Databricks (Azure Databricks) to Azure SQL Database with Azure AD User Access Token.
azure azuread azuredatabricks azuresql azuresqldb bigdata data-analysis dataanalysis dataanalytics databricks databricks-notebooks datascience microsoft microsoft-azure microsoftazure odbc odbc-driver pandas pyodbc spark
Last synced: 08 Jan 2025
https://github.com/wittline/livyc
Apache Spark as a Service with Apache Livy Client
apache-livy apache-spark big-data data-engineering dataengineering docker livy-client livy-docker pyhton spark
Last synced: 14 Oct 2024
https://github.com/chen0040/spark-opt-moea
Distributed Multi-Objective Evolutionary Computation Framework for Spark
moea multi-objective-optimization nsga-ii spark
Last synced: 09 Feb 2025
https://github.com/vivek-bombatkar/dataworkssummit2018_spark_ml
Hands-on introduction to basic machine learning techniques with Apache Spark ML using the cloud.
apache-spark linear-regression machine-learning spark workshop
Last synced: 08 Nov 2024
https://github.com/longshilin/spark-wordcount
Spark word count example | built in Eclipse + Maven + Scala Project + Spark
helloworld maven scala scala-programming spark wordcount
Last synced: 10 Nov 2024
https://github.com/neo4j-field/bigquery-connector
Bi-directional connectivity between Google BigQuery and Neo4j AuraDS
arrow-flight bigquery neo4j protobuf python spark
Last synced: 23 Dec 2024
https://github.com/dustin-decker/lognom
Simple script for processing streaming data from Redis using Apache Spark
elasticsearch kafka redis spark
Last synced: 29 Jan 2025
https://github.com/jlgarridol/tfm-fis-if
Big Data Architecture of queues for real time video processing
big-data docker kafka parkinsons-disease spark streaming streaming-video
Last synced: 13 Jan 2025
https://github.com/multivacplatform/multivac-kaggle-titanic
Simple example of the Titanic competition using Spark 2.2
kaggle-competition machine-learning scala spark
Last synced: 12 Jan 2025
https://github.com/majobasgall/smote-mr
SMOTE-MR: A distributed Synthetic Minority Oversampling Technique (SMOTE) for Big Data that applies a MapReduce-based approach. SMOTE-MR is categorized as an `approximated/non-exact` solution. There is also an `exact` solution called SMOTE-BD, written by the same author (see: https://github.com/majobasgall/smote-bd)
big-data imbalanced-data machile-learning scala smote spark
Last synced: 08 Jan 2025
https://github.com/puharesource/simplemavenrepository
A simple self hosted maven repository solution, written in Kotlin using the SparkJava framework.
kotlin maven repository spark sparkjava
Last synced: 01 Jan 2025
https://github.com/abronte/pysparkgateway
Connect to remote Spark clusters seamlessly.
apache-spark bigdata pyspark python spark
Last synced: 28 Oct 2024
https://github.com/sebastianruizm/spark-kafka-cassandra
Demo Spark Structured Streaming + Apache Kafka + Apache Cassandra
cassandra docker kafka spark structured-streaming
Last synced: 11 Nov 2024
https://github.com/highoncarbs/lumberjack
:pick: Search and analyse your logs efficiently with Lumberjack
analysis flask log logging python spark web-dashboard
Last synced: 14 Oct 2024
https://github.com/denuvosoftwaresolutions/fighting-bots-at-scale
Fighting Bots at Scale: Identifying Bottlenecks & Best Practice
Last synced: 25 Dec 2024
https://github.com/ashton-sidhu/sysmon-extract
Extract logs based on events from Sysmon. Comes as a package, CLI, and UI.
data-science dataengineering infosec spark streamlit sysmon threat-intelligence threathunting
Last synced: 09 Nov 2024
https://github.com/abdelmajidlh/spark-functionality-repo
This GitHub repository contains a detailed document on the basics of the Scala language.
apache apachespark databricks databricks-notebooks pyspark python3 scala spark
Last synced: 05 Feb 2025
https://github.com/s8sg/spark-py-submit
A Python library for submitting Spark jobs to a YARN cluster on different distributions (currently CDH, HDP)
cdh hdfs hdp python-library spark spark-clusters spark-job
Last synced: 01 Feb 2025
https://github.com/radeity/spark-proxy
Push-based calculation for Spark applications
distributed-computing spark volunteer-computing
Last synced: 01 Feb 2025
https://github.com/spycsh/runspec
An Android streaming running app with a backend based on Kafka + Spark + MongoDB
android iot kafka leafletjs mongodb restlet spark stompwebsocket ubiquitous-computing
Last synced: 05 Dec 2024
https://github.com/bonigarcia/spark-examples
Collection of Spark examples using Python
cassandra influxdb kafka python spark spark-streaming
Last synced: 08 Feb 2025
https://github.com/smsraj2001/stream-batch-processing-kafka-spark
A project that simulates real-time queries with Kafka and performs stream and batch processing of the simulated queries with Spark. It follows the lambda architecture, in which Kafka is the publisher and Spark is the subscriber.
batch-processing kafka kafka-topics lambda-architecture mysql-database no-api pub-sub pyspark python3 realtime spark streaming ubuntu2204 zookeeper
Last synced: 01 Jan 2025
https://github.com/gabfr/truck-data-wrangler
ELT (Extract, Load, Transform) process of accelerometer/gyroscope events with Apache Spark (w/ Structured Streaming) and TimescaleDB
data-classification spark stream timescaledb
Last synced: 07 Dec 2024
https://github.com/adrigrillo/nycsparktaxi
Apache Spark application to get the top ten frequent routes and profitable areas
big-data nyc parquet-files python spark taxi
Last synced: 10 Feb 2025
https://github.com/felipekunzler/spark-twitter-analysis
Analyse a Twitter dataset with Spark and visualize the results on a React dashboard.
Last synced: 30 Oct 2024
https://github.com/chaokunyang/athena
A task scheduler for spark, flink, mapreduce, java, python, bash
flink hadoop mapreduce spark task-manager task-scheduler
Last synced: 19 Nov 2024
https://github.com/timvisee/hhs-p7-movie-recommendation-engine
:movie_camera: Big data project for college (HHS) period 7
algorithm hadoop recommendation-engine spark
Last synced: 15 Jan 2025
https://github.com/omarhimada/floyo-ml-scala
Distributed ML for eCommerce platforms (recommendations, churn prediction, segmentation) written in Scala, using Spark MLlib, Elasticsearch and AWS SDK
Last synced: 09 Feb 2025
https://github.com/guiferviz/tuberia
Data engineering meets software engineering
data data-engineering expectations pipeline python spark
Last synced: 13 Feb 2025
https://github.com/fiqryq/sparkar-pekerjaan-impian
Instagram Filter Using Spark AR
augmented-reality-applications facebook instagram spark
Last synced: 26 Jan 2025
https://github.com/fiqryq/spark-minimal-gray
🥰 Simple Instagram Filter Using Spark AR Studio by Facebook.
Last synced: 26 Jan 2025
https://github.com/mrcolorr/supreme-pancake
Big Data Management project: data collection from a network of sensors was simulated (Kafka), then processed (Spark) and stored (Cassandra) in a distributed and efficient way.
big-data bigdata cassandra cassandra-cluster cassandra-database cloud cloud-computing distributed-computing distributed-database distributed-storage distributed-systems hdfs kafka maven maven-pom spark zerotier zerotier-network zerotier-one
Last synced: 13 Nov 2024
https://github.com/ahmetfurkandemir/hepsiburada-data-engineering-project
Hepsiburada Data Engineering Project
Last synced: 17 Jan 2025
https://github.com/adovasoft-rnd/ci-recharge
composer test
ci4 cli codeigniter codeigniter4 commandline controller db library make migration mode mysql php seeds spark sql
Last synced: 14 Oct 2024
https://github.com/julienpeloton/mini_spark_broker
Design and proof-of-concept for a Broker for astronomy using Apache Spark
docker kafka python spark spark-structured-streaming
Last synced: 11 Oct 2024
https://github.com/hdfgroup/hdf5-spark-connector
HDF5 Connector for Apache Spark
Last synced: 19 Dec 2024
https://github.com/enoy19/keyboard-light-composer-mc-connector
Minecraft Forge Mod to access stats in Minecraft within the Keyboard Light Composer (https://github.com/enoy19/keyboard-light-composer)
composer forge g910 keyboard light logitech minecraft mod orion rgb spark spectrum
Last synced: 17 Jan 2025
https://github.com/dllllb/ml-pipelines-tutorial
SciKit-Learn vs Apache Spark pipelines
machine-learning scikit-learn spark
Last synced: 19 Jan 2025
https://github.com/tomwhite/disq-original
A library for manipulating bioinformatics sequencing formats in Apache Spark.
bioinformatics genomics ngs sequencing spark
Last synced: 10 Feb 2025
https://github.com/curycu/sparkstudy
Example code for Spark SQL data wrangling
Last synced: 05 Nov 2024
https://github.com/anicolaspp/maprdbconnector
An independent MapR-DB Connector for Apache Spark that fully utilizes MapR-DB secondary indexes
database-connector mapr mapr-db maprdb-spark ojai scala spark
Last synced: 16 Nov 2024
https://github.com/harshoza36/movielens_pyspark
MovieLens Dataset analysis using Hadoop and Pyspark
big-data-analytics hadoop movielens movielens-data-analysis pyspark spark spark-sql
Last synced: 10 Jan 2025
https://github.com/ahmetfurkandemir/trendyol-data-engineering-technical-case-study
Trendyol Data Engineering Technical Case Study.
apache-spark case-study data-engineering debian docker maven scala spark trendyol trendyoltech ubuntu
Last synced: 17 Jan 2025
https://github.com/aromoh/keras-distributed-streaming
Distributed Keras model for making predictions of sentiment from Spanish sentences in stream context using Spark Streaming and Apache Kafka
cnn-keras kafka keras keras-tensorflow pyspark-notebook sentiment-analysis spark spark-streaming
Last synced: 03 Feb 2025
https://github.com/jacopodl/spark
Low level network library :satellite: :zap:
c low-level network network-programming networking raw raw-data raw-sockets spark
Last synced: 31 Jan 2025
https://github.com/anant/example-cassandra-spark-elasticsearch
cassandra datastax docker elasticsearch scala spark spark-sql
Last synced: 19 Jan 2025
https://github.com/kmohamedalie/big-data-hadoop-spark-lab
Big Data🛢️ with Hadoop🐘 and Spark⭐ lab🧪🥼
big-data coursera data-engineering docker hadoop ibm kubernetes spark
Last synced: 02 Jan 2025
https://github.com/mtumilowicz/big-data-scala-spark-batch-workshop
Introduction to Spark Batch processing.
batch-processing big-data big-data-processing spark spark-sql workshop workshop-materials
Last synced: 04 Jan 2025
https://github.com/dimajix/docker-spark
Repository for building Docker containers for Spark
Last synced: 05 Jan 2025
https://mgrojo.github.io/adasearch/
Custom search engine for the Ada programming language
ada custom-search-google search-engine spark spark-ada
Last synced: 28 Nov 2024
https://github.com/lepetitbloc/sparksd
:sparkler: Sparks wallet Docker container
cryptocurrency dockerfile masternode spark wallet
Last synced: 26 Jan 2025
https://github.com/derlin/workshop-data-sciences
Material for a two-day workshop introducing data science for big data
big-data data-science hdfs hive spark workshop workshop-materials zeppelin
Last synced: 20 Jan 2025
https://github.com/jms0522/hadoop_system
✅ Sets up a Hadoop ecosystem and builds a pipeline.
Last synced: 11 Oct 2024
https://github.com/dvelkow/real_time_bulgarian_news_aggregator
An ETL-driven web scraping and data visualization project that aggregates news from multiple Bulgarian news sources in real-time and creates an interactive dashboard with the fetched data.
Last synced: 12 Oct 2024
https://github.com/gmartinezramirez-old/data-science-portafolio
:notebook: [Active] Portfolio of data science projects. Using: Python, PyTorch, Spark, TensorFlow, scikit-learn, Keras. Includes classification, regression, time series, NLP, and deep learning, among others.
data-science data-science-learning data-science-notebook data-science-portfolio hadoop jupyter-notebook keras notebook pandas pyspark python pytorch r sci-kit spark tensorflow
Last synced: 05 Dec 2024
https://github.com/gunantos/php-spark
PHP Server Develop
php server serverless-framework spark
Last synced: 13 Jan 2025
https://github.com/mcddhub/mcdd-big-data-study
Study project for big data (Hadoop, Zookeeper, Kafka, Flink, Spark)
big-data data-processing docker flink hadoop kafka spark zookeeper
Last synced: 09 Feb 2025
https://github.com/boazmohar/pysparkutils
A collection of utilities for handling pySpark's SparkContext
Last synced: 09 Feb 2025
https://github.com/officiallysingh/spring-boot-starter-spark
Spark Spring Boot starter
spark spring spring-boot springboot starter starters
Last synced: 25 Dec 2024
https://github.com/yoongoing/bigdata_pyspark
⚡️Let's do data mining using Spark, an open MapReduce platform⚡️
bigdata dataminig jupyter-notebook mapreduce mapreduce-python pyspark spark
Last synced: 09 Feb 2025
https://github.com/viniciusmsousa/pyspark-ds-toolbox
A Pyspark companion for data science tasks.
Last synced: 09 Feb 2025
https://github.com/tpvasconcelos/sparkypandy
It's not spark, it's not pandas, it's just awkward...
dataframe pandas pyspark spark
Last synced: 05 Nov 2024
https://github.com/r13i/spark-record-deduplicating
Data cleansing problem statement: data in a record is often duplicated. How do we find the duplicate probability? [Work In Progress]
big-data deduplication record-linkage records-management scala spark
Last synced: 02 Feb 2025
https://github.com/omar-besbes/football-big-data
A comprehensive solution for real-time football analytics, using Apache Spark on YARN for both streaming and batch processing, Hadoop HDFS for distributed storage, Kafka for real-time data ingestion, RethinkDB for live data updates, and Next.js for data visualization, as well as a custom-built search engine.
batch-processing hadoop kafka nextjs rethinkdb spark streaming t3-stack yarn
Last synced: 20 Jan 2025
https://github.com/triandicAnt/TwitterSentimentAnalytics
Basic Twitter Sentiment Analytics using Apache Spark Streaming APIs and Python by processing live tweets from Twitter.
machine-learning python sentimental-analysis spark twitter twitter-api twitter-sentiment-analytics
Last synced: 23 Oct 2024
https://github.com/manuparra/taller-bigdata-con-r
Big Data workshop with Apache Spark + R from Databricks cloud
bigdata cloudcomputing databricks r spark sparkr
Last synced: 27 Dec 2024
https://github.com/aiday-mar/spark-recommendation-engine
Movie recommendation system built using Spark and Scala
recommendation-system scala spark university-project
Last synced: 05 Jan 2025
https://github.com/lucivpav/dnbc-scala
Parallel implementation of dynamic naive Bayesian classifier
apache-spark bayesian-networks ctu-fit dnbc fit-ctu naive-bayes-classifier scala spark
Last synced: 12 Feb 2025
https://github.com/j-sephb-lt-n/useful-code-snippets
A searchable collection of useful little pieces of code
aws bash cloud compute-engine docker dockerfile ec2 gcp graph pyspark python r-language shell spark virtual-machine
Last synced: 28 Dec 2024
https://github.com/inbravo/spark-movie-lens
Various examples of analytics using Apache Spark
Last synced: 02 Feb 2025
https://github.com/dharaneeshvrd/spark-examples
Spark Examples
pyspark spark spark-example spark-sql spark-streaming spark-streaming-kafka spark-structured-streaming
Last synced: 07 Nov 2024
https://github.com/hibadaoud/real-time-flight-data-kibana-visualization
Real-Time Flight Data Visualization Dashboard: Interactive web application for real-time flight tracking and airport analytics. Powered by Kafka, Pyspark, Elasticsearch, Kibana, Express NodeJs, MongoDB, and Docker.
css docker elasticsearch html javascript jwt-authentication kafka kibana nodejs real-time spark
Last synced: 10 Feb 2025
https://github.com/hellomaxime/data-platform-on-kubernetes
Open Source Data Platform on Kubernetes
bigdata data data-pipeline dbt druid etl kubernetes ml open-source platform spark superset
Last synced: 28 Dec 2024
https://github.com/erikerlandson/pyspark-ubi
Minimalist install of pyspark on top of Red Hat UBI
container-image pyspark spark ubi
Last synced: 06 Jan 2025
https://github.com/brunneis/minebench
Proof-of-Work based benchmark written in Python that works with real Bitcoin data
benchmark bitcoin mining proof-of-work spark
Last synced: 26 Jan 2025
https://github.com/longi94/lsde2017-p3-flight-visualization
Animated interactive flight visualization
Last synced: 06 Jan 2025
https://github.com/fdmsantos/aws-twitter-data-analytics
Project to learn data analytics in AWS using Twitter data
aws data-analytics data-engineering data-science data-visualization flink spark terraform
Last synced: 26 Jan 2025
https://github.com/burhanahmed1/big-data-analytics
Practice tasks in the Python programming language using Hadoop, MRJob, and PySpark for Big Data Analytics.
apache-spark hadoop hadoop-mapreduce jupyter-notebook mrjob pyspark python spark spark-sql sparksql
Last synced: 11 Oct 2024