Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
![](https://explore-feed.github.com/topics/spark/spark.png)
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-02-11 00:28:31 UTC
https://github.com/mobiletelesystems/spark-dialect-extension
Package extending the default dialect capabilities for Spark.
etl etl-components plugin-system spark
Last synced: 11 Oct 2024
https://github.com/ging/fiware-cosmos
The Cosmos Generic Enabler makes big data analysis over context easier and integrates with some of the most popular big data platforms.
analysis big-data fiware fiware-cosmos flink processing real-time-analytics spark streaming-engine
Last synced: 01 Nov 2024
https://github.com/anicolaspp/maprdbconnector
An independent MapR-DB Connector for Apache Spark that fully utilizes MapR-DB secondary indexes
database-connector mapr mapr-db maprdb-spark ojai scala spark
Last synced: 16 Nov 2024
https://github.com/tayeva/satellite-kafka-spark-delta-lake-pipeline-example
Demo App - Satellite Produce Consumer App
cpp17 delta-lake docker docker-compose flatbuffers java kafka parquet scala spark spark-streaming
Last synced: 02 Jan 2025
https://github.com/harshoza36/movielens_pyspark
MovieLens dataset analysis using Hadoop and PySpark
big-data-analytics hadoop movielens movielens-data-analysis pyspark spark spark-sql
Last synced: 10 Jan 2025
https://github.com/omarhimada/floyo-ml-scala
Distributed ML for eCommerce platforms (recommendations, churn prediction, segmentation) written in Scala, using Spark MLlib, Elasticsearch and AWS SDK
Last synced: 09 Feb 2025
https://github.com/longshilin/spark-wordcount
Spark word count example | built with Eclipse + Maven + Scala Project + Spark
helloworld maven scala scala-programming spark wordcount
Last synced: 10 Nov 2024
https://github.com/michabirklbauer/mahout_docker
Running Apache Mahout in Docker.
apache docker dockerfile hadoop mahout maven spark
Last synced: 04 Jan 2025
https://github.com/ahmetfurkandemir/hepsiburada-data-engineering-project
Hepsiburada Data Engineering Project
Last synced: 17 Jan 2025
https://github.com/abdelmajidlh/spark-functionality-repo
This GitHub repository contains a detailed document on the basics of the Scala language.
apache apachespark databricks databricks-notebooks pyspark python3 scala spark
Last synced: 05 Feb 2025
https://github.com/curycu/sparkstudy
Example code for Spark SQL data wrangling
Last synced: 05 Nov 2024
https://github.com/multivacplatform/multivac-kaggle-titanic
Simple example of the Kaggle Titanic competition using Spark 2.2
kaggle-competition machine-learning scala spark
Last synced: 12 Jan 2025
https://github.com/highoncarbs/lumberjack
:pick: Search and analyse your logs efficiently with Lumberjack
analysis flask log logging python spark web-dashboard
Last synced: 14 Oct 2024
https://github.com/fiqryq/sparkar-pekerjaan-impian
Instagram Filter Using Spark AR
augmented-reality-applications facebook instagram spark
Last synced: 26 Jan 2025
https://github.com/vanessaaleung/ad-ctr-prediction
Ads Click-Through-Rate Prediction
ctr deep-learning prediction python scikit-learn spark
Last synced: 08 Jan 2025
https://github.com/anant/example-cassandra-spark-elasticsearch
cassandra datastax docker elasticsearch scala spark spark-sql
Last synced: 19 Jan 2025
https://github.com/aromoh/keras-distributed-streaming
Distributed Keras model for making predictions of sentiment from Spanish sentences in stream context using Spark Streaming and Apache Kafka
cnn-keras kafka keras keras-tensorflow pyspark-notebook sentiment-analysis spark spark-streaming
Last synced: 03 Feb 2025
https://github.com/fiqryq/spark-minimal-gray
🥰 Simple Instagram filter using Spark AR Studio by Facebook.
Last synced: 26 Jan 2025
https://github.com/dllllb/ml-pipelines-tutorial
SciKit-Learn vs Apache Spark pipelines
machine-learning scikit-learn spark
Last synced: 19 Jan 2025
https://github.com/exacaster/markdown_frames
Parses Markdown tables into PySpark/pandas DataFrames
Last synced: 11 Nov 2024
https://github.com/imlegend19/vidspark
VidSpark is a prototype video CMS backend system powered by spark and elasticsearch
celery elasticsearch python redis scala spark
Last synced: 14 Jan 2025
https://github.com/yandex-cloud/yc-delta
Delta Lake for Yandex Data Processing
delta delta-lake deltalake spark yandex-cloud
Last synced: 11 Nov 2024
https://github.com/adovasoft-rnd/ci-recharge
composer test
ci4 cli codeigniter codeigniter4 commandline controller db library make migration mode mysql php seeds spark sql
Last synced: 14 Oct 2024
https://github.com/jlgarridol/tfm-fis-if
Big data architecture of queues for real-time video processing
big-data docker kafka parkinsons-disease spark streaming streaming-video
Last synced: 13 Jan 2025
https://github.com/jacopodl/spark
Low level network library :satellite: :zap:
c low-level network network-programming networking raw raw-data raw-sockets spark
Last synced: 31 Jan 2025
https://github.com/wittline/livyc
Apache Spark as a Service with Apache Livy Client
apache-livy apache-spark big-data data-engineering dataengineering docker livy-client livy-docker pyhton spark
Last synced: 14 Oct 2024
https://github.com/s8sg/spark-py-submit
A Python library for submitting Spark jobs to a YARN cluster on different distributions (currently CDH, HDP)
cdh hdfs hdp python-library spark spark-clusters spark-job
Last synced: 01 Feb 2025
https://github.com/sebastianruizm/spark-kafka-cassandra
Demo Spark Structured Streaming + Apache Kafka + Apache Cassandra
cassandra docker kafka spark structured-streaming
Last synced: 11 Nov 2024
https://github.com/timvisee/hhs-p7-movie-recommendation-engine
:movie_camera: Big data project for college (HHS) period 7
algorithm hadoop recommendation-engine spark
Last synced: 15 Jan 2025
https://github.com/tomwhite/disq-original
A library for manipulating bioinformatics sequencing formats in Apache Spark.
bioinformatics genomics ngs sequencing spark
Last synced: 10 Feb 2025
https://github.com/chaokunyang/athena
A task scheduler for spark, flink, mapreduce, java, python, bash
flink hadoop mapreduce spark task-manager task-scheduler
Last synced: 19 Nov 2024
https://github.com/chen0040/spark-opt-moea
Distributed Multi-Objective Evolutionary Computation Framework for Spark
moea multi-objective-optimization nsga-ii spark
Last synced: 09 Feb 2025
https://github.com/majobasgall/smote-mr
SMOTE-MR: A distributed Synthetic Minority Oversampling Technique (SMOTE) for Big Data that applies a MapReduce-based approach. SMOTE-MR is categorized as an `approximated/non-exact` solution. There is also an `exact` solution called SMOTE-BD written by the author (see: https://github.com/majobasgall/smote-bd)
big-data imbalanced-data machile-learning scala smote spark
Last synced: 08 Jan 2025
https://github.com/ashton-sidhu/sysmon-extract
Extract logs based on events from Sysmon. Comes as a package, CLI, and UI.
data-science dataengineering infosec spark streamlit sysmon threat-intelligence threathunting
Last synced: 09 Nov 2024
https://github.com/spycsh/runspec
An Android streaming running app with a backend based on Kafka + Spark + MongoDB
android iot kafka leafletjs mongodb restlet spark stompwebsocket ubiquitous-computing
Last synced: 05 Dec 2024
https://github.com/denuvosoftwaresolutions/fighting-bots-at-scale
Fighting Bots at Scale: Identifying Bottlenecks & Best Practice
Last synced: 25 Dec 2024
https://github.com/guiferviz/tuberia
Data engineering meets software engineering
data data-engineering expectations pipeline python spark
Last synced: 20 Dec 2024
https://github.com/ahmetfurkandemir/trendyol-data-engineering-technical-case-study
Trendyol Data Engineering Technical Case Study.
apache-spark case-study data-engineering debian docker maven scala spark trendyol trendyoltech ubuntu
Last synced: 17 Jan 2025
https://github.com/easonlai/databricks_odbc_connection_to_azure_sql_db_with_azure_ad_user_access_token
Making ODBC connection from Databricks (Azure Databricks) to Azure SQL Database with Azure AD User Access Token.
azure azuread azuredatabricks azuresql azuresqldb bigdata data-analysis dataanalysis dataanalytics databricks databricks-notebooks datascience microsoft microsoft-azure microsoftazure odbc odbc-driver pandas pyodbc spark
Last synced: 08 Jan 2025
https://github.com/adrigrillo/nycsparktaxi
Apache Spark application to get the top ten frequent routes and profitable areas
big-data nyc parquet-files python spark taxi
Last synced: 10 Feb 2025
https://github.com/bonigarcia/spark-examples
Collection of Spark examples using Python
cassandra influxdb kafka python spark spark-streaming
Last synced: 08 Feb 2025
https://github.com/radeity/spark-proxy
Push-based calculation for Spark applications
distributed-computing spark volunteer-computing
Last synced: 01 Feb 2025
https://github.com/anicolaspp/mapr-data-gen
Data generator for MapR Data Platform
data mapr mapr-db mapr-es mapr-streams maprdb parquet scala spark
Last synced: 16 Jan 2025
https://github.com/conema/spark-terraform
This project creates a Hadoop and Spark cluster on AWS with Terraform
aws cluster hadoop hadoop-cluster hcl spark spark-clusters terraform
Last synced: 20 Nov 2024
https://github.com/sircamp/spark-pspectrum
P-spectrum embedding and sequence relaxation for NLP in Spark
big-data machine-learning nlp nlp-machine-learning sequence-relaxation spark spark-ml spectrum
Last synced: 20 Jan 2025
https://github.com/neo4j-field/bigquery-connector
Bi-directional connectivity between Google BigQuery and Neo4j AuraDS
arrow-flight bigquery neo4j protobuf python spark
Last synced: 23 Dec 2024
https://github.com/wlongxiang/pyspark_docker
Run a PySpark cluster with Docker on your local laptop
docker docker-compose pyspark pyspark-docker pyspark-tutorial spark
Last synced: 17 Dec 2024
https://github.com/colemurray/movie-rec-tutorial
cloud-infrastructure google-dataproc machine-learning spark tutorial
Last synced: 13 Dec 2024
https://github.com/hdfgroup/hdf5-spark-connector
HDF5 Connector for Apache Spark
Last synced: 19 Dec 2024
https://github.com/puharesource/simplemavenrepository
A simple self hosted maven repository solution, written in Kotlin using the SparkJava framework.
kotlin maven repository spark sparkjava
Last synced: 01 Jan 2025
https://github.com/vivek-bombatkar/dataworkssummit2018_spark_ml
Hands-on introduction to basic machine learning techniques with Apache Spark ML in the cloud.
apache-spark linear-regression machine-learning spark workshop
Last synced: 08 Nov 2024
https://github.com/abronte/pysparkgateway
Connect to remote Spark clusters seamlessly.
apache-spark bigdata pyspark python spark
Last synced: 28 Oct 2024
https://github.com/lgautier/pragmatic-polyglot-data-analysis
Docker container for off-the-shelf jupyter notebook + Python + R + Spark/pyspark + LLVM
docker-container jupyter-notebook python r spark
Last synced: 10 Nov 2024
https://github.com/julienpeloton/mini_spark_broker
Design and proof-of-concept for a Broker for astronomy using Apache Spark
docker kafka python spark spark-structured-streaming
Last synced: 11 Oct 2024
https://github.com/enoy19/keyboard-light-composer-mc-connector
Minecraft Forge Mod to access stats in Minecraft within the Keyboard Light Composer (https://github.com/enoy19/keyboard-light-composer)
composer forge g910 keyboard light logitech minecraft mod orion rgb spark spectrum
Last synced: 17 Jan 2025
https://github.com/mrcolorr/supreme-pancake
Big Data Management project: data collection from a network of sensors is simulated (Kafka), then processed (Spark) and stored (Cassandra) in a distributed and efficient way.
big-data bigdata cassandra cassandra-cluster cassandra-database cloud cloud-computing distributed-computing distributed-database distributed-storage distributed-systems hdfs kafka maven maven-pom spark zerotier zerotier-network zerotier-one
Last synced: 13 Nov 2024
https://github.com/dustin-decker/lognom
Simple script for processing streaming data from Redis using Apache Spark
elasticsearch kafka redis spark
Last synced: 29 Jan 2025
https://github.com/afsalthaj/supaku-sukara
Functional Programming, Functional Programming Exercise Solutions in Scala & Spark
functional-programming functor language monad parallelism scala shapeless spark typeclasses
Last synced: 08 Jan 2025
https://github.com/oracle-quickstart/oci-spark
Terraform module to deploy Spark on Oracle Cloud Infrastructure (OCI)
cloud oci oracle oracle-led spark terraform
Last synced: 07 Nov 2024
https://github.com/maxinexiong/item-based-collaborative-filtering
This project uses PySpark DataFrames and PySpark RDDs to implement item-based collaborative filtering. By calculating cosine similarity scores or identifying movies with the highest number of shared viewers, the system recommends 10 similar movies for a given target movie that align with users’ preferences.
apache-spark collaborative-filtering movie-recommendation pyspark python spark spark-dataframes spark-rdd
Last synced: 21 Dec 2024
https://github.com/maxinexiong/degrees-of-separation-with-breadth-first-search
This project uses PySpark RDDs and the breadth-first search (BFS) algorithm to find the shortest path and degrees of separation between two given Marvel superheroes based on their appearances together in the same comic books, letting users discover connections between their favourite superheroes in the Marvel universe.
apache-spark bfs-algorithm breadth-first-search degrees-of-separation marvel-characters pyspark python spark spark-rdd
Last synced: 21 Dec 2024
https://github.com/hibadaoud/real-time-flight-data-kibana-visualization
Real-Time Flight Data Visualization Dashboard: Interactive web application for real-time flight tracking and airport analytics. Powered by Kafka, PySpark, Elasticsearch, Kibana, Express (Node.js), MongoDB, and Docker.
css docker elasticsearch html javascript jwt-authentication kafka kibana nodejs real-time spark
Last synced: 10 Feb 2025
https://github.com/apache/incubator-gluten-site
Apache Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
Last synced: 04 Feb 2025
https://github.com/superruzafa/scala-spark-big-data
My solutions to Coursera's Big Data Analysis with Scala and Spark course
Last synced: 30 Dec 2024
https://github.com/mcddhub/mcdd-big-data-study
Study project for big data (Hadoop, Zookeeper, Kafka, Flink, Spark)
big-data data-processing docker flink hadoop kafka spark zookeeper
Last synced: 09 Feb 2025
https://github.com/gmartinezramirez-old/data-science-portafolio
:notebook: [Active] Portfolio of data science projects. Using: Python, PyTorch, Spark, TensorFlow, Scikit, Keras. Includes classification, regression, time series, NLP, and deep learning, among others.
data-science data-science-learning data-science-notebook data-science-portfolio hadoop jupyter-notebook keras notebook pandas pyspark python pytorch r sci-kit spark tensorflow
Last synced: 05 Dec 2024
https://github.com/boazmohar/pysparkutils
A collection of utilities for handling PySpark's SparkContext
Last synced: 09 Feb 2025
https://github.com/officiallysingh/spring-boot-starter-spark
Spark Spring Boot starter
spark spring spring-boot springboot starter starters
Last synced: 25 Dec 2024
https://github.com/yoongoing/bigdata_pyspark
⚡️Let's try data mining with Spark, an open MapReduce platform⚡️
bigdata dataminig jupyter-notebook mapreduce mapreduce-python pyspark spark
Last synced: 09 Feb 2025
https://github.com/viniciusmsousa/pyspark-ds-toolbox
A PySpark companion for data science tasks.
Last synced: 09 Feb 2025
https://github.com/garystafford/dataproc-java-demo
Demonstration of Google Cloud Dataproc for running Spark jobs with Java
big-data-analytics dataproc gcp google java spark
Last synced: 06 Dec 2024
https://github.com/prabaprakash/docker-pipeline-for-hadoop-n-spark-submit
Docker CI/CD Pipeline
apache-spark docker docker-compose docker-pipeline gocd-agent gocd-agent-docker gocd-server hadoop spark
Last synced: 14 Jan 2025
https://github.com/dharaneeshvrd/spark-examples
Spark Examples
pyspark spark spark-example spark-sql spark-streaming spark-streaming-kafka spark-structured-streaming
Last synced: 07 Nov 2024
https://github.com/yucl80/avrodemo
Write and append Avro data to HDFS files
avro hdfs hive java kafka log scala spark sparksql sparkstreaming tomcat-log
Last synced: 27 Jan 2025
https://github.com/dvelkow/real_time_bulgarian_news_aggregator
An ETL-driven web scraping and data visualization project that aggregates news from multiple Bulgarian news sources in real-time and creates an interactive dashboard with the fetched data.
Last synced: 12 Oct 2024
https://github.com/conema/transe-pyspark
TransE implementation in Spark (PySpark)
aws distrubuted embedding gradient-descent knowledge-graph pyspark spark terraform transe word-embeddings
Last synced: 21 Jan 2025
https://github.com/xpcosmos/injestao-dados-enem-sql
This project aims to structure ENEM data in databases and analyze the data using statistical methods.
docker docker-compose postgresql pyspark python spark sql statistics
Last synced: 14 Jan 2025
https://github.com/jms0522/hadoop_system
✅ Sets up a Hadoop ecosystem and builds a pipeline.
Last synced: 11 Oct 2024
https://github.com/angelotc/MacroDAG
A Dockerized Airflow ETL pipeline that processes macroeconomic indicators from the Federal Reserve.
Last synced: 06 Nov 2024
https://github.com/yalishanda42/scala-recsys
Scala(-ble) recommender system architecture using functional programming (PoC)
cats cats-effect functional-programming movielens recommender-system recsys scala spark
Last synced: 28 Dec 2024
https://github.com/rezacsedu/Mining-Maximal-Frequent-Pattern-Spark
Implementation of the static mining part of "Mining maximal frequent patterns in transactional databases and dynamic data streams: A spark-based approach", Information Sciences, Volume 432, March 2018, Pages 278-300
data-mining data-stream frequent-pattern-mining java maximal-frequent-pattern spark structured-streaming
Last synced: 30 Oct 2024
https://github.com/derlin/workshop-data-sciences
Material for a two-day workshop introducing data science for big data
big-data data-science hdfs hive spark workshop workshop-materials zeppelin
Last synced: 20 Jan 2025
https://github.com/hifly81/1brc_streaming
1brc challenge with streaming solutions for Apache Kafka
1brc apache camel-kafka flink kafka kafkastreams ksqldb nifi spark spring-kafka streaming
Last synced: 02 Nov 2024
https://github.com/mgrojo/adasearch
Custom search engine for the Ada programming language
ada custom-search-google search-engine spark spark-ada
Last synced: 27 Oct 2024
https://github.com/gacwr/openuba-model-hub
frontend, model registry, model search, and model marketplace for OpenUBA
analytics anomaly-detection cybersecurity datascience elasticsearch elk flask information-security machine-learning security siem sklearn spark tensorflow threathunting uba ueba user-behaviour
Last synced: 15 Jan 2025
https://github.com/kadnan/vagrant-spark2
Vagrant Box with Python 3.6.1, Apache Spark 2.1.1 with Scala 2.11.8 and PySpark (2.1.1).
pyspark python3 spark vagrant vagrant-boxes
Last synced: 20 Jan 2025
https://github.com/tpvasconcelos/sparkypandy
It's not spark, it's not pandas, it's just awkward...
dataframe pandas pyspark spark
Last synced: 05 Nov 2024
https://github.com/makohn/lambda-architecture-poc
♨️ A PoC implementation of the λ-Architecture for collecting and analysing tweets
cassandra kafka lambda-architecture sbt scala spark
Last synced: 19 Dec 2024