Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-02-07 00:28:28 UTC
https://github.com/smsraj2001/stream-batch-processing-kafka-spark
A project that simulates real-time queries with Kafka and performs stream and batch processing of the simulated queries with Spark. It follows the lambda architecture, with Kafka as the publisher and Spark as the subscriber.
batch-processing kafka kafka-topics lambda-architecture mysql-database no-api pub-sub pyspark python3 realtime spark streaming ubuntu2204 zookeeper
Last synced: 01 Jan 2025
https://github.com/vivek-bombatkar/dataworkssummit2018_spark_ml
Hands-on introduction to basic machine learning techniques with Apache Spark ML using the cloud.
apache-spark linear-regression machine-learning spark workshop
Last synced: 08 Nov 2024
https://github.com/ging/fiware-cosmos
The Cosmos Generic Enabler enables easier Big Data analysis over context, integrated with some of the most popular Big Data platforms.
analysis big-data fiware fiware-cosmos flink processing real-time-analytics spark streaming-engine
Last synced: 01 Nov 2024
https://github.com/spycsh/runspec
An Android streaming running app with a backend based on Kafka, Spark, and MongoDB
android iot kafka leafletjs mongodb restlet spark stompwebsocket ubiquitous-computing
Last synced: 05 Dec 2024
https://github.com/conema/spark-terraform
This project creates a Hadoop and Spark cluster on Amazon AWS with Terraform
aws cluster hadoop hadoop-cluster hcl spark spark-clusters terraform
Last synced: 20 Nov 2024
https://github.com/easonlai/databricks_odbc_connection_to_azure_sql_db_with_azure_ad_user_access_token
Making ODBC connection from Databricks (Azure Databricks) to Azure SQL Database with Azure AD User Access Token.
azure azuread azuredatabricks azuresql azuresqldb bigdata data-analysis dataanalysis dataanalytics databricks databricks-notebooks datascience microsoft microsoft-azure microsoftazure odbc odbc-driver pandas pyodbc spark
Last synced: 08 Jan 2025
https://github.com/bonigarcia/spark-examples
Collection of Spark examples using Python
cassandra influxdb kafka python spark spark-streaming
Last synced: 08 Feb 2025
https://github.com/gabfr/truck-data-wrangler
ELT (Extract, Load, Transform) process of accelerometer/gyroscope events with Apache Spark (w/ Structured Streaming) and TimescaleDB
data-classification spark stream timescaledb
Last synced: 07 Dec 2024
https://github.com/sircamp/spark-pspectrum
P-spectrum embedding and sequence relaxation for NLP in Spark
big-data machine-learning nlp nlp-machine-learning sequence-relaxation spark spark-ml spectrum
Last synced: 20 Jan 2025
https://github.com/highoncarbs/lumberjack
:pick: Search and analyse your logs efficiently with Lumberjack
analysis flask log logging python spark web-dashboard
Last synced: 14 Oct 2024
https://github.com/sebastianruizm/spark-kafka-cassandra
Demo Spark Structured Streaming + Apache Kafka + Apache Cassandra
cassandra docker kafka spark structured-streaming
Last synced: 11 Nov 2024
https://github.com/colemurray/movie-rec-tutorial
cloud-infrastructure google-dataproc machine-learning spark tutorial
Last synced: 13 Dec 2024
https://github.com/dustin-decker/lognom
Simple script for processing streaming data from Redis using Apache Spark
elasticsearch kafka redis spark
Last synced: 29 Jan 2025
https://github.com/denuvosoftwaresolutions/fighting-bots-at-scale
Fighting Bots at Scale: Identifying Bottlenecks & Best Practice
Last synced: 25 Dec 2024
https://github.com/felipekunzler/spark-twitter-analysis
Analyse a Twitter dataset with Spark and visualize the results on a React dashboard.
Last synced: 30 Oct 2024
https://github.com/wlongxiang/pyspark_docker
Run a PySpark cluster with Docker on your local laptop
docker docker-compose pyspark pyspark-docker pyspark-tutorial spark
Last synced: 17 Dec 2024
https://github.com/tayeva/satellite-kafka-spark-delta-lake-pipeline-example
Demo App - Satellite Producer/Consumer App
cpp17 delta-lake docker docker-compose flatbuffers java kafka parquet scala spark spark-streaming
Last synced: 02 Jan 2025
https://github.com/yandex-cloud/yc-delta
Delta Lake for Yandex Data Processing
delta delta-lake deltalake spark yandex-cloud
Last synced: 11 Nov 2024
https://github.com/aromoh/keras-distributed-streaming
Distributed Keras model for predicting the sentiment of Spanish sentences in a streaming context using Spark Streaming and Apache Kafka
cnn-keras kafka keras keras-tensorflow pyspark-notebook sentiment-analysis spark spark-streaming
Last synced: 03 Feb 2025
https://github.com/multivacplatform/multivac-kaggle-titanic
Simple example of Titanic competition by Spark 2.2
kaggle-competition machine-learning scala spark
Last synced: 12 Jan 2025
https://github.com/lgautier/pragmatic-polyglot-data-analysis
Docker container for off-the-shelf jupyter notebook + Python + R + Spark/pyspark + LLVM
docker-container jupyter-notebook python r spark
Last synced: 10 Nov 2024
https://github.com/mrcolorr/supreme-pancake
Big Data Management project: The collection of data from a network of sensors was simulated (kafka), which then had to be processed (spark) and stored (cassandraDB) in a distributed and efficient way.
big-data bigdata cassandra cassandra-cluster cassandra-database cloud cloud-computing distributed-computing distributed-database distributed-storage distributed-systems hdfs kafka maven maven-pom spark zerotier zerotier-network zerotier-one
Last synced: 13 Nov 2024
https://github.com/guiferviz/tuberia
Data engineering meets software engineering
data data-engineering expectations pipeline python spark
Last synced: 20 Dec 2024
https://github.com/fiqryq/sparkar-pekerjaan-impian
Instagram Filter Using Spark AR
augmented-reality-applications facebook instagram spark
Last synced: 26 Jan 2025
https://github.com/s8sg/spark-py-submit
A Python library to submit Spark jobs to a YARN cluster on different distributions (currently CDH, HDP)
cdh hdfs hdp python-library spark spark-clusters spark-job
Last synced: 01 Feb 2025
https://github.com/fiqryq/spark-minimal-gray
🥰 Simple Instagram filter using Spark AR Studio by Facebook.
Last synced: 26 Jan 2025
https://github.com/radeity/spark-proxy
push-based calculation for spark application
distributed-computing spark volunteer-computing
Last synced: 01 Feb 2025
https://github.com/enoy19/keyboard-light-composer-mc-connector
Minecraft Forge Mod to access stats in Minecraft within the Keyboard Light Composer (https://github.com/enoy19/keyboard-light-composer)
composer forge g910 keyboard light logitech minecraft mod orion rgb spark spectrum
Last synced: 17 Jan 2025
https://github.com/abronte/pysparkgateway
Connect to remote Spark clusters seamlessly.
apache-spark bigdata pyspark python spark
Last synced: 28 Oct 2024
https://github.com/abdelmajidlh/spark-functionality-repo
This GitHub repository contains a detailed document on the basics of the Scala language.
apache apachespark databricks databricks-notebooks pyspark python3 scala spark
Last synced: 05 Feb 2025
https://github.com/jacopodl/spark
Low level network library :satellite: :zap:
c low-level network network-programming networking raw raw-data raw-sockets spark
Last synced: 31 Jan 2025
https://github.com/ahmetfurkandemir/trendyol-data-engineering-technical-case-study
Trendyol Data Engineering Technical Case Study.
apache-spark case-study data-engineering debian docker maven scala spark trendyol trendyoltech ubuntu
Last synced: 17 Jan 2025
https://github.com/chaokunyang/athena
A task scheduler for spark, flink, mapreduce, java, python, bash
flink hadoop mapreduce spark task-manager task-scheduler
Last synced: 19 Nov 2024
https://github.com/julienpeloton/mini_spark_broker
Design and proof-of-concept for a Broker for astronomy using Apache Spark
docker kafka python spark spark-structured-streaming
Last synced: 11 Oct 2024
https://github.com/michabirklbauer/mahout_docker
Running Apache Mahout in Docker.
apache docker dockerfile hadoop mahout maven spark
Last synced: 04 Jan 2025
https://github.com/harshoza36/movielens_pyspark
MovieLens Dataset analysis using Hadoop and Pyspark
big-data-analytics hadoop movielens movielens-data-analysis pyspark spark spark-sql
Last synced: 10 Jan 2025
https://github.com/tomwhite/disq-original
A library for manipulating bioinformatics sequencing formats in Apache Spark.
bioinformatics genomics ngs sequencing spark
Last synced: 18 Dec 2024
https://github.com/neo4j-field/bigquery-connector
Bi-directional connectivity between Google BigQuery and Neo4j AuraDS
arrow-flight bigquery neo4j protobuf python spark
Last synced: 23 Dec 2024
https://github.com/ashton-sidhu/sysmon-extract
Extract logs based off events from sysmon. Comes as a package, cli and ui.
data-science dataengineering infosec spark streamlit sysmon threat-intelligence threathunting
Last synced: 09 Nov 2024
https://github.com/adovasoft-rnd/ci-recharge
composer test
ci4 cli codeigniter codeigniter4 commandline controller db library make migration mode mysql php seeds spark sql
Last synced: 14 Oct 2024
https://github.com/wittline/livyc
Apache Spark as a Service with Apache Livy Client
apache-livy apache-spark big-data data-engineering dataengineering docker livy-client livy-docker pyhton spark
Last synced: 14 Oct 2024
https://github.com/vanessaaleung/ad-ctr-prediction
Ads Click-Through-Rate Prediction
ctr deep-learning prediction python scikit-learn spark
Last synced: 08 Jan 2025
https://github.com/ahmetfurkandemir/hepsiburada-data-engineering-project
Hepsiburada Data Engineering Project
Last synced: 17 Jan 2025
https://github.com/mobiletelesystems/spark-dialect-extension
Package extending the default dialect capabilities for Spark.
etl etl-components plugin-system spark
Last synced: 11 Oct 2024
https://github.com/imlegend19/vidspark
VidSpark is a prototype video CMS backend system powered by spark and elasticsearch
celery elasticsearch python redis scala spark
Last synced: 14 Jan 2025
https://github.com/adrigrillo/nycsparktaxi
Apache Spark application to get the top ten frequent routes and profitable areas
big-data nyc parquet-files python spark taxi
Last synced: 18 Dec 2024
https://github.com/puharesource/simplemavenrepository
A simple self hosted maven repository solution, written in Kotlin using the SparkJava framework.
kotlin maven repository spark sparkjava
Last synced: 01 Jan 2025
https://github.com/longshilin/spark-wordcount
Spark word count example | built in Eclipse + Maven + Scala Project + Spark
helloworld maven scala scala-programming spark wordcount
Last synced: 10 Nov 2024
https://github.com/anicolaspp/mapr-data-gen
Data generator for MapR Data Platform
data mapr mapr-db mapr-es mapr-streams maprdb parquet scala spark
Last synced: 16 Jan 2025
https://github.com/curycu/sparkstudy
Example code for Spark SQL data wrangling
Last synced: 05 Nov 2024
https://github.com/anant/example-cassandra-spark-elasticsearch
cassandra datastax docker elasticsearch scala spark spark-sql
Last synced: 19 Jan 2025
https://github.com/hdfgroup/hdf5-spark-connector
HDF5 Connector for Apache Spark
Last synced: 19 Dec 2024
https://github.com/exacaster/markdown_frames
Markdown tables parsing to pySpark/Pandas DataFrames
Last synced: 11 Nov 2024
https://github.com/anicolaspp/maprdbconnector
An independent MapR-DB Connector for Apache Spark that fully utilizes MapR-DB secondary indexes
database-connector mapr mapr-db maprdb-spark ojai scala spark
Last synced: 16 Nov 2024
https://github.com/chen0040/spark-opt-moea
Distributed Multi-Objective Evolutionary Computation Framework for Spark
moea multi-objective-optimization nsga-ii spark
Last synced: 09 Feb 2025
https://github.com/majobasgall/smote-mr
SMOTE-MR: A distributed Synthetic Minority Oversampling Technique (SMOTE) for Big Data that applies a MapReduce-based approach. SMOTE-MR is categorized as an `approximated/ non exact` solution. There is also an `exact` solution called SMOTE-BD, written by the author (see: https://github.com/majobasgall/smote-bd)
big-data imbalanced-data machile-learning scala smote spark
Last synced: 08 Jan 2025
https://github.com/pomadchin/vlm-performance
GeoTrellis RasterSources Ingest benchmark
aws emr geotrellis gis raster spark
Last synced: 17 Jan 2025
https://github.com/maxinexiong/degrees-of-separation-with-breadth-first-search
This project uses PySpark RDDs and the breadth-first search (BFS) algorithm to find the shortest path and degrees of separation between two given Marvel superheroes based on their appearances together in the same comic books, letting users discover connections between their favourite superheroes in the Marvel universe.
apache-spark bfs-algorithm breadth-first-search degrees-of-separation marvel-characters pyspark python spark spark-rdd
Last synced: 21 Dec 2024
https://github.com/hifly81/1brc_streaming
1brc challenge with streaming solutions for Apache Kafka
1brc apache camel-kafka flink kafka kafkastreams ksqldb nifi spark spring-kafka streaming
Last synced: 02 Nov 2024
https://github.com/yucl80/avrodemo
Write and append Avro to an HDFS file
avro hdfs hive java kafka log scala spark sparksql sparkstreaming tomcat-log
Last synced: 27 Jan 2025
https://github.com/hellomaxime/data-platform-on-kubernetes
Open Source Data Platform on Kubernetes
bigdata data data-pipeline dbt druid etl kubernetes ml open-source platform spark superset
Last synced: 28 Dec 2024
https://github.com/gacwr/openuba-model-hub
frontend, model registry, model search, and model marketplace for OpenUBA
analytics anomaly-detection cybersecurity datascience elasticsearch elk flask information-security machine-learning security siem sklearn spark tensorflow threathunting uba ueba user-behaviour
Last synced: 15 Jan 2025
https://github.com/dharaneeshvrd/spark-examples
Spark Examples
pyspark spark spark-example spark-sql spark-streaming spark-streaming-kafka spark-structured-streaming
Last synced: 07 Nov 2024
https://github.com/renardeinside/dbx-kafka-protobuf-example
Sample code for working with Kafka & Protobuf in Databricks
databricks kafka protobuf scala spark spark-streaming
Last synced: 06 Feb 2025
https://github.com/kadnan/vagrant-spark2
Vagrant Box with Python 3.6.1, Apache Spark 2.1.1 with Scala 2.11.8 and PySpark (2.1.1).
pyspark python3 spark vagrant vagrant-boxes
Last synced: 20 Jan 2025
https://github.com/kevinhartman/kafka-to-eventhub
Kafka to EventHub Mirror.
eventhub eventhub-topic kafka mirror spark spark-streaming
Last synced: 20 Dec 2024
https://github.com/kanchishimono/spark-on-k8s-images
Docker images for spark on kubernetes
docker docker-image dockerfile kubernetes pyspark spark spark-kubernetes spark-on-k8s spark-on-kubernetes
Last synced: 28 Nov 2024
https://github.com/ashishgopalhattimare/parallel-concurrent-and-distributed-programming-in-java
Parallel, Concurrent, and Distributed Programming in Java | Coursera
block-isolation boruvka-algorithm concurrent-programming critical-section distributed-programming java-8 kafka locks mapreduce-java mpi parallel-programming rice-university spark synchronization threads
Last synced: 21 Jan 2025
https://github.com/univalence/spark-plumbus
Collection of tools for Scala Spark
functional-programming scala spark
Last synced: 20 Jan 2025
https://github.com/hussaintaj-w/spark_submit_project
An easy to use script that automatically adds files to the spark-submit command.
Last synced: 23 Jan 2025
https://github.com/simplexspatial/osm-facts
Proofs and checks about osm pbf format and data content facts
Last synced: 15 Jan 2025
https://github.com/engineering-research-and-development/fiware-orion-pyspark-connector
Bidirectional Orion/Orion-LD <--> PySpark Connector
cognitive fiware ngsi ngsi-ld ngsi-v2 orion orion-context-broker orion-ld processing pyspark python spark
Last synced: 17 Jan 2025
https://github.com/tranthe170/nyc-taxi-pipeline
Building a data lakehouse with open source technology. Supports an end-to-end data pipeline, from source data on AWS S3 to the lakehouse and visualization.
airflow delta-lake hive lakehouse presto python s3 spark superset
Last synced: 17 Jan 2025
https://github.com/codelytv/spark-best_practices_and_deploy-course
Deploy Spark course examples
Last synced: 03 Dec 2024
https://github.com/renardeinside/terrametria
Source code for a 3D population density map of Germany, with ETL and app logic on top of the Databricks Platform.
databricks deckgl python react spark
Last synced: 03 Dec 2024
https://github.com/conema/transe-pyspark
TransE implementation in Spark (pyspark)
aws distrubuted embedding gradient-descent knowledge-graph pyspark spark terraform transe word-embeddings
Last synced: 21 Jan 2025
https://github.com/bedrockstreaming/sparktest
A testing tool for Scala and Spark developers
Last synced: 31 Dec 2024
https://github.com/hb-chen/spark-elasticsearch-recommender
Zeppelin v0.8.0 notebook demonstrating how to build a recommender system with Spark v2.3.2 + Elasticsearch v6.3.2
elasticsearch recommender spark zeppelin
Last synced: 08 Jan 2025
https://github.com/inbravo/spark-movie-lens
Various examples of analytics using Apache Spark
Last synced: 02 Feb 2025
https://github.com/hibuz/hadoop-docker
🐳 hadoop ecosystems docker image
data-engineering docker docker-compose flink hadoop hbase hive spark zeppelin
Last synced: 15 Nov 2024
https://github.com/mgrojo/adasearch
Custom search engine for the Ada programming language
ada custom-search-google search-engine spark spark-ada
Last synced: 27 Oct 2024
https://github.com/hupe1980/docker_pyspark_notebook
Docker Compose setup for PySpark
docker docker-compose ipython jupyter-notebook jupyterlab pyspark python spark uber
Last synced: 02 Feb 2025
https://github.com/j-sephb-lt-n/useful-code-snippets
A searchable collection of useful little pieces of code
aws bash cloud compute-engine docker dockerfile ec2 gcp graph pyspark python r-language shell spark virtual-machine
Last synced: 28 Dec 2024
https://github.com/lucivpav/dnbc-scala
Parallel implementation of dynamic naive Bayesian classifier
apache-spark bayesian-networks ctu-fit dnbc fit-ctu naive-bayes-classifier scala spark
Last synced: 20 Dec 2024
https://github.com/xpcosmos/injestao-dados-enem-sql
This project aims to structure ENEM data in databases and analyze the data using statistical methods.
docker docker-compose postgresql pyspark python spark sql statistics
Last synced: 14 Jan 2025
https://github.com/kmohamedalie/big-data-hadoop-spark-lab
Big Data🛢️ with Hadoop🐘 and Spark⭐ lab🧪🥼
big-data coursera data-engineering docker hadoop ibm kubernetes spark
Last synced: 02 Jan 2025
https://github.com/aiday-mar/spark-recommendation-engine
Movie recommendation system built using Spark and Scala
recommendation-system scala spark university-project
Last synced: 05 Jan 2025
https://github.com/manuparra/taller-bigdata-con-r
Big Data workshop with Apache Spark + R on Databricks cloud
bigdata cloudcomputing databricks r spark sparkr
Last synced: 27 Dec 2024
https://github.com/triandicAnt/TwitterSentimentAnalytics
Basic Twitter Sentiment Analytics using Apache Spark Streaming APIs and Python by processing live tweets from Twitter.
machine-learning python sentimental-analysis spark twitter twitter-api twitter-sentiment-analytics
Last synced: 23 Oct 2024