Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
![](https://explore-feed.github.com/topics/spark/spark.png)
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-02-14 00:24:01 UTC
https://github.com/univalence/spark-plumbus
Collection of tools for Scala Spark
functional-programming scala spark
Last synced: 20 Jan 2025
https://github.com/wtanaka/ansible-role-apache-spark
Ansible role to install Apache Spark
ansible ansible-galaxy ansible-role ansible-roles apache-spark batch galaxy mapreduce spark streaming
Last synced: 23 Jan 2025
https://github.com/rezacsedu/Mining-Maximal-Frequent-Pattern-Spark
Implementation of Static mining part of "Mining maximal frequent patterns in transactional databases and dynamic data streams: A spark-based approach" Information Sciences, Volume 432, March 2018, Pages 278-300
data-mining data-stream frequent-pattern-mining java maximal-frequent-pattern spark structured-streaming
Last synced: 30 Oct 2024
https://github.com/stefen-taime/investissement
Jenkins Delta pipeline
delta-lake jenkins-pipeline minio spark
Last synced: 23 Jan 2025
https://github.com/policratus/sparkmage
🐘 A tool for blazing fast analysis and clustering of similar images using 🐘 Hadoop and ⚡ Spark.
big-data computer-vision hadoop image-processing spark
Last synced: 02 Nov 2024
https://github.com/cclient/elasticsearch-spark-upsert-from-kafka
elasticsearch-hadoop does not officially support upsert doc; this is implemented by modifying the source code. Spark Kafka streaming example: upsert { "upsert": {}, "doc": {...} }
elasticsearch elasticsearch-hadoop kafka kafka-streams spark upsert upsert-doc
Last synced: 16 Jan 2025
https://github.com/kanchishimono/spark-on-k8s-images
Docker images for spark on kubernetes
docker docker-image dockerfile kubernetes pyspark spark spark-kubernetes spark-on-k8s spark-on-kubernetes
Last synced: 28 Nov 2024
https://github.com/pankajsingh09/data_engineering_using_aws
This Repository contains the contents related to Data Engineering Using AWS
aws data-ingestion dataengineering event-bridge lambda-functions pipeline pycharm-ide pyspark python s3 spark
Last synced: 12 Feb 2025
https://github.com/brooksian/sparkpipeline2mleapbundle
Convert Spark Pipeline Models to MLeap Bundles
Last synced: 19 Jan 2025
https://github.com/brooksian/epaairnow
Exploring EPA Air Now Time Series Data with Apache Spark and Apache Zeppelin
spark sparksql time-series zeppelin-notebook
Last synced: 19 Jan 2025
https://github.com/lmouhib/auto-register-spark-ui-k8s
A lightweight operator to automatically expose the Spark UI and manage its ingress when running Spark on Kubernetes
spark spark-kubernetes spark-sql spark-streaming spark-ui
Last synced: 10 Feb 2025
https://github.com/bluejoe2008/hippo-rpc
Hippo Transport Library enhances spark-commons with easy stream management & handling
Last synced: 10 Feb 2025
https://github.com/pedropark99/introd-pyspark
An open and introductory book for the Python API of Apache Spark (pyspark)
Last synced: 14 Oct 2024
https://github.com/piotr-kalanski/spark-local
API enabling switching between Spark execution engine and local fast implementation based on Scala collections.
Last synced: 13 Feb 2025
https://github.com/hibuz/hadoop-docker
🐳 hadoop ecosystems docker image
data-engineering docker docker-compose flink hadoop hbase hive spark zeppelin
Last synced: 15 Nov 2024
https://github.com/brooksian/ds_gtdb
KMeans Clustering on Global Terrorism Database
global-terrorism-database machine-learning spark sparksql zeppelin-notebook
Last synced: 19 Jan 2025
https://github.com/kmohamedalie/big-data-hadoop-spark-lab
Big Data🛢️ with Hadoop🐘 and Spark⭐ lab🧪🥼
big-data coursera data-engineering docker hadoop ibm kubernetes spark
Last synced: 02 Jan 2025
https://github.com/angelcervera/poc-drivingdistance
Proof of concept to implement a service to calculate the driving distance using osm network
akka openstreetmap osm osm4scala scala spark
Last synced: 10 Feb 2025
https://github.com/brooksian/solrtosparknotebook
Connecting Solr and Spark In An Apache Zeppelin Notebook
Last synced: 19 Jan 2025
https://github.com/angelotc/MacroDAG
A Dockerized Airflow ETL pipeline that processes macroeconomic indicators from the Federal Reserve.
Last synced: 06 Nov 2024
https://github.com/dimajix/docker-spark
Repository for building Docker containers for Spark
Last synced: 05 Jan 2025
https://github.com/thanaraklee/dataflow-with-gcp
This project demonstrates the workflow of a Data Engineer. It utilizes the Google Cloud Platform and Google Colab as the main tools.
airflow apache-spark data-engineering etl pandas spark
Last synced: 25 Dec 2024
https://github.com/fpopic/gg-interview-challenge
(Interview) GG Interview Challenge in Scala/Spark
apache-spark json logstash parsing regex scala spark sparksql
Last synced: 10 Jan 2025
https://github.com/alvarogarcia7/bank-kata-kotlin
Bank pet project, in kotlin. See interests as topics
api-first api-standard bank-kata blackbox-testing etude finite-state-machine gradle gradlew hateoas junit junit5 kata kotlin multimodule paypal-rest-api practice spark sparkjava trikitrok with-client
Last synced: 10 Jan 2025
https://github.com/jldbc/big-data
Coursework from Big Data (CS3390) -- Machine Learning tasks performed using Hadoop, MapReduce, and Spark
big-data hadoop pagerank recommender-system spark
Last synced: 04 Jan 2025
https://github.com/simplexspatial/osm-facts
Proofs and checks about osm pbf format and data content facts
Last synced: 15 Jan 2025
https://github.com/oracle-quickstart/oci-spark
Terraform module to deploy Spark on Oracle Cloud Infrastructure (OCI)
cloud oci oracle oracle-led spark terraform
Last synced: 07 Nov 2024
https://github.com/ugurcanerdogan/machine-learning-with-spark
BBM469*ASG3 - Machine Learning with Spark
apache-spark data-science machine-learning spark
Last synced: 12 Feb 2025
https://github.com/makohn/lambda-architecture-poc
♨️ A PoC implementation of the λ-Architecture for collecting and analysing tweets
cassandra kafka lambda-architecture sbt scala spark
Last synced: 12 Feb 2025
https://github.com/burhanahmed1/big-data-analytics
Practice tasks in Python programming language using Hadoop, MRJob, PySpark for Big Data Analytics.
apache-spark hadoop hadoop-mapreduce jupyter-notebook mrjob pyspark python spark spark-sql sparksql
Last synced: 14 Feb 2025
https://github.com/bria222/animal2
heroku-deployment java postgres spark velocity
Last synced: 04 Jan 2025
https://github.com/renardeinside/dbx-kafka-protobuf-example
Sample code for working with Kafka & Protobuf in Databricks
databricks kafka protobuf scala spark spark-streaming
Last synced: 06 Feb 2025
https://github.com/gacwr/openuba-model-hub
frontend, model registry, model search, and model marketplace for OpenUBA
analytics anomaly-detection cybersecurity datascience elasticsearch elk flask information-security machine-learning security siem sklearn spark tensorflow threathunting uba ueba user-behaviour
Last synced: 15 Jan 2025
https://github.com/afsalthaj/supaku-sukara
Functional Programming, Functional Programming Exercise Solutions in Scala & Spark
functional-programming functor language monad parallelism scala shapeless spark typeclasses
Last synced: 08 Jan 2025
https://github.com/kadnan/vagrant-spark2
Vagrant Box with Python 3.6.1, Apache Spark 2.1.1 with Scala 2.11.8 and PySpark (2.1.1).
pyspark python3 spark vagrant vagrant-boxes
Last synced: 20 Jan 2025
https://github.com/apache/incubator-gluten-site
Apache Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
Last synced: 04 Feb 2025
https://github.com/pomadchin/vlm-performance
GeoTrellis RasterSources Ingest benchmark
aws emr geotrellis gis raster spark
Last synced: 17 Jan 2025
https://github.com/yalishanda42/scala-recsys
Scala(-ble) recommender system architecture using functional programming (PoC)
cats cats-effect functional-programming movielens recommender-system recsys scala spark
Last synced: 28 Dec 2024
https://github.com/mangalaman93/dspark
Run spark in docker containers
big-data containers docker microservices spark
Last synced: 18 Jan 2025
https://github.com/kevinhartman/kafka-to-eventhub
Kafka to EventHub Mirror.
eventhub eventhub-topic kafka mirror spark spark-streaming
Last synced: 13 Feb 2025
https://github.com/hupe1980/docker_pyspark_notebook
Docker Compose setup for PySpark
docker docker-compose ipython jupyter-notebook jupyterlab pyspark python spark uber
Last synced: 02 Feb 2025
https://github.com/fdmsantos/aws-twitter-data-analytics
Project to Learn Data analytics in AWS using twitter data
aws data-analytics data-engineering data-science data-visualization flink spark terraform
Last synced: 26 Jan 2025
https://github.com/longi94/lsde2017-p3-flight-visualization
Animated interactive flight visualization
Last synced: 06 Jan 2025
https://github.com/hifly81/1brc_streaming
1brc challenge with streaming solutions for Apache Kafka
1brc apache camel-kafka flink kafka kafkastreams ksqldb nifi spark spring-kafka streaming
Last synced: 02 Nov 2024
https://github.com/mtpatter/bilao
Jupyter notebooks for filtering Kafka data with Spark Streaming.
avro docker jupyter-notebook kafka spark spark-streaming
Last synced: 12 Jan 2025
https://github.com/brunneis/minebench
Proof-of-Work based benchmark written in Python that works with real Bitcoin data
benchmark bitcoin mining proof-of-work spark
Last synced: 26 Jan 2025
https://github.com/saadsalmanakram/data-processing
This repo is focused on all key frameworks, libraries or tools used for Data Processing
Last synced: 14 Feb 2025
https://github.com/erikerlandson/pyspark-ubi
Minimalist install of pyspark on top of Red Hat UBI
container-image pyspark spark ubi
Last synced: 06 Jan 2025
https://github.com/kwartile/spark-benchmark
Spark Benchmark suite to evaluate cluster configuration and compare the performance with other big data frameworks.
apache-spark benchmark benchmarking-suite cdh cloudera-hadoop hadoop hive impala performance scala spark
Last synced: 08 Feb 2025
https://github.com/jbris/docker-spark-sparklyr
Docker setup for Apache Spark and the R sparklyr package
adminer apache-spark docker docker-compose postgres postgresql rstats rstudio spark spark-dataset spark-master spark-ml spark-worker sparklyr sparklyr-extension
Last synced: 12 Jan 2025
https://github.com/hellomaxime/data-platform-on-kubernetes
Open Source Data Platform on Kubernetes
bigdata data data-pipeline dbt druid etl kubernetes ml open-source platform spark superset
Last synced: 28 Dec 2024
https://github.com/hibadaoud/real-time-flight-data-kibana-visualization
Real-Time Flight Data Visualization Dashboard: Interactive web application for real-time flight tracking and airport analytics. Powered by Kafka, Pyspark, Elasticsearch, Kibana, Express NodeJs, MongoDB, and Docker.
css docker elasticsearch html javascript jwt-authentication kafka kibana nodejs real-time spark
Last synced: 10 Feb 2025
https://github.com/multivacplatform/multivac-fakenews
Detecting users and communities which propagate fake news on Twitter by Apache Spark
deep-learning fakenews machine-learning spark twitter
Last synced: 12 Jan 2025
https://github.com/garciparedes/scala-examples
Set of awesome Scala Examples
breeze functional-programming java scala spark
Last synced: 16 Jan 2025
https://github.com/multivacplatform/multivac-wikipedia
Wonderful reusable codes, libraries and scripts to process Wikipedia page views by using Apache Spark.
data-frame multivac-wikipedia spark spark-sql wikipedia
Last synced: 12 Jan 2025
https://github.com/multivacplatform/multivac-ml
Pre-trained ML models for Apache Spark
machine-learning nlp spark spark-ml
Last synced: 12 Jan 2025
https://github.com/dharaneeshvrd/spark-examples
Spark Examples
pyspark spark spark-example spark-sql spark-streaming spark-streaming-kafka spark-structured-streaming
Last synced: 07 Nov 2024
https://github.com/adityajn105/apache-spark-tutorials
Apache Spark is a big data analysis framework.
bigdata pyspark spark spark-ml spark-rdd spark-tutorials
Last synced: 16 Jan 2025
https://github.com/ashishgopalhattimare/parallel-concurrent-and-distributed-programming-in-java
Parallel, Concurrent, and Distributed Programming in Java | Coursera
block-isolation boruvka-algorithm concurrent-programming critical-section distributed-programming java-8 kafka locks mapreduce-java mpi parallel-programming rice-university spark synchronization threads
Last synced: 21 Jan 2025
https://github.com/open-datastudio/hive-metastore
Hive metastore on Staroid
hadoop hive hive-metastore kubernetes spark staroid
Last synced: 18 Nov 2024
https://github.com/jabhij/crimerate_classification
Developing a system that could classify crime descriptions into different categories which would help the authorities to assign officers to crimes based on the report.
classification crime-analysis crime-classification crime-rates machine-learning mllib pyspark python spark tensorflow
Last synced: 17 Jan 2025
https://github.com/kruglov-dmitry/yelp_data
End-to-end example of how to read big (well, comparably) data from Kafka and write it into Cassandra using Spark Structured Streaming. Uses the Yelp dataset for illustration purposes.
cassandra kafka spark streaming yelp-dataset
Last synced: 19 Jan 2025
https://github.com/bedrockstreaming/sparktest
A testing tool for Scala and Spark developers
Last synced: 31 Dec 2024
https://github.com/aveek-saha/cricket-score-predictor
A Big data application to predict the outcome of a T20 cricket match.
big-data big-data-analytics clustering pyspark spark spark-mllib
Last synced: 24 Dec 2024
https://github.com/pprzetacznik/datalake
Simple datalake
avro data-engineering kafka parquet schema-registry spark spark-structured-streaming
Last synced: 03 Feb 2025
https://github.com/jatin-8898/sparkwebsite
A clean and very interesting looking website. ✨
bootstrap4 css html javascript spark typescript
Last synced: 17 Jan 2025
https://github.com/jinsyin/datalink
⚡ Data integration | DataLink is a lightweight data integration framework built on top of DataX, Spark and Flink
batch big-data bigdata cdc data data-collection data-exchange data-integration data-pipeline data-synchronization datalink etl flink flink-cdc framework integration pipeline spark streaming
Last synced: 15 Nov 2024
https://github.com/inbravo/spark-movie-lens
Various examples of analytics using Apache Spark
Last synced: 02 Feb 2025
https://github.com/renardeinside/terrametria
Source code for a 3D population density map of Germany, with ETL and app logic on top of the Databricks Platform.
databricks deckgl python react spark
Last synced: 03 Dec 2024
https://github.com/codelytv/spark-best_practices_and_deploy-course
Deploy Spark course examples
Last synced: 03 Dec 2024
https://github.com/debanjansarkar/pyspark-maestro
This repo contains implementations of PySpark for real-world use cases for batch data processing, streaming data processing sourced from Kafka, sockets, etc., spark optimizations, business specific bigdata processing scenario solutions, and machine learning use cases.
json kafka kafka-python kafka-streams pyspark pyspark-api pyspark-machine-learning pyspark-mllib pyspark-streaming python3 spark spark-mllib spark-sql spark-streaming
Last synced: 14 Feb 2025
https://github.com/anant/example-cassandra-spark-job-scala
apache-spark cassandra docker etl sbt scala spark
Last synced: 19 Jan 2025
https://github.com/cwienberg/spark-sorting-helpers
Helper library for using secondary sorting in Spark RDD and Dataset operations
Last synced: 23 Jan 2025
https://github.com/j-sephb-lt-n/useful-code-snippets
A searchable collection of useful little pieces of code
aws bash cloud compute-engine docker dockerfile ec2 gcp graph pyspark python r-language shell spark virtual-machine
Last synced: 28 Dec 2024
https://github.com/rpytel1/supercomputing-labs
Fork of the repository for the Supercomputing in Big Data class at TU Delft. Scala, Spark and Kafka were used to perform processing and streaming of GDelt data segments.
big-data gdelt-data kafka scala spark
Last synced: 18 Jan 2025
https://github.com/xpcosmos/injestao-dados-enem-sql
This project aims to structure ENEM data in databases and analyze the data using statistical methods.
docker docker-compose postgresql pyspark python spark sql statistics
Last synced: 14 Jan 2025
https://github.com/lucivpav/dnbc-scala
Parallel implementation of dynamic naive Bayesian classifier
apache-spark bayesian-networks ctu-fit dnbc fit-ctu naive-bayes-classifier scala spark
Last synced: 12 Feb 2025
https://github.com/timvw/adobe-analytics-datafeed-datasource
Apache Spark data source for Adobe Analytics Data Feed
adobe-analytics clickstream python scala spark
Last synced: 08 Nov 2024
https://github.com/vasnake/spark.ml.spatialjointransformer
spark.ml.transformer: join two datasets using spatial relations
geospatial join ml-pipeline python scala spark spark-ml spatial transformer
Last synced: 03 Jan 2025
https://github.com/aiday-mar/spark-recommendation-engine
Movie recommendation system built using Spark and Scala
recommendation-system scala spark university-project
Last synced: 05 Jan 2025
https://github.com/nhviet03/is405_bigdata_mapreduce_knn
A MapReduce-Based k-Nearest Neighbor Approach for Big Data Classification on Apache Spark
knn-classification mapreduce pyspark spark
Last synced: 17 Jan 2025
https://github.com/gaelfoppolo/self-service-data-analytics
Data analysis made for business users
aws big-data data-analytics hadoop spark
Last synced: 03 Feb 2025
https://github.com/badoo/hadoop-xargs
Util to run heterogeneous applications on Hadoop synchronously
Last synced: 12 Nov 2024
https://github.com/conema/transe-pyspark
TransE implementation in Spark (pyspark)
aws distrubuted embedding gradient-descent knowledge-graph pyspark spark terraform transe word-embeddings
Last synced: 21 Jan 2025
https://github.com/manuparra/taller-bigdata-con-r
Big Data workshop with Apache Spark + R from Databricks cloud
bigdata cloudcomputing databricks r spark sparkr
Last synced: 27 Dec 2024
https://github.com/tranthe170/nyc-taxi-pipeline
Building a Data Lakehouse with open source technology. Supports an end-to-end data pipeline, from source data on AWS S3 to the Lakehouse and visualization.
airflow delta-lake hive lakehouse presto python s3 spark superset
Last synced: 17 Jan 2025
https://github.com/triandicAnt/TwitterSentimentAnalytics
Basic Twitter Sentiment Analytics using Apache Spark Streaming APIs and Python by processing live tweets from Twitter.
machine-learning python sentimental-analysis spark twitter twitter-api twitter-sentiment-analytics
Last synced: 23 Oct 2024
https://github.com/omar-besbes/football-big-data
This is a comprehensive solution for real-time football analytics, leveraging Apache Spark on YARN for both streaming and batch processing, Hadoop HDFS for distributed storage, Kafka for real-time data ingestion, RethinkDB for live data updates, and Next.js for data visualization, as well as a custom-built search engine.
batch-processing hadoop kafka nextjs rethinkdb spark streaming t3-stack yarn
Last synced: 20 Jan 2025