Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
![](https://explore-feed.github.com/topics/spark/spark.png)
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-02-12 00:28:12 UTC
https://github.com/multivacplatform/multivac-wikipedia
Reusable code, libraries, and scripts for processing Wikipedia page views with Apache Spark.
data-frame multivac-wikipedia spark spark-sql wikipedia
Last synced: 12 Jan 2025
https://github.com/multivacplatform/multivac-ml
Pre-trained ML models for Apache Spark
machine-learning nlp spark spark-ml
Last synced: 12 Jan 2025
https://github.com/kruglov-dmitry/yelp_data
End-to-end example of how to read big (well, comparatively) data from Kafka and write it into Cassandra using Spark Structured Streaming. Uses the Yelp dataset for illustration.
cassandra kafka spark streaming yelp-dataset
Last synced: 19 Jan 2025
https://github.com/anant/example-cassandra-spark-job-scala
apache-spark cassandra docker etl sbt scala spark
Last synced: 19 Jan 2025
https://github.com/kwartile/spark-benchmark
Spark Benchmark suite to evaluate cluster configuration and compare the performance with other big data frameworks.
apache-spark benchmark benchmarking-suite cdh cloudera-hadoop hadoop hive impala performance scala spark
Last synced: 08 Feb 2025
https://github.com/peteprattis/insurance-company-database-with-jdbc-and-spark-rdd
A JDBC application that runs queries in pgAdmin to simulate the functionality of an insurance company's database, using Apache Spark RDDs to implement the queries.
computer-science java jdbc jdbc-database partitioning partitions postgresql program query spark spark-sql sparkjava sql student
Last synced: 18 Jan 2025
https://github.com/javaidiqbal11/arabic-tweets-sentiment-analysis-using-spark
Sentiment analysis of an Arabic Twitter dataset using Apache Spark.
apache-spark arabic-nlp arabic-tweets flask pyhton3 sentiment-analysis spark twitter-api
Last synced: 03 Jan 2025
https://github.com/firefly55lm/superconductors_critical_temperature_analysis
Academic project for Big Data Laboratory
chemistry docker machine-learning physics pyspark spark
Last synced: 25 Jan 2025
https://github.com/beiyuouo/mi-store-log-analysis
👨🦽 Pseudo Xiaomi Store: big data e-commerce log analysis
flask full-stack java kafka python spark
Last synced: 02 Feb 2025
https://github.com/amthorn/qutex
A basic Queue Management System, interactable via several mediums, that resembles a mutex.
ava bot bots cisco cisco-spark cisco-spark-bot mutex queue queuebot queues qutex spark thorn webex webex-teams
Last synced: 13 Nov 2024
https://github.com/fpopic/hf-interview-challenge
(Interview) Mixing Data Engineering & Data Science with PySpark
data-engineering data-science pyspark python recipes spark
Last synced: 10 Jan 2025
https://github.com/f-lab-edu/league-of-legends-data-solution
Benchmarked against League of Legends, this project provides a data solution that keeps data flowing smoothly in real time through an API that generates player behavior events.
Last synced: 11 Oct 2024
https://github.com/rishav273/spark-cluster-multi-node-setup
Quickly set up and simulate a multi-node Spark cluster using Docker and docker-compose.
docker docker-compose pyspark python3 spark
Last synced: 11 Oct 2024
https://github.com/scrapcodes/kafkaproducer
Benchmarks to measure latency using Spark and Kafka.
Last synced: 03 Jan 2025
https://github.com/tupol/spark-apps.seed.g8
Create Spark application projects based on the spark-utils library.
application scala spark template
Last synced: 17 Jan 2025
https://github.com/jimthompson5802/datascience_containers
Personal docker images for various data science software stacks
data-science docker h2oai jupyter-notebook kubernetes python rstudio-servers spark
Last synced: 29 Dec 2024
https://github.com/scrapcodes/spark-templates
One-stop shop for Apache Spark starter samples.
Last synced: 03 Jan 2025
https://github.com/kingyiusuen/udacity-data-engineering-nanodegree
Projects for Udacity's Data Engineering Nanodegree
airflow aws aws-athena aws-glue aws-redshift aws-s3 cassandra data-engineering spark
Last synced: 23 Jan 2025
https://github.com/izeigerman/twinkle
A collection of helpers and utilities for Apache Spark
Last synced: 08 Feb 2025
https://github.com/soumyadipta2020/sparkr_test
Sample Spark code written in R
r r-coding r-programming r-programming-language spark sparkr
Last synced: 05 Jan 2025
https://github.com/antonio-f/big-data-analysis-with-scala-and-spark
Coding assignments from the course "Big Data Analysis with Scala and Spark" (Coursera).
big-data bigdata coursera data-analysis scala spark
Last synced: 06 Feb 2025
https://github.com/s8sg/spark-standalone-cluster
Spark Standalone Cluster With Zookeeper
docker docker-compose spark zookeeper
Last synced: 01 Feb 2025
https://github.com/chimera-suite/spark-sidecar-setup
The sidecar setup container executes SparkSQL scripts against an Apache Spark instance.
docker setup sidecar-container spark sparksql
Last synced: 03 Jan 2025
https://github.com/chimera-suite/use-case
A step-by-step tutorial that showcases the capabilities of Chimera
chimera jena-fuseki knowledge-graph ontology pizza spark sparql-query
Last synced: 03 Jan 2025
https://github.com/chen0040/vagrant-big-data
Vagrantfiles for development in big data
cassandra elasticsearc hdfs kafka mesos redis spark storm vagrantfile zookeeper
Last synced: 09 Feb 2025
https://github.com/kampi/particle-mqtt
MQTT client implementation for TCP-capable devices (e.g., Argon, Photon) from Particle IoT.
cpp mqtt particle-argon particle-iot particle-swarm-optimization spark
Last synced: 21 Jan 2025
https://github.com/nikoshet/pyspark-movie-similarities
Using Spark In Python For Movie Similarities With Jaccard Index
jaccard-index movie-similarities pyspark spark
Last synced: 03 Jan 2025
https://github.com/nikoshet/spark-mlp
Multilayer Perceptron Implementation Using Spark
hdfs machine-learning mapreduce multilayer-perceptron pyspark python spark
Last synced: 03 Jan 2025
https://github.com/pomadchin/geotiff-layer
GeoTrellis GeoTiff layer demo
aws-lambda cog geotiff-layer geotrellis geotrellis-tile-server gis spark tiff
Last synced: 17 Jan 2025
https://github.com/snexus/streaming-playground
Exploring streaming design patterns with Kafka and Spark Structured Streaming
kafka kafka-producer python spark spark-streaming
Last synced: 23 Jan 2025
https://github.com/talmago/pyspark-loglikelihood
PySpark Loglikelihood Similarity Examples
mahout pyspark recommendation-engine spark
Last synced: 03 Feb 2025
https://github.com/palutz/functionalscala_coursera
Functional programming in Scala Certification path (EPFL)
big-data coursera distributed-computing functional-programming parallel-computing scala spark
Last synced: 17 Jan 2025
https://github.com/zkan/machine-learning-with-spark-and-zeppelin
Machine Learning with Apache Spark & Zeppelin
Last synced: 12 Feb 2025
https://github.com/marcorfilacarreras/matemaquest
A simple API for retrieving information about the "Pruebas Canguro" exams
api docker github-actions java math mathematics spark
Last synced: 13 Jan 2025
https://github.com/akaliutau/spark-recipes
A collection of data processing solutions built on top of Spark
Last synced: 11 Jan 2025
https://github.com/ishaansathaye/csc369-introdistributedcomputing
Cal Poly Fall 2024 CSC 369 Intro to Distributed Computing
distributed-computing hadoop java map-reduce scala spark
Last synced: 09 Feb 2025
https://github.com/abtinz/cloud-computing
cassandra cassandra-driver cloud-computing docker elasticsearch hadoop hdfs kubernetes redis spark
Last synced: 21 Jan 2025
https://github.com/fiware/tutorials.big-data-spark
FIWARE 306: Real-time Processing of Context Data using Apache Spark
apache-spark big-data-analytics fiware fiware-cosmos orion-spark-connector spark tutorial
Last synced: 17 Nov 2024
https://github.com/drsnowbird/nlp-deeplearning-projects
NLP Deep Learning Projects (Warning - Not ready for public consumption yet!)
chatbot deep-learning mallet nlp python3 rasa-core rasa-nlu spark tensorflow
Last synced: 13 Jan 2025
https://github.com/renardeinside/databricks-jobs-jsonnet
Example project with Databricks jobs and configuration management via jsonnet
Last synced: 06 Feb 2025
https://github.com/shayartt/streaming-orders
Project that streams real-time orders and applies ETL pipelines and analytics using Databricks, Kafka, and AWS
databricks etl kafka python spark spark-streaming
Last synced: 12 Oct 2024
https://github.com/benitomartin/de-hotel-reviews
Data Engineering Hotel Reviews
cicd data-engineering dbt gcp jupyter-notebook looker prefect python spark sql terraform
Last synced: 31 Dec 2024
https://github.com/damianmarti/7506-spark
Notebooks from the 75-06 Organización de Datos (Data Organization) classes at FIUBA
Last synced: 09 Feb 2025
https://github.com/vitalibo/distributed-heatmap-service
Simple distributed heatmap service on top of Apache HBase
aws hbase hbase-coprocessor heatmap spark spark-sql spring-boot
Last synced: 27 Dec 2024
https://github.com/manuparra/clustering-openstack
Make a dynamic and customizable cluster with OpenStack
cluster deployment hadoop openstack openstack-command script slave-nodes spark
Last synced: 27 Dec 2024
https://github.com/azurespheredev/microsoftfabric-exploratorium
A comprehensive educational resource hub dedicated to mastering Microsoft Fabric, offering in-depth tutorials, real-world use cases, and hands-on guides for seamless end-to-end analytics
analytics data-science data-transformation lakehouse microsoft-fabric one-lake powerbi real-time-analytics spark warehouse
Last synced: 11 Jan 2025
https://github.com/giuliosmall/twitter-trending-topics-pipeline
This project demonstrates trending topic detection using Apache Spark and MinIO. It processes Twitter JSON data with PySpark, leveraging distributed data processing and cloud storage. The entire project is containerized with Docker for easy deployment across architectures.
docker minio nlp pyspark pytest spacy spark streamlit
Last synced: 05 Feb 2025
https://github.com/librity/rtjvm_spark_essentials
Rock The JVM - Apache Spark Essentials
apache-spark big-data docker scala spark spark-sql
Last synced: 08 Jan 2025
https://github.com/wgierke/distributed_data_analytics
Solutions for the hands-on sessions of the course "Distributed Data Analytics" at Hasso-Plattner-Institute using Akka and Spark.
akka data-analytics distributed inclusion-dependency spark
Last synced: 09 Feb 2025
https://github.com/tallamjr/jetspark
Spark cluster on Jetson TX2 mini-project
Last synced: 10 Feb 2025
https://github.com/stabrise/scaledp-tutorials
Tutorials for ScaleDP library. ScaleDP is an Open-Source Library for Processing Documents in Apache Spark.
ner nlp ocr ocr-python pdf spark
Last synced: 30 Jan 2025
https://github.com/mounirbs/spark-livy
Spark Livy, a docker-compose solution enabling a Spark Cluster with a Livy endpoint
apache apache-spark docker docker-compose livy pyspark python spark
Last synced: 08 Feb 2025
https://github.com/sunsided/spark-atlas
Spark vs. MongoDB Atlas
data-processing docker jupyter-notebook mongodb mongodb-atlas pyspark python spark
Last synced: 20 Dec 2024
https://github.com/aldantanneo/bigints
WIP constant-time bigint implementation in SPARK
ada bigint cryptography formal-verification spark
Last synced: 30 Jan 2025
https://github.com/imvision12/real-time-tracking
Real time bus tracking using MTA bus API
flask hadoop javascript leaflet python spark
Last synced: 08 Feb 2025
https://github.com/abrahamkoloboe27/random-user-streaming-pipeline
Data Engineering Project - Data Pipeline
airflow airflow-dags api docker docker-compose etl etl-pipeline extract-transform-load kafka kafka-consumer kafka-producer makefile orchestration postgresql python schema-registry spark spark-streaming zookeeper
Last synced: 30 Jan 2025
https://github.com/codelytv/spark-kafka_rabbitmq_sqs-course
Course examples for integrating Spark with queue systems
apache-spark aws-sqs kafka rabbitmq spark
Last synced: 30 Jan 2025
https://github.com/m-molaei/twitter-sentiment-analysis-using-apache-spark-
Sentiment analysis using deep learning models and FastText embedding on Apache Spark
apache-cassandra apache-spark big-data fasttext fasttext-embeddings mongodb pyspark rdd sentiment-analysis sentiment140-dataset spark
Last synced: 21 Jan 2025
https://github.com/alexott/cyber-spark-data-connectors
Cybersecurity-related custom data connectors for Spark
cybersecurity databricks pyspark spark
Last synced: 30 Jan 2025
https://github.com/hungreeee/reddit-realtime-streaming-pipeline
End-to-end real-time pipeline for comments processing of any subreddit for sentiment analysis.
cassandra docker-compose kafka praw-reddit real-time reddit-api spark
Last synced: 12 Jan 2025
https://github.com/nkdwon/crud-spark
A CRUD application written in Java that integrates PostgreSQL and the Spark framework, developed in the Eclipse environment
eclipse-ide git java maven pgadmin4 postgresql spark
Last synced: 06 Jan 2025
https://github.com/darule0/yarndiff
A rudimentary command-line utility for comparing Apache YARN container logs.
diff difference diffing hadoop hadoop-mapreduce hive log4j mapreduce pig spark yarn yarn2
Last synced: 23 Dec 2024
https://github.com/tomfran/lastfm-users-analysis
Last.fm user data collection and analysis using Spark
Last synced: 06 Jan 2025
https://github.com/samuele-lolli/steam-recommendation-system
A basic recommendation system built with Scala and Spark
Last synced: 04 Feb 2025
https://github.com/thdaraujo/cheat
A handful of cheatsheets and programming tips.
bash cheat-sheets cheatsheet dms hadoop postgresql spark sqoop
Last synced: 24 Jan 2025
https://github.com/stefanofioravanzo/evolving-wikipedia-graph
Distributed processing of Wikipedia history files using Hadoop and Spark
distributed-processing hadoop-hdfs spark wikipedia
Last synced: 19 Jan 2025
https://github.com/nashtech-labs/spark-on-mesos
deployment mesos spark word-count
Last synced: 23 Dec 2024
https://github.com/casassg/thesis
Undergraduate final thesis: Big Data Analytics on Container Orchestrated Systems
casassg-thesis cassandra docker kubernetes latex spark thesis zeppelin
Last synced: 17 Dec 2024
https://github.com/rupeshtr78/awsiot
AWS IoT Integration Using EMR, Spark, and Kinesis
aws aws-emr dynamodb iot kinesis spark spark-streaming
Last synced: 12 Jan 2025
https://github.com/rdalmarco/datascience
Studies on data science, big data, and machine learning
estatistica pandas python r spark sql
Last synced: 03 Jan 2025
https://github.com/yjham2002/tcp_conn_with_spark
mysql protocol redis redis-client spark tcp tcp-server
Last synced: 06 Jan 2025
https://github.com/rupeshtr78/aws-emr
Spark Job on Amazon EMR cluster
aws cluster emr-cluster mapreduce mapredue scala spark
Last synced: 12 Jan 2025
https://github.com/felixcheung/spark-build
Build Apache Spark
apache-spark docker-image dockerfile spark
Last synced: 01 Feb 2025
https://github.com/darule0/sparkdiff
A rudimentary command-line utility for comparing Apache Spark event logs.
apache-spark compare-files diff difference diffing spark spark-sql spark-streaming sparksql
Last synced: 06 Feb 2025
https://github.com/jbris/time-series-airflow-kafka-spark
A simple demonstration of an Airflow-Kafka-Spark (AKS) stack for online time series forecasting.
airflow airflow-dags bentoml bentoml-service kafka kafka-consumer kafka-producer kafka-streams minio mlflow mlflow-tracking-server mlops mlops-workflow online-learning spark spark-sql spark-streaming time-series time-series-analysis time-series-forecasting
Last synced: 12 Jan 2025
https://github.com/tallamjr/epfl-functional-scala
Materials and worked assignments for Functional Programming with Scala Specialization on Coursera
Last synced: 10 Feb 2025
https://github.com/opt-nc/opt-temps-attente-agences-camel
Pull data from opt-temps-attente-agences-api and store it in various systems
camel datascience dataviz glia innovation kafka opensearch relation-client spark
Last synced: 12 Dec 2024
https://github.com/pzim-devdata/data-developer
All my DATA developer projects
correlation data-analysis data-mining data-science data-visualization database folium folium-maps mongodb mysql python spark sql
Last synced: 07 Feb 2025
https://github.com/tuancamtbtx/java-spark-example
Spark ETL Generic Processor
Last synced: 02 Jan 2025
https://github.com/multivacplatform/multivac-elasticsearch
Demoing Spark 2.2 and Elasticsearch Hadoop connector
Last synced: 12 Jan 2025
https://github.com/multivacplatform/multivac-nlp
Testing and benchmarking some of the existing NLP libraries in Apache Spark
nlp spark spark-ml spark-mllib spark-nlp spark-sql stanford-corenlp word2vec
Last synced: 12 Jan 2025
https://github.com/luisfalva/ophelia
Ophelian On Mars! More than a simple framework.
dask dataframe ophelia ophelia-spark rdd spark spark-ml spark-mllib spark-streaming
Last synced: 17 Dec 2024
https://github.com/mukjepscarlet/bilibili-predict-recommend
[Big data course assignment] Bilibili assistant: video recommendation + popularity prediction
bilibili flask hadoop html javascript prediction pyspark python recommendation spark
Last synced: 18 Jan 2025
https://github.com/bomada/sparkify
This is the final capstone project of the Udacity Data Scientist Nanodegree program. The aim is to learn how to manipulate realistic datasets with Spark and engineer relevant features for predicting churn. The input data comes from the fictional music streaming service Sparkify (similar to Spotify and Pandora).
churn ml music portfolio python spark streaming
Last synced: 09 Feb 2025
https://github.com/starhe/balm
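A big data PaaS component adapter built on the Spring Boot stack. It adapts to DolphinScheduler, Hadoop, Spark, Hive, Impala, HBase, Kafka, StarRocks, ClickHouse, and Neo4j in one step, operates through standard REST interfaces, and is simple to use and easy to extend and integrate.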
clickhouse dolphinscheduler hadoop hbase hive impala kafka neo4j spark spring starrocks
Last synced: 21 Dec 2024
https://github.com/mxagar/spark_big_data_guide
This repository contains my personal guide on Spark and topics related to Big Data.
big-data hadoop machine-learning spark
Last synced: 23 Dec 2024