Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
![](https://explore-feed.github.com/topics/spark/spark.png)
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-02-11 00:28:31 UTC
https://github.com/melezhik/sparrowdo-spark
Quick Spark Installer for CentOS and Docker
Last synced: 23 Dec 2024
https://github.com/ev2900/emr_studio_stock_price_demo
Demo EMR Studio notebook using PySpark to explore Stock Price Data
Last synced: 23 Dec 2024
https://github.com/ev2900/glue_spark_history_server
Host a Docker container for the Spark history server / Spark UI of AWS Glue jobs
aws glue spark spark-history-server spark-ui
Last synced: 23 Dec 2024
https://github.com/manuparra/clustering-openstack
Make a dynamic and customizable cluster with OpenStack
cluster deployment hadoop openstack openstack-command script slave-nodes spark
Last synced: 27 Dec 2024
https://github.com/thdaraujo/cheat
A handful of cheatsheets and programming tips.
bash cheat-sheets cheatsheet dms hadoop postgresql spark sqoop
Last synced: 24 Jan 2025
https://github.com/tianzhipeng-git/wdsdatasource
WdsDataSource is a Spark data source implementation that allows reading and writing data in WebDataset format
Last synced: 21 Jan 2025
https://github.com/chrispyl/learning-latent-representations-for-nitrogen-response-rate-prediction
Implementation for the paper 'Learning latent representations for operational nitrogen response rate prediction'
Last synced: 17 Jan 2025
https://github.com/mileszim/ember-particle
Ember service for the Particle API
api ember ember-addon ember-cli-addon iot particle particle-io spark
Last synced: 27 Jan 2025
https://github.com/iversonson/spark-lite-document-translator
This project aims to provide a fast and efficient document translation solution using Spark Lite's machine learning APIs
Last synced: 17 Jan 2025
https://github.com/positlabs/spark-picker-animations
Animated Native UI Picker Icons in Spark AR
augmented-reality instagram spark spark-ar
Last synced: 02 Feb 2025
https://github.com/fishercoder1534/hbaseexample
aws-s3 cluster hadoop-mapreduce hbase hive spark sparkjava
Last synced: 20 Jan 2025
https://github.com/nthaihoc/segmentation-customer-hadoop-spark-mlops-icta-2024
An automatic machine learning based customer segmentation model with RFM analysis at ICTA conference 2024
dbscan-clustering-algorithm dvc-pipeline feature-engineering hadoop k-means-clustering machine-learning mlops-workflow spark
Last synced: 21 Jan 2025
https://github.com/oracle-quickstart/oci-hortonworks
Terraform module to deploy Hortonworks on Oracle Cloud Infrastructure (OCI)
cloud hadoop hdf hdp hortonworks oci oracle partner-led spark terraform
Last synced: 07 Nov 2024
https://github.com/vicnesterenko/apache-spark-labs
Base programs with datasets
apache-spark kpi-fict kpi-ua spark
Last synced: 10 Jan 2025
https://github.com/rockfordwei/anagram
Anagram Solution Servers in Different Languages/Frameworks
anagram hdfs java javascript php python server spark swift
Last synced: 12 Jan 2025
https://github.com/pierrekieffer/sparkstreaming_kafkaconsumer
Kafka consumer example based on Spark Streaming, with messages formatted into a Spark DataFrame
kafka kafka-consumer scala spark spark-streaming
Last synced: 07 Feb 2025
https://github.com/darenr/spark-pca
Dimensionality reduction, scatter, hexbin, and KDE plots
Last synced: 05 Feb 2025
https://github.com/silvanheller/parquet-demo
Parquet demo project for the Workshop in the Course DIS. Benchmarks Parquet versus ORC, JSON and CSV
benchmark orc parquet r scala spark university-project
Last synced: 27 Jan 2025
https://github.com/evegen55/car_number_recognizer
computer-vision neural-networks spark
Last synced: 22 Jan 2025
https://github.com/dunnkers/pyspark-bucketmap
Easily group pyspark data into buckets and map them to different values.
bucketizer categorizer pyspark pyspark-mllib python python3 spark
Last synced: 29 Jan 2025
https://github.com/vietdoo/sg-property-hub
SG Property Hub is a comprehensive platform for managing and analyzing property data.
airflow celery-redis crawler etl etl-pipeline fastapi minio mongodb nextjs postgresql s3 spark webscraping
Last synced: 07 Feb 2025
https://github.com/ronaldkanyepi/log-realtime-analysis
A scalable architecture for real-time log processing and visualization. Built with a Kafka-Spark ETL pipeline, DynamoDB for storing aggregate real-time metrics, and Python Dash for interactive dashboards. Designed for high-throughput log ingestion, real-time monitoring, and long-term storage.
dash docker docker-compose docker-container dynamodb etl etl-pipeline hdfs kafka kafka-consumer kafka-producer kafka-streams kafka-topic logs python realtime spark spark-streaming streaming visualization
Last synced: 25 Dec 2024
https://github.com/zncdatadev/spark-k8s-operator
Operator for Apache Spark-on-Kubernetes of the Kubernetes Data Stack
Last synced: 19 Nov 2024
https://github.com/hpgrahsl/gab2016streamanalytics
Repository with materials for my Session at Global Azure Bootcamp 2016
azure bootcamp spark storm streamanalytics
Last synced: 08 Jan 2025
https://github.com/izeigerman/twinkle
The collection of helpers and utils for Apache Spark
Last synced: 08 Feb 2025
https://github.com/fsanaulla/spark-http-rdd
RDD primitive for fetching data from an HTTP source
Last synced: 12 Oct 2024
https://github.com/fanqingsong/machine_learning_system_on_spark
A simple machine learning system demo (clustering and prediction on the Iris data) for ML study. Based on the machine_learning_system repo, it adds a new process for serving the ML model with Celery and Spark.
celery django machine-learning reactjs spark
Last synced: 21 Dec 2024
https://github.com/ferranbt/sparkanywhere
Run Apache Spark multicloud and serverless
Last synced: 01 Jan 2025
https://github.com/exasol/spark-connector-common-java
Common library for Exasol Apache Spark based connectors
apache-spark exasol exasol-integration spark streaming
Last synced: 09 Feb 2025
https://github.com/tuancamtbtx/spark-build-tool
Tool for generating Spark jobs
Last synced: 11 Oct 2024
https://github.com/giuliosmall/twitter-trending-topics-pipeline
This project demonstrates trending topic detection using Apache Spark and MinIO. It processes Twitter JSON data with PySpark, leveraging distributed data processing and cloud storage. The entire project is containerized with Docker for easy deployment across architectures.
docker minio nlp pyspark pytest spacy spark streamlit
Last synced: 05 Feb 2025
https://github.com/alimarzouk/paris-aq
ELTL pipeline to monitor air quality in the Paris Île-de-France area
airflow airquality big-data bigquery dataengineering gcs spark
Last synced: 22 Jan 2025
https://github.com/vermicida/data-lake
Data Lake: the code for project #4 of Udacity's Data Engineer Nanodegree Program
aws-s3 data-engineering data-lake etl-pipeline python spark
Last synced: 26 Dec 2024
https://github.com/vs4vijay/proof-of-concepts
A set of PoCs that I have worked on
apache apache-kafka apache-spark authentication bitcoin blockchain blockchain-technology chirp chirp-sdk flask gui kafka poc proof-of-concept proof-of-work pykafka pyspark python python3 spark
Last synced: 10 Jan 2025
https://github.com/arun-george-zachariah/twitteranalytics
Web application to visualize interesting analytic Spark SQL queries executed on tweets about five famous brands: Adidas, Nike, Puma, Skechers, and Reebok.
analytics distributed-computing docker spark twitter
Last synced: 26 Dec 2024
https://github.com/same-ou/spark-hdfs-ml
Spark and HDFS cluster using Docker and Docker Compose
Last synced: 25 Dec 2024
https://github.com/fbraza/data-processing-scala-spark
A repository that contains code in Scala using spark to process a log data file. The full procedure to run the application can be read in the README.md file.
Last synced: 26 Jan 2025
https://github.com/manojpawar94/spark-scala-examples
Sample programs implemented with Apache Spark, built on the concepts of Spark RDDs and Spark SQL DataFrames.
apache-spark spark spark-rdd spark-sql
Last synced: 13 Jan 2025
https://github.com/crazybber/go-jupyter
Exploring big data with Spark in JupyterLab
bigdata jupyter-notebook jupyterlab rdd spark
Last synced: 28 Jan 2025
https://github.com/tuancamtbtx/etl-spark-k8s
ETL With Apache Spark Deployed on K8s
apache k8s spark spark-sql spark-streaming
Last synced: 02 Jan 2025
https://github.com/tuancamtbtx/python-spark-example
Spark template to submit to cluster
Last synced: 02 Jan 2025
https://github.com/tuancamtbtx/bigdata-spark-processing
Spark Batch Process
Last synced: 02 Jan 2025
https://github.com/peteprattis/road-safety-database-with-jdbc-and-spark-rdd
A JDBC application that runs queries in pgAdmin to simulate the functionality of the UK Ministry of Transport's database, using Apache Spark RDDs to implement the queries.
computer-science index java jdbc jdbc-database partitions pgadmin postgresql program query spark spark-sql sparkjava sql student
Last synced: 18 Jan 2025
https://github.com/peteprattis/insurance-company-database-with-jdbc-and-spark-rdd
A JDBC application that runs queries in pgAdmin to simulate the functionality of an insurance company's database, using Apache Spark RDDs to implement the queries.
computer-science java jdbc jdbc-database partitioning partitions postgresql program query spark spark-sql sparkjava sql student
Last synced: 18 Jan 2025
https://github.com/javaidiqbal11/arabic-tweets-sentiment-analysis-using-spark
Sentiment analysis of an Arabic Twitter dataset using Apache Spark.
apache-spark arabic-nlp arabic-tweets flask pyhton3 sentiment-analysis spark twitter-api
Last synced: 03 Jan 2025
https://github.com/hexnn/balm
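A big data PaaS component adapter built on the Spring Boot stack. It adapts to DolphinScheduler, Hadoop, Spark, Hive, Impala, HBase, Kafka, StarRocks, ClickHouse, Neo4j, Redis, and ElasticSearch with one click, and is operated through a standard REST interface and SQL statements. Simple to use, and convenient for secondary development and rapid integration.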
clickhouse datax dolphinscheduler elasticsearch hadoop hbase hive impala kafka maxcompute neo4j phoenix presto spark starrocks
Last synced: 21 Dec 2024
https://github.com/fpopic/hf-interview-challenge
(Interview) Mixin Data Engineering & Data Science with PySpark
data-engineering data-science pyspark python recipes spark
Last synced: 10 Jan 2025
https://github.com/scrapcodes/kafkaproducer
Benchmarks to measure latency using Spark and Kafka.
Last synced: 03 Jan 2025
https://github.com/f-lab-edu/league-of-legends-data-solution
Benchmarked on League of Legends, this project provides a data solution that keeps data flowing smoothly in real time through an API that generates player behavior events.
Last synced: 11 Oct 2024
https://github.com/rishav273/spark-cluster-multi-node-setup
Quickly set up and simulate a multi-node Spark cluster using Docker and docker-compose.
docker docker-compose pyspark python3 spark
Last synced: 11 Oct 2024
https://github.com/scrapcodes/spark-templates
One-stop shop for Apache Spark starter samples.
Last synced: 03 Jan 2025
https://github.com/cleberzumba/data-analysis-with-apache-spark-and-databricks
San Francisco Fire Calls. Creating a Spark application on Databricks using PySpark and SQL for common data analytics patterns and operations on a San Francisco Fire Department Calls dataset.
Last synced: 16 Nov 2024
https://github.com/sankamuk/aws-kinesis-redshift-sparkstream
Spark Structured Streaming from AWS Kinesis and Redshift
aws kinesis pyspark redshift spark structured-streaming terraform
Last synced: 13 Jan 2025
https://github.com/chimera-suite/spark-sidecar-setup
The sidecar setup container executes SparkSQL scripts against an Apache Spark instance.
docker setup sidecar-container spark sparksql
Last synced: 03 Jan 2025
https://github.com/chimera-suite/use-case
A step-by-step tutorial that showcases the capabilities of Chimera
chimera jena-fuseki knowledge-graph ontology pizza spark sparql-query
Last synced: 03 Jan 2025
https://github.com/nikoshet/pyspark-movie-similarities
Using Spark In Python For Movie Similarities With Jaccard Index
jaccard-index movie-similarities pyspark spark
Last synced: 03 Jan 2025
https://github.com/nikoshet/spark-mlp
Multilayer Perceptron Implementation Using Spark
hdfs machine-learning mapreduce multilayer-perceptron pyspark python spark
Last synced: 03 Jan 2025
https://github.com/palutz/functionalscala_coursera
Functional programming in Scala Certification path (EPFL)
big-data coursera distributed-computing functional-programming parallel-computing scala spark
Last synced: 17 Jan 2025
https://github.com/pierrekieffer/genericsupervisedmachinelearning
Generic supervised machine learning application
Last synced: 07 Feb 2025
https://github.com/akaliutau/spark-recipes
Contains a collection of data processing solutions built on top of Spark
Last synced: 11 Jan 2025
https://github.com/marcorfilacarreras/matemaquest
A simple API to get information about the "Pruebas Canguro" exams
api docker github-actions java math mathematics spark
Last synced: 13 Jan 2025
https://github.com/azlinrusnan/movielens_data_analysis_with_mongodb_and_cassandra
This project presents an analysis of the MovieLens 100k dataset using Apache Spark integrated with MongoDB and Cassandra. The dataset includes user information, movie ratings, and movie details, providing a comprehensive basis for exploring user preferences and movie popularity.
cassandra ml-100k mongodb python spark
Last synced: 17 Jan 2025
https://github.com/drsnowbird/nlp-deeplearning-projects
NLP Deep Learning Projects (Warning - Not ready for public consumption yet!)
chatbot deep-learning mallet nlp python3 rasa-core rasa-nlu spark tensorflow
Last synced: 13 Jan 2025
https://github.com/abdellatif-laghjaj/big-data-project
Big data and image processing project
big-data facedetection image-preprocessing image-processing pyspark realtime-detection spark
Last synced: 17 Jan 2025
https://github.com/shayartt/streaming-orders
Project to stream real-time orders and apply some ETL pipelines & analytics using DataBricks, Kafka, AWS
databricks etl kafka python spark spark-streaming
Last synced: 12 Oct 2024
https://github.com/divithraju/divith-raju-data-mining
This project focuses on customer segmentation using data mining techniques, specifically K-Means clustering, to classify customers into distinct groups based on their purchasing behaviors. The goal is to analyze customer data and segment them into clusters for targeted marketing strategies and better customer relationship management.
algorthims analytics apache business client connector data dataarchitecture database dataengineering datamining datascience hadoop k-means-clustering mysql project project-repository pyspark python3 spark
Last synced: 17 Jan 2025
https://github.com/tsovak/spark-demo
The Spark REST API with Spring Boot and MongoDB
docker-compose mongodb rest-api spark sparkjava sparkrest spring-boot
Last synced: 08 Feb 2025
https://github.com/NashTech-Labs/spark-on-mesos
deployment mesos spark word-count
Last synced: 23 Oct 2024
https://github.com/azurespheredev/microsoftfabric-exploratorium
A comprehensive educational resource hub dedicated to mastering Microsoft Fabric, offering in-depth tutorials, real-world use cases, and hands-on guides for seamless end-to-end analytics
analytics data-science data-transformation lakehouse microsoft-fabric one-lake powerbi real-time-analytics spark warehouse
Last synced: 11 Jan 2025
https://github.com/stabrise/scaledp-tutorials
Tutorials for ScaleDP library. ScaleDP is an Open-Source Library for Processing Documents in Apache Spark.
ner nlp ocr ocr-python pdf spark
Last synced: 30 Jan 2025
https://github.com/tallamjr/jetspark
Spark cluster on Jetson TX2 mini-project
Last synced: 10 Feb 2025
https://github.com/aldantanneo/bigints
WIP constant time bigint implementation in SPARK
ada bigint cryptography formal-verification spark
Last synced: 30 Jan 2025
https://github.com/abrahamkoloboe27/random-user-streaming-pipeline
Data Engineering Project - Data Pipeline
airflow airflow-dags api docker docker-compose etl etl-pipeline extract-transform-load kafka kafka-consumer kafka-producer makefile orchestration postgresql python schema-registry spark spark-streaming zookeeper
Last synced: 30 Jan 2025
https://github.com/codelytv/spark-kafka_rabbitmq_sqs-course
Course examples for integrating Spark with queue systems
apache-spark aws-sqs kafka rabbitmq spark
Last synced: 30 Jan 2025
https://github.com/mahi97/internship-elk-loganalysis
Report on the development and deployment of an ELK Stack for MCI BI software and servers to perform real-time log analysis
elasticsearch kafka kibana latex logstash mesos redis spark
Last synced: 05 Feb 2025
https://github.com/alexott/cyber-spark-data-connectors
Cybersecurity-related custom data connectors for Spark
cybersecurity databricks pyspark spark
Last synced: 30 Jan 2025
https://github.com/hungreeee/reddit-realtime-streaming-pipeline
End-to-end real-time pipeline that processes comments from any subreddit for sentiment analysis.
cassandra docker-compose kafka praw-reddit real-time reddit-api spark
Last synced: 12 Jan 2025
https://github.com/librity/rtjvm_spark_essentials
Rock The JVM - Apache Spark Essentials
apache-spark big-data docker scala spark spark-sql
Last synced: 08 Jan 2025
https://github.com/20cent16/airflow-spark
If you want to use airflow with spark, ready to use ;-)
Last synced: 11 Oct 2024
https://github.com/rupeshtr78/awsiot
AWS IOT Integration Using EMR Spark Kinesis
aws aws-emr dynamodb iot kinesis spark spark-streaming
Last synced: 12 Jan 2025
https://github.com/inf0rmatiker/model-service
A service providing federated model training for spatially-segregated data.
Last synced: 08 Jan 2025
https://github.com/rupeshtr78/aws-emr
Spark Job on Amazon EMR cluster
aws cluster emr-cluster mapreduce mapredue scala spark
Last synced: 12 Jan 2025
https://github.com/sebastianruizm/pyspark-graphframes
Data analysis with GraphFrames and PySpark
Last synced: 08 Jan 2025
https://github.com/jbris/time-series-airflow-kafka-spark
A simple demonstration of an Airflow-Kafka-Spark (AKS) stack for online time series forecasting.
airflow airflow-dags bentoml bentoml-service kafka kafka-consumer kafka-producer kafka-streams minio mlflow mlflow-tracking-server mlops mlops-workflow online-learning spark spark-sql spark-streaming time-series time-series-analysis time-series-forecasting
Last synced: 12 Jan 2025