Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
![](https://explore-feed.github.com/topics/spark/spark.png)
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-02-15 00:25:38 UTC
https://github.com/mdh266/twittersentimentanalysis
Twitter Sentiment Analysis using Spark, MongoDB, and Google Cloud
data-science etl google-cloud machine-learning mongodb natural-language-processing nlp pyspark sentiment-analysis spark sparkml twitter twitter-sentiment-analysis
Last synced: 04 Dec 2024
https://github.com/omarhimada/floyo-ml-scala
Distributed ML for eCommerce platforms (recommendations, churn prediction, segmentation) written in Scala, using Spark MLlib, Elasticsearch and AWS SDK
Last synced: 09 Feb 2025
https://github.com/timvisee/hhs-p7-movie-recommendation-engine
:movie_camera: Big data project for college (HHS) period 7
algorithm hadoop recommendation-engine spark
Last synced: 15 Jan 2025
https://github.com/vivek-bombatkar/dataworkssummit2018_spark_ml
Hands-on introduction to basic machine learning techniques with Apache Spark ML using the cloud.
apache-spark linear-regression machine-learning spark workshop
Last synced: 08 Nov 2024
https://github.com/spycsh/runspec
an android streaming running app with backend based on kafka+spark+mongodb
android iot kafka leafletjs mongodb restlet spark stompwebsocket ubiquitous-computing
Last synced: 05 Dec 2024
https://github.com/easonlai/databricks_odbc_connection_to_azure_sql_db_with_azure_ad_user_access_token
Making ODBC connection from Databricks (Azure Databricks) to Azure SQL Database with Azure AD User Access Token.
azure azuread azuredatabricks azuresql azuresqldb bigdata data-analysis dataanalysis dataanalytics databricks databricks-notebooks datascience microsoft microsoft-azure microsoftazure odbc odbc-driver pandas pyodbc spark
Last synced: 08 Jan 2025
https://github.com/ahmetfurkandemir/trendyol-data-engineering-technical-case-study
Trendyol Data Engineering Technical Case Study.
apache-spark case-study data-engineering debian docker maven scala spark trendyol trendyoltech ubuntu
Last synced: 17 Jan 2025
https://github.com/michabirklbauer/mahout_docker
Running Apache Mahout in Docker.
apache docker dockerfile hadoop mahout maven spark
Last synced: 04 Jan 2025
https://github.com/vanessaaleung/ad-ctr-prediction
Ads Click-Through-Rate Prediction
ctr deep-learning prediction python scikit-learn spark
Last synced: 08 Jan 2025
https://github.com/lgautier/pragmatic-polyglot-data-analysis
Docker container for off-the-shelf jupyter notebook + Python + R + Spark/pyspark + LLVM
docker-container jupyter-notebook python r spark
Last synced: 10 Nov 2024
https://github.com/curycu/sparkstudy
Example code for Spark SQL data wrangling
Last synced: 05 Nov 2024
https://github.com/denuvosoftwaresolutions/fighting-bots-at-scale
Fighting Bots at Scale: Identifying Bottlenecks & Best Practice
Last synced: 25 Dec 2024
https://github.com/felipekunzler/spark-twitter-analysis
Analyse a Twitter dataset with Spark and visualize the results on a React dashboard.
Last synced: 30 Oct 2024
https://github.com/dustin-decker/lognom
Simple script for processing streaming data from Redis using Apache Spark
elasticsearch kafka redis spark
Last synced: 29 Jan 2025
https://github.com/neo4j-field/bigquery-connector
Bi-directional connectivity between Google BigQuery and Neo4j AuraDS
arrow-flight bigquery neo4j protobuf python spark
Last synced: 23 Dec 2024
https://github.com/imlegend19/vidspark
VidSpark is a prototype video CMS backend system powered by spark and elasticsearch
celery elasticsearch python redis scala spark
Last synced: 14 Jan 2025
https://github.com/wlongxiang/pyspark_docker
Run pyspark cluster with docker on your local laptop
docker docker-compose pyspark pyspark-docker pyspark-tutorial spark
Last synced: 17 Dec 2024
https://github.com/colemurray/movie-rec-tutorial
cloud-infrastructure google-dataproc machine-learning spark tutorial
Last synced: 13 Dec 2024
https://github.com/tayeva/satellite-kafka-spark-delta-lake-pipeline-example
Demo App - Satellite Produce Consumer App
cpp17 delta-lake docker docker-compose flatbuffers java kafka parquet scala spark spark-streaming
Last synced: 12 Feb 2025
https://github.com/hdfgroup/hdf5-spark-connector
HDF5 Connector for Apache Spark
Last synced: 19 Dec 2024
https://github.com/s8sg/spark-py-submit
A Python library to submit Spark jobs to YARN clusters on different distributions (currently CDH, HDP)
cdh hdfs hdp python-library spark spark-clusters spark-job
Last synced: 01 Feb 2025
https://github.com/sebastianruizm/spark-kafka-cassandra
Demo Spark Structured Streaming + Apache Kafka + Apache Cassandra
cassandra docker kafka spark structured-streaming
Last synced: 11 Nov 2024
https://github.com/puharesource/simplemavenrepository
A simple self hosted maven repository solution, written in Kotlin using the SparkJava framework.
kotlin maven repository spark sparkjava
Last synced: 01 Jan 2025
https://github.com/anant/example-cassandra-spark-elasticsearch
cassandra datastax docker elasticsearch scala spark spark-sql
Last synced: 19 Jan 2025
https://github.com/jacopodl/spark
Low level network library :satellite: :zap:
c low-level network network-programming networking raw raw-data raw-sockets spark
Last synced: 31 Jan 2025
https://github.com/enoy19/keyboard-light-composer-mc-connector
Minecraft Forge Mod to access stats in Minecraft within the Keyboard Light Composer (https://github.com/enoy19/keyboard-light-composer)
composer forge g910 keyboard light logitech minecraft mod orion rgb spark spectrum
Last synced: 17 Jan 2025
https://github.com/sircamp/spark-pspectrum
P-spectrum embedding and sequence relaxation for NLP in Spark
big-data machine-learning nlp nlp-machine-learning sequence-relaxation spark spark-ml spectrum
Last synced: 20 Jan 2025
https://github.com/abdelmajidlh/spark-functionality-repo
This GitHub repository contains a detailed document on the basics of the Scala language.
apache apachespark databricks databricks-notebooks pyspark python3 scala spark
Last synced: 05 Feb 2025
https://github.com/yandex-cloud/yc-delta
Delta Lake for Yandex Data Processing
delta delta-lake deltalake spark yandex-cloud
Last synced: 11 Nov 2024
https://github.com/julienpeloton/mini_spark_broker
Design and proof-of-concept for a Broker for astronomy using Apache Spark
docker kafka python spark spark-structured-streaming
Last synced: 14 Feb 2025
https://github.com/jlgarridol/tfm-fis-if
Big Data Architecture of queues for real time video processing
big-data docker kafka parkinsons-disease spark streaming streaming-video
Last synced: 13 Jan 2025
https://github.com/exacaster/markdown_frames
Markdown tables parsing to pySpark/Pandas DataFrames
Last synced: 11 Nov 2024
https://github.com/radeity/spark-proxy
push-based calculation for spark application
distributed-computing spark volunteer-computing
Last synced: 01 Feb 2025
https://github.com/aromoh/keras-distributed-streaming
Distributed Keras model for making predictions of sentiment from Spanish sentences in stream context using Spark Streaming and Apache Kafka
cnn-keras kafka keras keras-tensorflow pyspark-notebook sentiment-analysis spark spark-streaming
Last synced: 03 Feb 2025
https://github.com/anicolaspp/maprdbconnector
An independent MapR-DB Connector for Apache Spark that fully utilizes MapR-DB secondary indexes
database-connector mapr mapr-db maprdb-spark ojai scala spark
Last synced: 16 Nov 2024
https://github.com/anicolaspp/mapr-data-gen
Data generator for MapR Data Platform
data mapr mapr-db mapr-es mapr-streams maprdb parquet scala spark
Last synced: 16 Jan 2025
https://github.com/abronte/pysparkgateway
Connect to remote Spark clusters seamlessly.
apache-spark bigdata pyspark python spark
Last synced: 28 Oct 2024
https://github.com/guiferviz/tuberia
Data engineering meets software engineering
data data-engineering expectations pipeline python spark
Last synced: 13 Feb 2025
https://github.com/wittline/livyc
Apache Spark as a Service with Apache Livy Client
apache-livy apache-spark big-data data-engineering dataengineering docker livy-client livy-docker pyhton spark
Last synced: 14 Oct 2024
https://github.com/conema/spark-terraform
This project creates a Hadoop and Spark cluster on Amazon AWS with Terraform
aws cluster hadoop hadoop-cluster hcl spark spark-clusters terraform
Last synced: 20 Nov 2024
https://github.com/fiqryq/spark-minimal-gray
🥰 Simple Instagram filter using Spark AR Studio by Facebook.
Last synced: 26 Jan 2025
https://github.com/fiqryq/sparkar-pekerjaan-impian
Instagram Filter Using Spark AR
augmented-reality-applications facebook instagram spark
Last synced: 26 Jan 2025
https://github.com/dllllb/ml-pipelines-tutorial
SciKit-Learn vs Apache Spark pipelines
machine-learning scikit-learn spark
Last synced: 19 Jan 2025
https://github.com/smsraj2001/stream-batch-processing-kafka-spark
A project that simulates real-time queries with Kafka and performs stream and batch processing of the simulated queries with Spark. It follows the lambda architecture, with Kafka as the publisher and Spark as the subscriber.
batch-processing kafka kafka-topics lambda-architecture mysql-database no-api pub-sub pyspark python3 realtime spark streaming ubuntu2204 zookeeper
Last synced: 01 Jan 2025
https://github.com/ahmetfurkandemir/hepsiburada-data-engineering-project
Hepsiburada Data Engineering Project
Last synced: 17 Jan 2025
https://github.com/mrcolorr/supreme-pancake
Big Data Management project: The collection of data from a network of sensors was simulated (kafka), which then had to be processed (spark) and stored (cassandraDB) in a distributed and efficient way.
big-data bigdata cassandra cassandra-cluster cassandra-database cloud cloud-computing distributed-computing distributed-database distributed-storage distributed-systems hdfs kafka maven maven-pom spark zerotier zerotier-network zerotier-one
Last synced: 13 Nov 2024
https://github.com/chen0040/spark-opt-moea
Distributed Multi-Objective Evolutionary Computation Framework for Spark
moea multi-objective-optimization nsga-ii spark
Last synced: 09 Feb 2025
https://github.com/bonigarcia/spark-examples
Collection of Spark examples using Python
cassandra influxdb kafka python spark spark-streaming
Last synced: 08 Feb 2025
https://github.com/adrigrillo/nycsparktaxi
Apache Spark application to get the top ten frequent routes and profitable areas
big-data nyc parquet-files python spark taxi
Last synced: 10 Feb 2025
https://github.com/tomwhite/disq-original
A library for manipulating bioinformatics sequencing formats in Apache Spark.
bioinformatics genomics ngs sequencing spark
Last synced: 10 Feb 2025
https://github.com/ging/fiware-cosmos
The Cosmos Generic Enabler enables an easier BigData analysis over context integrated with some of the most popular BigData platforms.
analysis big-data fiware fiware-cosmos flink processing real-time-analytics spark streaming-engine
Last synced: 01 Nov 2024
https://github.com/multivacplatform/multivac-kaggle-titanic
Simple example of Titanic competition by Spark 2.2
kaggle-competition machine-learning scala spark
Last synced: 12 Jan 2025
https://github.com/longshilin/spark-wordcount
Spark wordcount example | built as an Eclipse+Maven+Scala Project+Spark
helloworld maven scala scala-programming spark wordcount
Last synced: 10 Nov 2024
https://github.com/harshoza36/movielens_pyspark
MovieLens Dataset analysis using Hadoop and Pyspark
big-data-analytics hadoop movielens movielens-data-analysis pyspark spark spark-sql
Last synced: 10 Jan 2025
https://github.com/adovasoft-rnd/ci-recharge
composer test
ci4 cli codeigniter codeigniter4 commandline controller db library make migration mode mysql php seeds spark sql
Last synced: 14 Oct 2024
https://github.com/highoncarbs/lumberjack
:pick: Search and analyse your logs efficiently with Lumberjack
analysis flask log logging python spark web-dashboard
Last synced: 14 Oct 2024
https://github.com/gabfr/truck-data-wrangler
ELT (Extract, Load, Transform) process of accelerometer/gyroscope events with Apache Spark (w/ Structured Streaming) and TimescaleDB
data-classification spark stream timescaledb
Last synced: 07 Dec 2024
https://github.com/ashton-sidhu/sysmon-extract
Extract logs based off events from sysmon. Comes as a package, cli and ui.
data-science dataengineering infosec spark streamlit sysmon threat-intelligence threathunting
Last synced: 09 Nov 2024
https://github.com/majobasgall/smote-mr
SMOTE-MR: A distributed Synthetic Minority Oversampling Technique (SMOTE) for Big Data which applies a MapReduce-based approach. SMOTE-MR is categorized as an `approximated/non-exact` solution. There is also an `exact` solution called SMOTE-BD written by the author (see: https://github.com/majobasgall/smote-bd)
big-data imbalanced-data machile-learning scala smote spark
Last synced: 08 Jan 2025
https://github.com/chaokunyang/athena
A task scheduler for spark, flink, mapreduce, java, python, bash
flink hadoop mapreduce spark task-manager task-scheduler
Last synced: 19 Nov 2024
https://github.com/ichowdhury01/match
A social networking platform that allows users to find friends with similar interests in their area.
geolocation-api jdbc maven mysql pbkdf2 spark
Last synced: 06 Feb 2025
https://github.com/viniciusmsousa/pyspark-ds-toolbox
A Pyspark companion for data science tasks.
Last synced: 09 Feb 2025
https://github.com/kwartile/spark-benchmark
Spark Benchmark suite to evaluate cluster configuration and compare the performance with other big data frameworks.
apache-spark benchmark benchmarking-suite cdh cloudera-hadoop hadoop hive impala performance scala spark
Last synced: 08 Feb 2025
https://github.com/jinsyin/datalink
⚡ Data integration | DataLink is a lightweight data integration framework built on top of DataX, Spark and Flink
batch big-data bigdata cdc data data-collection data-exchange data-integration data-pipeline data-synchronization datalink etl flink flink-cdc framework integration pipeline spark streaming
Last synced: 15 Nov 2024
https://github.com/nhviet03/is405_bigdata_mapreduce_knn
A MapReduce-Based k-Nearest Neighbor Approach for Big Data Classification on Apache Spark
knn-classification mapreduce pyspark spark
Last synced: 17 Jan 2025
https://github.com/joyceannie/us-immigrations-data-warehouse
A data warehouse to perform analytics on the immigration trends in the US.
airflow data-engineering etl pyspark redshift s3 spark
Last synced: 29 Jan 2025
https://github.com/yoongoing/bigdata_pyspark
⚡️Let's try data mining with Spark, a publicly available MapReduce platform⚡️
bigdata dataminig jupyter-notebook mapreduce mapreduce-python pyspark spark
Last synced: 09 Feb 2025
https://github.com/anant/example-cassandra-spark-job-scala
apache-spark cassandra docker etl sbt scala spark
Last synced: 19 Jan 2025
https://github.com/renardeinside/terrametria
Source code for a 3D population density map of Germany, with ETL and app logic on top of the Databricks Platform.
databricks deckgl python react spark
Last synced: 03 Dec 2024
https://github.com/derlin/workshop-data-sciences
Material for a two-day workshop introducing data science for big data
big-data data-science hdfs hive spark workshop workshop-materials zeppelin
Last synced: 20 Jan 2025
https://github.com/codelytv/spark-best_practices_and_deploy-course
Deploy Spark course examples
Last synced: 03 Dec 2024
https://github.com/policratus/sparkmage
🐘 A tool for blazing fast analysis and clustering of similar images using 🐘 Hadoop and ⚡ Spark.
big-data computer-vision hadoop image-processing spark
Last synced: 02 Nov 2024
https://github.com/dvelkow/real_time_bulgarian_news_aggregator
An ETL-driven web scraping and data visualization project that aggregates news from multiple Bulgarian news sources in real-time and creates an interactive dashboard with the fetched data.
Last synced: 14 Feb 2025
https://github.com/tranthe170/nyc-taxi-pipeline
Building a data lakehouse with open source technology. Supports an end-to-end data pipeline, from source data on AWS S3 to the lakehouse and visualization.
airflow delta-lake hive lakehouse presto python s3 spark superset
Last synced: 17 Jan 2025
https://github.com/multivacplatform/multivac-wikipedia
Wonderful reusable codes, libraries and scripts to process Wikipedia page views by using Apache Spark.
data-frame multivac-wikipedia spark spark-sql wikipedia
Last synced: 12 Jan 2025
https://github.com/simplexspatial/osm-facts
Proofs and checks about osm pbf format and data content facts
Last synced: 15 Jan 2025
https://github.com/multivacplatform/multivac-fakenews
Detecting users and communities which propagate fake news on Twitter by Apache Spark
deep-learning fakenews machine-learning spark twitter
Last synced: 12 Jan 2025
https://github.com/burhanahmed1/big-data-analytics
Practice tasks in Python programming language using Hadoop, MRJob, PySpark for Big Data Analytics.
apache-spark hadoop hadoop-mapreduce jupyter-notebook mrjob pyspark python spark spark-sql sparksql
Last synced: 14 Feb 2025
https://github.com/cclient/elasticsearch-spark-upsert-from-kafka
The official elasticsearch-hadoop does not support upsert doc; this is implemented by modifying the source code. Spark Kafka streaming example: upsert { "upsert": {}, "doc": {...} }
elasticsearch elasticsearch-hadoop kafka kafka-streams spark upsert upsert-doc
Last synced: 16 Jan 2025
https://github.com/exacaster/delta-fetch
HTTP API on Delta Lake tables
big-data delta-lake parquet s3 spark
Last synced: 11 Nov 2024
https://github.com/inbravo/spark-movie-lens
Various examples of analytics using Apache Spark
Last synced: 02 Feb 2025
https://github.com/pomadchin/vlm-performance
GeoTrellis RasterSources Ingest benchmark
aws emr geotrellis gis raster spark
Last synced: 17 Jan 2025
https://github.com/prabaprakash/docker-pipeline-for-hadoop-n-spark-submit
Docker CI/CD Pipeline
apache-spark docker docker-compose docker-pipeline gocd-agent gocd-agent-docker gocd-server hadoop spark
Last synced: 14 Jan 2025
https://github.com/angelcervera/poc-drivingdistance
Proof of concept to implement a service to calculate the driving distance using osm network
akka openstreetmap osm osm4scala scala spark
Last synced: 10 Feb 2025
https://github.com/emso-exe/comercio_eletronico_brasileiro
Data analysis project on Brazilian e-commerce data made available by Olist via the Kaggle platform.
analise-de-dados ciencia-de-dados data-analytics data-science datascience e-commerce postgres postgresql pyspark python python-3 python3 spark spark-sql sql
Last synced: 16 Jan 2025
https://github.com/tranthe170/NYC-Taxi-pipeline
Building a data lakehouse with open source technology. Supports an end-to-end data pipeline, from source data on AWS S3 to the lakehouse and visualization.
airflow delta-lake hive lakehouse presto python s3 spark superset
Last synced: 14 Feb 2025
https://github.com/debanjansarkar/pyspark-maestro
This repo contains implementations of PySpark for real-world use cases for batch data processing, streaming data processing sourced from Kafka, sockets, etc., spark optimizations, business specific bigdata processing scenario solutions, and machine learning use cases.
json kafka kafka-python kafka-streams pyspark pyspark-api pyspark-machine-learning pyspark-mllib pyspark-streaming python3 spark spark-mllib spark-sql spark-streaming
Last synced: 14 Feb 2025