Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
![](https://explore-feed.github.com/topics/spark/spark.png)
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-02-15 00:25:38 UTC
https://github.com/hibuz/hadoop-docker
🐳 hadoop ecosystems docker image
data-engineering docker docker-compose flink hadoop hbase hive spark zeppelin
Last synced: 15 Nov 2024
https://github.com/kruglov-dmitry/yelp_data
End-to-end example of how to read big (well, comparatively) data from Kafka and write it into Cassandra using Spark Structured Streaming. Uses the Yelp dataset for illustration purposes.
cassandra kafka spark streaming yelp-dataset
Last synced: 19 Jan 2025
https://github.com/garciparedes/scala-examples
Set of awesome Scala Examples
breeze functional-programming java scala spark
Last synced: 16 Jan 2025
https://github.com/codelytv/spark-best_practices_and_deploy-course
Deploy Spark course examples
Last synced: 03 Dec 2024
https://github.com/aveek-saha/cricket-score-predictor
A big data application to predict the outcome of a T20 cricket match.
big-data big-data-analytics clustering pyspark spark spark-mllib
Last synced: 24 Dec 2024
https://github.com/aiday-mar/spark-recommendation-engine
Movie recommendation system built using Spark and Scala
recommendation-system scala spark university-project
Last synced: 05 Jan 2025
https://github.com/dharaneeshvrd/spark-examples
Spark Examples
pyspark spark spark-example spark-sql spark-streaming spark-streaming-kafka spark-structured-streaming
Last synced: 07 Nov 2024
https://github.com/multivacplatform/multivac-ml
Pre-trained ML models for Apache Spark
machine-learning nlp spark spark-ml
Last synced: 12 Jan 2025
https://github.com/manuparra/taller-bigdata-con-r
Big Data workshop with Apache Spark + R from Databricks cloud
bigdata cloudcomputing databricks r spark sparkr
Last synced: 27 Dec 2024
https://github.com/jf17/zeppelin-examples
big-data scala spark zeppelin-notebook
Last synced: 12 Jan 2025
https://github.com/chucheng92/sparkstreamingkafka
Spark Streaming logs to Kafka.
kafka spark spark-streaming streaming
Last synced: 01 Feb 2025
https://github.com/felixcheung/spark-build
Build Apache Spark
apache-spark docker-image dockerfile spark
Last synced: 01 Feb 2025
https://github.com/michelderu/cassandra-csv-analytics
How to leverage Astra, DSE and Spark for analytics on large CSV files.
Last synced: 20 Jan 2025
https://github.com/declaredata/fuse_python
PySpark-compatible Python client for DeclareData Fuse Server: a blazing fast data processing engine and drop-in alternative to Spark clusters.
data-processing pyspark rust-lang spark
Last synced: 13 Jan 2025
https://github.com/pranavshashidhara/movie-recommendation-system
This project focuses on developing a recommendation system utilizing various learning techniques, including collaborative filtering, matrix factorization, and restricted Boltzmann machines (RBMs).
big-data recommendation-system spark
Last synced: 13 Jan 2025
https://github.com/tadod12/airflow-spark-job
A workspace to experiment with Apache Spark and Airflow in a Docker environment
Last synced: 13 Jan 2025
https://github.com/mauriciovazquezm/spark_bigdata_architecture_project
Final project for the course 'Architecture for Large Data Volumes', taught in the Bachelor's program in Data Science at ITAM
data-stream-processing data-streaming pyspark python spark time-series
Last synced: 13 Jan 2025
https://github.com/20cent16/airflow-spark
A ready-to-use setup if you want to use Airflow with Spark ;-)
Last synced: 14 Feb 2025
https://github.com/NashTech-Labs/spark-on-mesos
deployment mesos spark word-count
Last synced: 23 Oct 2024
https://github.com/tsovak/spark-demo
The Spark REST API with Spring Boot and MongoDB
docker-compose mongodb rest-api spark sparkjava sparkrest spring-boot
Last synced: 08 Feb 2025
https://github.com/pierrekieffer/genericsupervisedmachinelearning
Generic supervised machine learning application
Last synced: 07 Feb 2025
https://github.com/sankamuk/aws-kinesis-redshift-sparkstream
Spark Structured Streaming from AWS Kinesis and Redshift
aws kinesis pyspark redshift spark structured-streaming terraform
Last synced: 13 Jan 2025
https://github.com/manojpawar94/spark-scala-examples
Sample programs implemented with Apache Spark, built on the concepts of Spark RDD and Spark SQL DataFrame.
apache-spark spark spark-rdd spark-sql
Last synced: 13 Jan 2025
https://github.com/fbraza/data-processing-scala-spark
A repository containing Scala code that uses Spark to process a log data file. The full procedure for running the application is described in the README.md file.
Last synced: 26 Jan 2025
https://github.com/same-ou/spark-hdfs-ml
Spark and HDFS cluster using Docker and Docker Compose
Last synced: 25 Dec 2024
https://github.com/exasol/spark-connector-common-java
Common library for Exasol Apache Spark based connectors
apache-spark exasol exasol-integration spark streaming
Last synced: 09 Feb 2025
https://github.com/fishercoder1534/hbaseexample
aws-s3 cluster hadoop-mapreduce hbase hive spark sparkjava
Last synced: 20 Jan 2025
https://github.com/thdaraujo/cheat
A handful of cheatsheets and programming tips.
bash cheat-sheets cheatsheet dms hadoop postgresql spark sqoop
Last synced: 24 Jan 2025
https://github.com/manuparra/clustering-openstack
Make a dynamic and customizable cluster with OpenStack
cluster deployment hadoop openstack openstack-command script slave-nodes spark
Last synced: 27 Dec 2024
https://github.com/vitalibo/distributed-heatmap-service
Simple distributed heatmap service on top of Apache HBase
aws hbase hbase-coprocessor heatmap spark spark-sql spring-boot
Last synced: 27 Dec 2024
https://github.com/ishaansathaye/csc369-introdistributedcomputing
Cal Poly Fall 2024 CSC 369 Intro to Distributed Computing
distributed-computing hadoop java map-reduce scala spark
Last synced: 09 Feb 2025
https://github.com/soumyadipta2020/sparkr_test
Sample Codes of Spark using R programming
r r-coding r-programming r-programming-language spark sparkr
Last synced: 05 Jan 2025
https://github.com/beiyuouo/mi-store-log-analysis
👨‍🦽 Pseudo Xiaomi Mall: big data e-commerce log analysis
flask full-stack java kafka python spark
Last synced: 02 Feb 2025
https://github.com/iamhatesz/dend-covid19
Capstone project from Udacity's Data Engineer Nanodegree program.
airflow aws redshift spark udacity udacity-data-engineer-nanodegree udacity-nanodegree
Last synced: 13 Jan 2025
https://github.com/dodat-12/airflow-spark-job
A workspace to experiment with Apache Spark and Airflow in a Docker environment
Last synced: 20 Dec 2024
https://github.com/daixinye/zjucst
Homework & experiments
blockchain cpp hadoop iot object-oriented spark
Last synced: 20 Jan 2025
https://github.com/tonyz0x0/parallel-ml
An implementation of parallel machine learning algorithms using Spark
Last synced: 02 Feb 2025
https://github.com/ev2900/iceberg_emr_athena
Resources from a virtual tech talk / workshop - Set Up and Use Apache Iceberg Tables on Your Data Lake
apache-iceberg athena aws emr spark
Last synced: 05 Nov 2024
https://github.com/ashbyt/scala-spark
Ashley Bythell - Spark/Scala code
kmeans-clustering rdd scala spark spark-sql spark-streaming streaming
Last synced: 05 Jan 2025
https://github.com/ev2900/emr_studio_deployment
Example Jupyter notebook for EMR Studio
Last synced: 05 Nov 2024
https://github.com/e2fyi/databricks-utils
`databricks-utils` is a Python package that provides several utility classes and functions that improve ease of use in Databricks notebooks.
aws databricks jupyter-notebooks notebook pyspark s3 spark vega vega-lite
Last synced: 16 Jan 2025
https://github.com/bytemedirk/pyspark3-docker
PySpark3 Docker container for testing & development. With OpenJDK, Spark 3.1.2, and Hadoop 2.7.
aws docker docker-image python spark
Last synced: 13 Jan 2025
https://github.com/facaiy/spark-for-the-impatient
Collection of short code snippets for impatient readers who want to start using Spark right away.
Last synced: 20 Jan 2025
https://github.com/bnvulpe/paperslab
The project aims to automate content classification and knowledge retrieval, and to analyze the temporal and thematic impact of research over a time period. It also considers offering users network analysis of communication within the community.
api-extractor big-data big-data-and-ml big-data-infrastructure docker elasticsearch etl-pipeline information-retrieval knowledge-discovery mysql neo4j network-analysis spark temporal-analysis
Last synced: 09 Feb 2025
https://github.com/tuancamtbtx/java-spark-example
Spark ETL Generic Processor
Last synced: 02 Jan 2025
https://github.com/tallamjr/epfl-functional-scala
Materials and worked assignments for Functional Programming with Scala Specialization on Coursera
Last synced: 10 Feb 2025
https://github.com/tallamjr/jetspark
Spark cluster on Jetson TX2 mini-project
Last synced: 10 Feb 2025
https://github.com/drsnowbird/nlp-deeplearning-projects
NLP Deep Learning Projects (Warning - Not ready for public consumption yet!)
chatbot deep-learning mallet nlp python3 rasa-core rasa-nlu spark tensorflow
Last synced: 13 Jan 2025
https://github.com/marcorfilacarreras/matemaquest
A simple API to get information of the "Pruebas Canguro" exams
api docker github-actions java math mathematics spark
Last synced: 13 Jan 2025
https://github.com/rishav273/spark-cluster-multi-node-setup
Quickly set up and simulate a multi-node Spark cluster using Docker and docker-compose.
docker docker-compose pyspark python3 spark
Last synced: 14 Feb 2025
https://github.com/positlabs/spark-picker-animations
Animated Native UI Picker Icons in Spark AR
augmented-reality instagram spark spark-ar
Last synced: 02 Feb 2025
https://github.com/najuzilu/dl-spark
Building a Data Lake with Spark
aws-emr aws-s3 data-engineering data-lake etl-pipeline spark
Last synced: 26 Jan 2025
https://github.com/pedropark99/spark_map
Easily apply a function over multiple columns of a Spark DataFrame
Last synced: 28 Nov 2024
https://github.com/nicklitwinow/hse-python-capstone-project
This project is a comprehensive data engineering and analytics solution built using modern technologies such as Airflow, Spark, PostgreSQL, MySQL, Kafka, and Docker. It orchestrates data ingestion, processing, replication, streaming, and analytics across multiple containers.
airflow analytics dataengineering docker etl kafka mysql postgresql python spark streaming
Last synced: 03 Feb 2025
https://github.com/yjham2002/tcp_conn_with_spark
:book: none
mysql protocol redis redis-client spark tcp tcp-server
Last synced: 06 Jan 2025
https://github.com/nkdwon/crud-spark
A CRUD application written in Java with PostgreSQL integration and the Spark framework, developed in the Eclipse environment
eclipse-ide git java maven pgadmin4 postgresql spark
Last synced: 06 Jan 2025
https://github.com/s8sg/spark-standalone-cluster
Spark Standalone Cluster With Zookeeper
docker docker-compose spark zookeeper
Last synced: 01 Feb 2025
https://github.com/zncdatadev/spark-k8s-operator
Operator for Apache Spark-on-Kubernetes of the Kubernetes Data Stack
Last synced: 19 Nov 2024
https://github.com/mukjepscarlet/bilibili-predict-recommend
[Big data course assignment] Bilibili assistant: video recommendation + trending prediction
bilibili flask hadoop html javascript prediction pyspark python recommendation spark
Last synced: 18 Jan 2025
https://github.com/pzim-devdata/data-developer
All my DATA developer projects
correlation data-analysis data-mining data-science data-visualization database folium folium-maps mongodb mysql python spark sql
Last synced: 07 Feb 2025
https://github.com/hexnn/balm
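A big data PaaS component adapter built on the Spring Boot stack. It adapts DolphinScheduler, Hadoop, Spark, Hive, Impala, HBase, Kafka, StarRocks, ClickHouse, Neo4j, Redis, and ElasticSearch in one step, and is operated through standard REST interfaces and SQL statements. It is simple to use and convenient for custom development and rapid integration.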
clickhouse datax dolphinscheduler elasticsearch hadoop hbase hive impala kafka maxcompute neo4j phoenix presto spark starrocks
Last synced: 13 Feb 2025
https://github.com/silvanheller/parquet-demo
Parquet demo project for the Workshop in the Course DIS. Benchmarks Parquet versus ORC, JSON and CSV
benchmark orc parquet r scala spark university-project
Last synced: 27 Jan 2025
https://github.com/nthaihoc/segmentation-customer-hadoop-spark-mlops-icta-2024
An automatic machine learning based customer segmentation model with RFM analysis at ICTA conference 2024
dbscan-clustering-algorithm dvc-pipeline feature-engineering hadoop k-means-clustering machine-learning mlops-workflow spark
Last synced: 21 Jan 2025
https://github.com/tianzhipeng-git/wdsdatasource
WdsDataSource is a Spark data source implementation that allows reading and writing data in WebDataset format
Last synced: 21 Jan 2025
https://github.com/tianzonglin/bigeyes
A distributed graph computing platform that enables simple visual analysis of large-scale relational data.
canvas distributed-computing graph-drawing spark websocket
Last synced: 30 Dec 2024
https://github.com/tomfran/lastfm-users-analysis
Last FM user data collection and analysis using Spark
Last synced: 06 Jan 2025
https://github.com/simbafl/spark-branch-2.4
Source code analysis of Spark 2.4
spark spark-mllib spark-sql spark-streaming sparksql
Last synced: 25 Dec 2024
https://github.com/omr5221/kafka-account-fraud-detector
Learning about Kafka and Spark with a project built off of an existing project
Last synced: 27 Jan 2025
https://github.com/elahe-dastan/jaraghe
How I learned Spark
count dataframe dataset spark transformations
Last synced: 14 Jan 2025
https://github.com/hpgrahsl/gab2016streamanalytics
Repository with materials for my Session at Global Azure Bootcamp 2016
azure bootcamp spark storm streamanalytics
Last synced: 08 Jan 2025
https://github.com/librity/rtjvm_spark_essentials
Rock The JVM - Apache Spark Essentials
apache-spark big-data docker scala spark spark-sql
Last synced: 08 Jan 2025
https://github.com/oracle-quickstart/oci-hortonworks
Terraform module to deploy Hortonworks on Oracle Cloud Infrastructure (OCI)
cloud hadoop hdf hdp hortonworks oci oracle partner-led spark terraform
Last synced: 07 Nov 2024
https://github.com/mileszim/ember-particle
Ember service for the Particle API
api ember ember-addon ember-cli-addon iot particle particle-io spark
Last synced: 27 Jan 2025
https://github.com/easonlai/sas_access_to_adls_databricks
Using SAS to authenticate and access ADLS Gen 2 from Azure Databricks
adlsgen2 azure azuredatabricks blob-storage blob-storage-account blobstorage data-analysis-python data-analytics databricks shared-access-signature spark
Last synced: 08 Jan 2025
https://github.com/sebastianhaeni/spark-zeppelin-docker
Docker files to run Spark and Zeppelin
Last synced: 14 Jan 2025
https://github.com/cbhihe/mesos_on_docker
Benchmark of CPU- and I/O-intensive operations for Mesos on Docker with Spark
benchmarking docker mapreduce mesos spark
Last synced: 02 Jan 2025
https://github.com/rdalmarco/datascience
Studies on data science, big data, and machine learning
estatistica pandas python r spark sql
Last synced: 03 Jan 2025
https://github.com/stefanofioravanzo/evolving-wikipedia-graph
Distributed processing of Wikipedia history files using Hadoop and Spark
distributed-processing hadoop-hdfs spark wikipedia
Last synced: 19 Jan 2025
https://github.com/oceanbase/spark-connector-oceanbase
Apache Spark Connectors for OceanBase.
apache-spark obkv obkv-hbase oceanbase spark spark-connector
Last synced: 08 Jan 2025
https://github.com/ltossian/bike-sales-data-metrics
Processing, storage, analysis, and visualization of a large CSV file and real-time bike sales data.
fastapi grafana hadoop kafka postgresql python spark
Last synced: 11 Feb 2025