Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
![](https://explore-feed.github.com/topics/spark/spark.png)
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-02-10 00:27:59 UTC
https://github.com/multivacplatform/multivac-wikipedia
Reusable code, libraries, and scripts for processing Wikipedia page views with Apache Spark.
data-frame multivac-wikipedia spark spark-sql wikipedia
Last synced: 12 Jan 2025
https://github.com/extwiii/bigdata-uc.san.diego
Unlock Value in Massive Datasets - UC San Diego
big-data classification data-science graph hadoop integration machine-learning management modeling neo4j processing regression spark
Last synced: 28 Jan 2025
https://github.com/multivacplatform/multivac-ml
Pre-trained ML models for Apache Spark
machine-learning nlp spark spark-ml
Last synced: 12 Jan 2025
https://github.com/gacwr/openuba-model-hub
Frontend, model registry, model search, and model marketplace for OpenUBA
analytics anomaly-detection cybersecurity datascience elasticsearch elk flask information-security machine-learning security siem sklearn spark tensorflow threathunting uba ueba user-behaviour
Last synced: 15 Jan 2025
https://github.com/ltossian/bike-sales-data-metrics
Processing, storage, analysis, and visualization of a large CSV file and of real-time bike sales data.
fastapi grafana hadoop kafka postgresql python spark
Last synced: 11 Oct 2024
https://github.com/georgegkonis/spark-decentralized-query-processing
Project for the academic course "Decentralized Data Technologies"
big-data decentralized-data jupyter python query-optimization spark
Last synced: 19 Dec 2024
https://github.com/jf17/zeppelin-examples
big-data scala spark zeppelin-notebook
Last synced: 12 Jan 2025
https://github.com/chucheng92/sparkstreamingkafka
Spark Streaming logs to Kafka.
kafka spark spark-streaming streaming
Last synced: 01 Feb 2025
https://github.com/felixcheung/spark-build
Build Apache Spark
apache-spark docker-image dockerfile spark
Last synced: 01 Feb 2025
https://github.com/michelderu/cassandra-csv-analytics
How to leverage Astra, DSE and Spark for analytics on large CSV files.
Last synced: 20 Jan 2025
https://github.com/declaredata/fuse_python
PySpark-compatible Python client for DeclareData Fuse Server: a blazing fast data processing engine and drop-in alternative to Spark clusters.
data-processing pyspark rust-lang spark
Last synced: 13 Jan 2025
https://github.com/pranavshashidhara/movie-recommendation-system
This project focuses on developing a recommendation system utilizing various learning techniques, including collaborative filtering, matrix factorization, and restricted Boltzmann machines (RBMs).
big-data recommendation-system spark
Last synced: 13 Jan 2025
https://github.com/tadod12/airflow-spark-job
A workspace to experiment with Apache Spark and Airflow in a Docker environment
Last synced: 13 Jan 2025
https://github.com/mauriciovazquezm/spark_bigdata_architecture_project
Final project for the course 'Architecture for Large Data Volumes', taught in the Bachelor's program in Data Science at ITAM
data-stream-processing data-streaming pyspark python spark time-series
Last synced: 13 Jan 2025
https://github.com/20cent16/airflow-spark
If you want to use airflow with spark, ready to use ;-)
Last synced: 11 Oct 2024
https://github.com/NashTech-Labs/spark-on-mesos
deployment mesos spark word-count
Last synced: 23 Oct 2024
https://github.com/tsovak/spark-demo
The Spark REST API with Spring Boot and MongoDB
docker-compose mongodb rest-api spark sparkjava sparkrest spring-boot
Last synced: 08 Feb 2025
https://github.com/pierrekieffer/genericsupervisedmachinelearning
Generic supervised machine learning application
Last synced: 07 Feb 2025
https://github.com/sankamuk/aws-kinesis-redshift-sparkstream
Spark Structured Streaming from AWS Kinesis and Redshift
aws kinesis pyspark redshift spark structured-streaming terraform
Last synced: 13 Jan 2025
https://github.com/manojpawar94/spark-scala-examples
Sample programs implemented with Apache Spark, built on the concepts of Spark RDDs and Spark SQL DataFrames.
apache-spark spark spark-rdd spark-sql
Last synced: 13 Jan 2025
https://github.com/fbraza/data-processing-scala-spark
A repository that contains code in Scala using spark to process a log data file. The full procedure to run the application can be read in the README.md file.
Last synced: 26 Jan 2025
https://github.com/same-ou/spark-hdfs-ml
Spark and HDFS cluster using Docker and Docker Compose
Last synced: 25 Dec 2024
https://github.com/exasol/spark-connector-common-java
Common library for Exasol Apache Spark based connectors
apache-spark exasol exasol-integration spark streaming
Last synced: 09 Feb 2025
https://github.com/fishercoder1534/hbaseexample
aws-s3 cluster hadoop-mapreduce hbase hive spark sparkjava
Last synced: 20 Jan 2025
https://github.com/thdaraujo/cheat
A handful of cheatsheets and programming tips.
bash cheat-sheets cheatsheet dms hadoop postgresql spark sqoop
Last synced: 24 Jan 2025
https://github.com/manuparra/clustering-openstack
Make a dynamic and customizable cluster with OpenStack
cluster deployment hadoop openstack openstack-command script slave-nodes spark
Last synced: 27 Dec 2024
https://github.com/vitalibo/distributed-heatmap-service
Simple distributed heatmap service on top of Apache HBase
aws hbase hbase-coprocessor heatmap spark spark-sql spring-boot
Last synced: 27 Dec 2024
https://github.com/ishaansathaye/csc369-introdistributedcomputing
Cal Poly Fall 2024 CSC 369 Intro to Distributed Computing
distributed-computing hadoop java map-reduce scala spark
Last synced: 09 Feb 2025
https://github.com/soumyadipta2020/sparkr_test
Sample Codes of Spark using R programming
r r-coding r-programming r-programming-language spark sparkr
Last synced: 05 Jan 2025
https://github.com/beiyuouo/mi-store-log-analysis
👨🦽 Mock Xiaomi Store: big data e-commerce log analysis
flask full-stack java kafka python spark
Last synced: 02 Feb 2025
https://github.com/iamhatesz/dend-covid19
Capstone project from Udacity's Data Engineer Nanodegree program.
airflow aws redshift spark udacity udacity-data-engineer-nanodegree udacity-nanodegree
Last synced: 13 Jan 2025
https://github.com/dodat-12/airflow-spark-job
A workspace to experiment with Apache Spark and Airflow in a Docker environment
Last synced: 20 Dec 2024
https://github.com/daixinye/zjucst
Assignments & experiments
blockchain cpp hadoop iot object-oriented spark
Last synced: 20 Jan 2025
https://github.com/tonyz0x0/parallel-ml
An implementation of parallel machine learning algorithms using Spark
Last synced: 02 Feb 2025
https://github.com/ev2900/iceberg_emr_athena
Resources from a virtual tech talk / workshop - Set Up and Use Apache Iceberg Tables on Your Data Lake
apache-iceberg athena aws emr spark
Last synced: 05 Nov 2024
https://github.com/ashbyt/scala-spark
Ashley Bythell - Spark/Scala code
kmeans-clustering rdd scala spark spark-sql spark-streaming streaming
Last synced: 05 Jan 2025
https://github.com/ev2900/emr_studio_deployment
Example Jupyter notebook for EMR Studio
Last synced: 05 Nov 2024
https://github.com/e2fyi/databricks-utils
`databricks-utils` is a Python package that provides several utility classes and functions that improve ease of use in Databricks notebooks.
aws databricks jupyter-notebooks notebook pyspark s3 spark vega vega-lite
Last synced: 16 Jan 2025
https://github.com/bytemedirk/pyspark3-docker
PySpark3 Docker container for testing & development. With OpenJDK, Spark 3.1.2, and Hadoop 2.7.
aws docker docker-image python spark
Last synced: 13 Jan 2025
https://github.com/facaiy/spark-for-the-impatient
A collection of short code snippets for impatient readers who want to start using Spark right away.
Last synced: 20 Jan 2025
https://github.com/bnvulpe/paperslab
The project aims to automate content classification and knowledge retrieval, and to analyze the temporal and thematic impact on research over a time period. It also contemplates letting users perform network analysis of communication within the community.
api-extractor big-data big-data-and-ml big-data-infrastructure docker elasticsearch etl-pipeline information-retrieval knowledge-discovery mysql neo4j network-analysis spark temporal-analysis
Last synced: 09 Feb 2025
https://github.com/tuancamtbtx/java-spark-example
Spark ETL Generic Processor
Last synced: 02 Jan 2025
https://github.com/tallamjr/epfl-functional-scala
Materials and worked assignments for Functional Programming with Scala Specialization on Coursera
Last synced: 10 Feb 2025
https://github.com/tallamjr/jetspark
Spark cluster on Jetson TX2 mini-project
Last synced: 10 Feb 2025
https://github.com/shayartt/streaming-orders
Project to stream real-time orders and apply some ETL pipelines & analytics using Databricks, Kafka, AWS
databricks etl kafka python spark spark-streaming
Last synced: 12 Oct 2024
https://github.com/drsnowbird/nlp-deeplearning-projects
NLP Deep Learning Projects (Warning - Not ready for public consumption yet!)
chatbot deep-learning mallet nlp python3 rasa-core rasa-nlu spark tensorflow
Last synced: 13 Jan 2025
https://github.com/marcorfilacarreras/matemaquest
A simple API to get information of the "Pruebas Canguro" exams
api docker github-actions java math mathematics spark
Last synced: 13 Jan 2025
https://github.com/rishav273/spark-cluster-multi-node-setup
Quickly set up and simulate a multi-node Spark cluster using Docker and Docker Compose.
docker docker-compose pyspark python3 spark
Last synced: 11 Oct 2024
https://github.com/f-lab-edu/league-of-legends-data-solution
Modeled on 'League of Legends', this project provides a data solution that keeps data flowing smoothly in real time through an API that generates player behavior events.
Last synced: 11 Oct 2024
https://github.com/tuancamtbtx/spark-build-tool
Generate Spark Job From This Tool
Last synced: 11 Oct 2024
https://github.com/fsanaulla/spark-http-rdd
RDD primitive for fetching data from an HTTP source
Last synced: 12 Oct 2024
https://github.com/positlabs/spark-picker-animations
Animated Native UI Picker Icons in Spark AR
augmented-reality instagram spark spark-ar
Last synced: 02 Feb 2025
https://github.com/najuzilu/dl-spark
Building a Data Lake with Spark
aws-emr aws-s3 data-engineering data-lake etl-pipeline spark
Last synced: 26 Jan 2025
https://github.com/pedropark99/spark_map
Easily apply a function over multiple columns of a Spark DataFrame
Last synced: 28 Nov 2024
https://github.com/nicklitwinow/hse-python-capstone-project
This project is a comprehensive data engineering and analytics solution built using modern technologies such as Airflow, Spark, PostgreSQL, MySQL, Kafka, and Docker. It orchestrates data ingestion, processing, replication, streaming, and analytics across multiple containers.
airflow analytics dataengineering docker etl kafka mysql postgresql python spark streaming
Last synced: 03 Feb 2025
https://github.com/yjham2002/tcp_conn_with_spark
:book: none
mysql protocol redis redis-client spark tcp tcp-server
Last synced: 06 Jan 2025
https://github.com/nkdwon/crud-spark
A CRUD application written in Java with PostgreSQL integration and the Spark framework, using the Eclipse environment
eclipse-ide git java maven pgadmin4 postgresql spark
Last synced: 06 Jan 2025
https://github.com/s8sg/spark-standalone-cluster
Spark Standalone Cluster With Zookeeper
docker docker-compose spark zookeeper
Last synced: 01 Feb 2025
https://github.com/zncdatadev/spark-k8s-operator
Operator for Apache Spark-on-Kubernetes of the Kubernetes Data Stack
Last synced: 19 Nov 2024
https://github.com/starhe/balm
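A big data PaaS component adapter built on the full Spring Boot stack. It adapts with one click to DolphinScheduler, Hadoop, Spark, Hive, Impala, HBase, Kafka, StarRocks, ClickHouse, and Neo4j, is operated through standard REST interfaces, and is simple to use and convenient for secondary development and integration.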
clickhouse dolphinscheduler hadoop hbase hive impala kafka neo4j spark spring starrocks
Last synced: 21 Dec 2024
https://github.com/mukjepscarlet/bilibili-predict-recommend
[Big data course assignment] Bilibili assistant: video recommendation + popularity prediction
bilibili flask hadoop html javascript prediction pyspark python recommendation spark
Last synced: 18 Jan 2025
https://github.com/pzim-devdata/data-developer
All my DATA developer projects
correlation data-analysis data-mining data-science data-visualization database folium folium-maps mongodb mysql python spark sql
Last synced: 07 Feb 2025
https://github.com/sunsided/spark-atlas
Spark vs. MongoDB Atlas
data-processing docker jupyter-notebook mongodb mongodb-atlas pyspark python spark
Last synced: 20 Dec 2024
https://github.com/jimthompson5802/datascience_containers
Personal docker images for various data science software stacks
data-science docker h2oai jupyter-notebook kubernetes python rstudio-servers spark
Last synced: 29 Dec 2024
https://github.com/hexnn/balm
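A big data PaaS component adapter built on the full Spring Boot stack. It adapts with one click to DolphinScheduler, Hadoop, Spark, Hive, Impala, HBase, Kafka, StarRocks, ClickHouse, Neo4j, Redis, and ElasticSearch, is operated through standard REST interfaces and SQL statements, and is simple to use and convenient for secondary development and rapid integration.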
clickhouse datax dolphinscheduler elasticsearch hadoop hbase hive impala kafka maxcompute neo4j phoenix presto spark starrocks
Last synced: 21 Dec 2024
https://github.com/fanqingsong/machine_learning_system_on_spark
A simple machine learning system demo (clustering and prediction on the iris data) for ML study. Based on the machine_learning_system repo, it adds a new process for ML model serving with Celery and Spark.
celery django machine-learning reactjs spark
Last synced: 21 Dec 2024
https://github.com/silvanheller/parquet-demo
Parquet demo project for the Workshop in the Course DIS. Benchmarks Parquet versus ORC, JSON and CSV
benchmark orc parquet r scala spark university-project
Last synced: 27 Jan 2025
https://github.com/nthaihoc/segmentation-customer-hadoop-spark-mlops-icta-2024
An automatic machine learning based customer segmentation model with RFM analysis at ICTA conference 2024
dbscan-clustering-algorithm dvc-pipeline feature-engineering hadoop k-means-clustering machine-learning mlops-workflow spark
Last synced: 21 Jan 2025
https://github.com/tianzhipeng-git/wdsdatasource
WdsDataSource is a Spark data source implementation that allows reading and writing data in WebDataset format
Last synced: 21 Jan 2025
https://github.com/tianzonglin/bigeyes
A distributed graph computing platform that enables simple visual analysis of large-scale relational data.
canvas distributed-computing graph-drawing spark websocket
Last synced: 30 Dec 2024
https://github.com/tomfran/lastfm-users-analysis
Last.fm user data collection and analysis using Spark
Last synced: 06 Jan 2025
https://github.com/simbafl/spark-branch-2.4
Source code analysis of Spark 2.4
spark spark-mllib spark-sql spark-streaming sparksql
Last synced: 25 Dec 2024
https://github.com/omr5221/kafka-account-fraud-detector
Learning about Kafka and Spark with a project built off of an existing project
Last synced: 27 Jan 2025
https://github.com/elahe-dastan/jaraghe
How I learned Spark
count dataframe dataset spark transformations
Last synced: 14 Jan 2025
https://github.com/hpgrahsl/gab2016streamanalytics
Repository with materials for my Session at Global Azure Bootcamp 2016
azure bootcamp spark storm streamanalytics
Last synced: 08 Jan 2025
https://github.com/librity/rtjvm_spark_essentials
Rock The JVM - Apache Spark Essentials
apache-spark big-data docker scala spark spark-sql
Last synced: 08 Jan 2025
https://github.com/oracle-quickstart/oci-hortonworks
Terraform module to deploy Hortonworks on Oracle Cloud Infrastructure (OCI)
cloud hadoop hdf hdp hortonworks oci oracle partner-led spark terraform
Last synced: 07 Nov 2024
https://github.com/mileszim/ember-particle
Ember service for the Particle API
api ember ember-addon ember-cli-addon iot particle particle-io spark
Last synced: 27 Jan 2025
https://github.com/easonlai/sas_access_to_adls_databricks
Using SAS to authenticate and access ADLS Gen 2 from Azure Databricks
adlsgen2 azure azuredatabricks blob-storage blob-storage-account blobstorage data-analysis-python data-analytics databricks shared-access-signature spark
Last synced: 08 Jan 2025
https://github.com/sebastianhaeni/spark-zeppelin-docker
Docker files to run Spark and Zeppelin
Last synced: 14 Jan 2025
https://github.com/cbhihe/mesos_on_docker
Benchmark of CPU- and I/O-intensive operations for Mesos on Docker with Spark
benchmarking docker mapreduce mesos spark
Last synced: 02 Jan 2025