Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
![](https://explore-feed.github.com/topics/spark/spark.png)
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-02-11 00:28:31 UTC
https://github.com/abtinz/cloud-computing
cassandra cassandra-driver cloud-computing docker elasticsearch hadoop hdfs kubernetes redis spark
Last synced: 21 Jan 2025
https://github.com/danimonsalve/scala_spark
Scala application that uses Apache Spark to classify job postings by the programming languages they mention. The goal is to demonstrate different classification and data processing techniques on large volumes of data.
Last synced: 17 Jan 2025
https://github.com/m-molaei/twitter-sentiment-analysis-using-apache-spark-
Sentiment analysis using deep learning models and FastText embedding on Apache Spark
apache-cassandra apache-spark big-data fasttext fasttext-embeddings mongodb pyspark rdd sentiment-analysis sentiment140-dataset spark
Last synced: 21 Jan 2025
https://github.com/vubacktracking/stream-data-processing
Streaming data processing pipeline using Spark, PostgreSQL, Debezium, Kafka, Minio, Delta Lake, Trino and DBeaver
dbeaver debezium delta-lake kafka spark spark-streaming stream-processing trino
Last synced: 17 Jan 2025
https://github.com/dohabanoui/spark-structured-streaming
Real-time analysis of hospital incident data using Apache Spark Streaming to track incidents by service and identify the top years with the most incidents.
docker spark spark-streaming spark-structured-streaming
Last synced: 19 Jan 2025
https://github.com/chen0040/vagrant-big-data
Vagrantfiles for development in big data
cassandra elasticsearc hdfs kafka mesos redis spark storm vagrantfile zookeeper
Last synced: 09 Feb 2025
https://github.com/ewertondrigues02/engenharia-de-dados
Various data engineering projects using major tools such as Airflow, Snowflake, dbt, Postgres, Looker Studio, and Power BI
airflow analise-exploratoria analytics aws-ec2 dados data dbt-cloud engenharia-de-dados looker-studio postgres pyspark python3 snowflake spark
Last synced: 19 Jan 2025
https://github.com/chukwuemekaaham/uber-gcp-etl-project
Data Engineering Zoomcamp Final Project
bigquery cloud-storage csv docker-compose gcp jupyter-notebook looker-studio mageai python spark spreadsheets terraform
Last synced: 10 Jan 2025
https://github.com/angeligareta/spark-hadoop-hbase-overview
First lab for Data-Intensive Computing course at KTH where we are introduced to Apache Spark MLlib and Spark SQL, Hadoop, and HBase.
apache-spark data-intensive hadoop hbase hbase-table id2221 kth scala spark spark-mllib spark-sql
Last synced: 22 Jan 2025
https://github.com/angeligareta/spark-kafka-cassandra-overview
Second lab for Data-Intensive Computing course at KTH where we use Apache Kafka, Spark, and Cassandra to practice stream processing.
apache-kafka apache-spark cassandra cassandra-server data-intensive id2221 kafka kafka-topic kth scala spark stream-processing
Last synced: 22 Jan 2025
https://github.com/chukwuemekaaham/data-engineering-zoomcamp
Datatalks Club Free Data Engineering Zoomcamp Project
bigquery dbt docker-compose duckdb gcp gcp-cloud-storage github-actions jupyter-notebook kafka linux looker-studio mageai pandas postgresql prefect python redpanda risingwave spark terraform
Last synced: 17 Jan 2025
https://github.com/angeligareta/machine-learning-spark
Assignment for Scalable Machine Learning which aims to study the basics of regression and classification in Spark.
apache-spark machine-learning scala spark spark-classification spark-ml spark-mllib spark-regression spark-scala
Last synced: 22 Jan 2025
https://github.com/tomwhite/single-cell-spark-demo
Experiments on Single Cell data from 10x Genomics using Apache Spark.
bioinformatics genomics single-cell spark
Last synced: 17 Jan 2025
https://github.com/anant/example-cassandra-spark-sql
Cassandra Data Operations with Spark SQL
cassandra data-operations docker etl spark spark-sql
Last synced: 19 Jan 2025
https://github.com/anant/example-sql-on-cassandra-with-open-source-notebooks
Files to follow along with the Open Source Notebooks and Cassandra Webinar (see README.md)
cassandra datastax datastax-studio jupyter jupyter-notebook nosql notebooks quix spark sql
Last synced: 19 Jan 2025
https://github.com/brooksian/censussipp
Reproducing Census SIPP Reports Using Apache Spark
spark sparksql zeppelin-notebook
Last synced: 19 Jan 2025
https://github.com/sebastianruizm/pyspark-graphframes
Data analysis with GraphFrames and PySpark
Last synced: 08 Jan 2025
https://github.com/bluegranite/azure-synapse-vcf-analysis
Sample code for analyzing VCF files (converted to Parquet) in Azure Databricks and Synapse.
azure azure-databricks azure-synapse bioinformatics computational-biology databricks genomics glow parquet spark synapse vcf
Last synced: 19 Jan 2025
https://github.com/adampaternostro/azure-spark-livy-application-insights-external-dependency
Use Spark with Livy along with Application Insights. Learn to host your external dependencies in a data lake.
application-insights azure azure-data-lake hdinsight java livy spark spark2
Last synced: 31 Jan 2025
https://github.com/s8sg/spark-standalone-cluster
Spark Standalone Cluster With Zookeeper
docker docker-compose spark zookeeper
Last synced: 01 Feb 2025
https://github.com/coreyauger/ashley-madison-spark
Spark data analysis for the Ashley Madison dataset.
Last synced: 16 Jan 2025
https://github.com/wadiebenabdouh/socialmedia-usage-pipeline
Data from Kaggle covering a wide range of users of different ages, genders, and interests.
apache-spark data-visualization jupyter-notebook kaggle pyspark python spark
Last synced: 16 Jan 2025
https://github.com/jimthompson5802/datascience_containers
Personal docker images for various data science software stacks
data-science docker h2oai jupyter-notebook kubernetes python rstudio-servers spark
Last synced: 29 Dec 2024
https://github.com/mohnoor94/learningspark
My journey to learn Spark using Scala <3
learning learning-by-doing scala spark sparkscala
Last synced: 22 Jan 2025
https://github.com/shreyas-gopalakrishna/datacenter-scale-computing
big-data docker flask hadoop kubernetes rabbitmq redis spark
Last synced: 20 Jan 2025
https://github.com/tallamjr/epfl-functional-scala
Materials and worked assignments for Functional Programming with Scala Specialization on Coursera
Last synced: 10 Feb 2025
https://github.com/tuancamtbtx/java-spark-example
Spark ETL Generic Processor
Last synced: 02 Jan 2025
https://github.com/azlinrusnan/iris_pyspark_analysis
Iris Classification using PySpark
apache pyspark-mllib python r spark
Last synced: 31 Dec 2024
https://github.com/kingyiusuen/udacity-data-engineering-nanodegree
Projects for Udacity's Data Engineering Nanodegree
airflow aws aws-athena aws-glue aws-redshift aws-s3 cassandra data-engineering spark
Last synced: 23 Jan 2025
https://github.com/antonio-f/big-data-analysis-with-scala-and-spark
Coding assignments from the course "Big Data Analysis with Scala and Spark" (Coursera).
big-data bigdata coursera data-analysis scala spark
Last synced: 06 Feb 2025
https://github.com/bnvulpe/paperslab
The project aims to automate content classification and knowledge retrieval, and to analyze the temporal and thematic impact on research over a time period. It also contemplates letting users perform network analysis to study communication within the community.
api-extractor big-data big-data-and-ml big-data-infrastructure docker elasticsearch etl-pipeline information-retrieval knowledge-discovery mysql neo4j network-analysis spark temporal-analysis
Last synced: 09 Feb 2025
https://github.com/talmago/pyspark-loglikelihood
PySpark Loglikelihood Similarity Examples
mahout pyspark recommendation-engine spark
Last synced: 03 Feb 2025
https://github.com/renardeinside/databricks-jobs-jsonnet
Example project with Databricks jobs and configuration management via jsonnet
Last synced: 06 Feb 2025
https://github.com/benitomartin/de-hotel-reviews
Data Engineering Hotel Reviews
cicd data-engineering dbt gcp jupyter-notebook looker prefect python spark sql terraform
Last synced: 31 Dec 2024
https://github.com/mauriciovazquezm/spark_bigdata_architecture_project
Final project for the course 'Architecture for Large Data Volumes', taught in the Bachelor's program in Data Science at ITAM
data-stream-processing data-streaming pyspark python spark time-series
Last synced: 13 Jan 2025
https://github.com/ralgond/bigdata-example
Examples, details, and caveats for Hadoop, Hive, and Spark
bigdata hadoop hdfs hive map-reduce spark
Last synced: 09 Jan 2025
https://github.com/facaiy/spark-for-the-impatient
A collection of short code snippets for impatient readers who want to start using Spark right away.
Last synced: 20 Jan 2025
https://github.com/imvision12/real-time-tracking
Real-time bus tracking using the MTA bus API
flask hadoop javascript leaflet python spark
Last synced: 08 Feb 2025
https://github.com/tadod12/airflow-spark-job
A workspace to experiment with Apache Spark and Airflow in a Docker environment
Last synced: 13 Jan 2025
https://github.com/pranavshashidhara/movie-recommendation-system
This project focuses on developing a recommendation system utilizing various learning techniques, including collaborative filtering, matrix factorization, and restricted Boltzmann machines (RBMs).
big-data recommendation-system spark
Last synced: 13 Jan 2025
https://github.com/declaredata/fuse_python
PySpark-compatible Python client for DeclareData Fuse Server: a blazing fast data processing engine and drop-in alternative to Spark clusters.
data-processing pyspark rust-lang spark
Last synced: 13 Jan 2025
https://github.com/firefly55lm/superconductors_critical_temperature_analysis
Academic project for Big Data Laboratory
chemistry docker machine-learning physics pyspark spark
Last synced: 25 Jan 2025
https://github.com/naramsim/dynamic-twitter-geographical-categorization
A map-reduce implementation for the categorization of Twitter tweets within dynamic geographical boundaries.
Last synced: 26 Dec 2024
https://github.com/ac-gomes/systemctl_spark_jupyter-notebook
systemctl for Spark and Jupyter-notebook
jupyter-notebook spark systemctl systemd
Last synced: 02 Jan 2025
https://github.com/euiyounghwang/spark_job_interface_service
spark_job_interface_service
fastapi spark spark-cluster spark-jobs
Last synced: 17 Jan 2025
https://github.com/code-help-tutor/spark-assignment
Spark assignment ghostwriting and programming tutoring, code help, CS tutor, WeChat: cstutorcs
Last synced: 17 Jan 2025
https://github.com/matz1979/spark-etl-pipelines
My final big data project, built with Spark
bigdata datalake etl-pipeline python spark
Last synced: 10 Jan 2025
https://github.com/hienduyph/hienph.dev
My Notes
airflow big-data data-engineering spark
Last synced: 03 Jan 2025
https://github.com/tomwhite/sparklyr-mini-regression
machine-learning regression spark sparklyr-extension
Last synced: 17 Jan 2025
https://github.com/ineerav/tfidf-map-reduce
Running TF-IDF with Spark Streaming on Hillary Clinton's infamous leaked email data set: https://www.kaggle.com/datasets/kaggle/hillary-clinton-emails
aws emr maven pig-latin shell spark spring-boot tf-idf
Last synced: 17 Jan 2025
https://github.com/ineerav/eda-spark-elasticsearch
Data analysis using PySpark, Spark Streaming, Apache Hive, an AWS Elastic MapReduce cluster, and Elasticsearch dashboard hosting with Google Cloud Storage service connectors
aws aws-glue cloudformation elasticsearch elasticsearch-client emr ethena gcp hive pig python spark spark-streaming
Last synced: 17 Jan 2025
https://github.com/manuel-lang/data-lake-with-spark
Project Data Lake as part of Udacity's Data Engineering Nanodegree
data-engineering data-lake etl-pipeline s3 spark udacity udacity-data-engineer-nanodegree
Last synced: 12 Jan 2025
https://github.com/geloodev/rpg-character-sheet-old
(OLD) RPG Character Sheet made with Java, Spark and Hibernate.
character-sheet hibernate-orm java rpg spark
Last synced: 03 Jan 2025
https://github.com/chimera-suite/thriftserver
Apache Thrift Server exposes a SparkSQL JDBC/ODBC endpoint.
jdbc spark sparksql sql thrift-server
Last synced: 03 Jan 2025
https://github.com/mightypixel/mightylab
A collection of small projects in the field of data science.
concept data-science machine-learning python spark study
Last synced: 23 Jan 2025
https://github.com/jedirhymetrix/cosc-6339-hw3
amazon-reviews attention-mechanism bert bidirectional-lstm big-data cla deep-learning distilbert-model fine-tuning-bert lstm natural-language-processing nlp pyspark python sentiment-classification spark spark-ml transfer-learning transformers word2vec
Last synced: 04 Jan 2025
https://github.com/diegoribeiro2/analise_de_transcoes_pix_para_deteccao_de_fraudes_com_pyspark-_big_data
Practical case study analyzing PIX transactions to detect fraud, from collecting and understanding the data through to modeling a fraud detection algorithm.
Last synced: 04 Jan 2025
https://github.com/akaliutau/k8s-spark-operator
Harnessing the Spark Operator in a K8s cluster
docker helm-charts kuberentes spark spark-operator
Last synced: 11 Jan 2025
https://github.com/pierrekieffer/genericunsupervisedmachinelearning
Generic Clustering algorithm for Apache Spark deployment
kmeans machine-learning mllib silhouette spark
Last synced: 07 Feb 2025
https://github.com/lupusruber/rnmp_homework2
A recommendation system project that trains and evaluates Spark MLlib's ALS model on the MovieLens dataset. Includes Dockerized setup, hyperparameter tuning, and evaluation metrics (RMSE, Precision@K, Recall@K, NDCG) for performance insights.
docker mllib recommender-system spark
Last synced: 09 Feb 2025
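Three of the evaluation metrics named in the entry above (RMSE, Precision@K, Recall@K) can be sketched in plain Python as follows. This is only an illustration of the standard definitions, not code from the listed repository; the function names and the choice to divide Precision@K by `k` are assumptions made for this sketch.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length lists of ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommended items that are relevant (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def recall_at_k(recommended, relevant, k):
    """Share of all relevant items that appear in the top-k recommendations."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)
```

For example, with a hypothetical ranked list `["a", "b", "c", "d"]` and relevant set `{"a", "c", "x"}`, `precision_at_k(..., k=2)` counts one hit in two slots, and `recall_at_k(..., k=2)` counts one hit out of three relevant items.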
https://github.com/danieldacosta/etl-spark-stepfunctions
ETL pipeline using Spark on an EMR cluster and Step Functions for orchestration.
aws aws-step-functions etl spark
Last synced: 11 Jan 2025
https://github.com/danieldacosta/etl-spark-parallel-stepfunctions
Execute EMR Jobs in parallel
Last synced: 11 Jan 2025
https://github.com/ophiase/big-data-project-ifeby310
Analysis website for the New York shared bike system (Citibikes 🚲️) dataset. Extract, Load, Transform using PySpark, with data in Parquet format.
Last synced: 19 Jan 2025
https://github.com/kometen/parsexml
Run Databricks XML-parser from command line.
databricks sbt scala spark xml-parser
Last synced: 11 Jan 2025
https://github.com/lawal-hash/dataeng
dbt docker docker-compose kafka marge postgresql spark terraform
Last synced: 11 Jan 2025
https://github.com/rongfengliang/spark-k8s-deploy
spark-k8s-deploy
big-data docker kubernetes spark
Last synced: 11 Jan 2025
https://github.com/jeet1995/spark-stock-trading-simulator
This project uses the Apache Spark framework to simulate possible profits and losses for a given portfolio and an initial investment value. The investment pattern on this portfolio is randomized with the help of Monte Carlo simulations.
Last synced: 04 Jan 2025
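The idea of a Monte Carlo profit-and-loss simulation mentioned in the entry above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the listed repository's method: it randomizes daily returns from a normal distribution rather than the investment pattern, it does not use Spark, and all names and parameters (`simulate_final_values`, `mean_return`, `volatility`) are hypothetical.

```python
import random


def simulate_final_values(initial_investment, mean_return, volatility,
                          n_days, n_trials, seed=0):
    """Run n_trials independent random walks of a portfolio value.

    Each day the value is multiplied by (1 + r), where r is drawn from a
    normal distribution (an assumption made for this sketch).
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    finals = []
    for _ in range(n_trials):
        value = initial_investment
        for _ in range(n_days):
            value *= 1.0 + rng.gauss(mean_return, volatility)
        finals.append(value)
    return finals


def summarize(finals, initial_investment):
    """Report mean, 5th-percentile, and best profit/loss across trials."""
    ordered = sorted(finals)
    n = len(ordered)
    return {
        "mean_pnl": sum(ordered) / n - initial_investment,
        "worst_5pct_pnl": ordered[int(0.05 * n)] - initial_investment,
        "best_pnl": ordered[-1] - initial_investment,
    }
```

Because the trials are independent of each other, this kind of loop is the sort of workload that can be spread across a cluster, for instance by distributing trial indices.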
https://github.com/sjtufl/entroanomaly
Mining anomalies using traffic feature distributions
Last synced: 18 Jan 2025
https://github.com/latiefdatavisionary/datasea-spark-itb-2025
data-science-competition datathon itb spark
Last synced: 24 Jan 2025
https://github.com/yasarsultan/taxi-trip-analysis
The NYC Taxi Trip Batch Data Pipeline automates processing of large-scale trip data using Apache Spark and Airflow, integrating AWS S3 and Google BigQuery for storage and analytics. It features scalable, containerized workflows with robust data validation.
airflow aws-s3 bash-script batch-processing bigquery data-lake data-warehouse docker python3 spark
Last synced: 11 Jan 2025
https://github.com/dueyfinster/pluralsight
Course Examples from Pluralsight
java kafka kubernetes python spark
Last synced: 12 Jan 2025
https://github.com/dougdss89/wideworldadventure
This repository includes all the files that make up the design and unification of the AdventureWorks and WideWorldAdventure project databases.
bigdata databricks datalake datawarehouse dbt deltalake duckdb elt etl etl-pipeline spark
Last synced: 30 Jan 2025
https://github.com/ynazymko12/goit-de-hw-03
Homework for Data Engineering course
Last synced: 30 Jan 2025
https://github.com/s-yazhini/pyspark-and-sparksql
In Azure Databricks
azure-databricks cluster-analysis pyspark spark spark-sql
Last synced: 30 Jan 2025
https://github.com/riccardorevalor/spark
Spark exercises
pyspark spark spark-rdd spark-sql
Last synced: 30 Jan 2025
https://github.com/abdelhaqs/pyspark_advanced_dataframe_concepts
This project provides a Docker-based setup to explore advanced PySpark DataFrame concepts using Jupyter notebooks. The environment includes all necessary dependencies, making it easy to get started with PySpark for data processing and analysis.
Last synced: 30 Jan 2025
https://github.com/denisogr/kaggle-notebook-to-production
This is a study project. I take analytics/ML examples from Kaggle and re-implement them using different technologies.
bigquery data-engineering gcp kaggle-competition kaggle-dataset python spark
Last synced: 12 Jan 2025
https://github.com/rupeshtr78/mqttspark
IOT Device MQTT Spark Streaming
cassandra gpio iot mqtt mqtt-broker mqtt-client raspberry-pi spark spark-streaming yarn
Last synced: 12 Jan 2025
https://github.com/rupeshtr78/machine_learning
Machine Learning TensorFlow Neural Networks Deep Learning
classification data-analysis deep-learning deep-neural-networks flink jupyter-notebook keras machine-learning machinelearning-python perceptron python3 spark tensorflow
Last synced: 12 Jan 2025
https://github.com/rupeshtr78/bigdata
elasticsearch hive hue kafka kafka-streams presto presto-cassandra-hive spark spark-hdfs-hive spark-streaming
Last synced: 12 Jan 2025
https://github.com/rupeshtr78/blog
Big Data Spark Hadoop Kafka Flink Spark Streaming
aws bigdata cassandra elasticsearch emr-cluster flink hadoop hive hue kafka mapreduce mongodb oozie spark sparkstreaming yarn
Last synced: 12 Jan 2025
https://github.com/margaretkhendre/home-sales-vs-big-data
In this repository, Google Colab is paired with SparkSQL to determine key metrics about home sales data. Spark is also used to create temporary views, partition data, and cache/uncache a temporary table in the process.
big-data googlecollab ipynb-jupyter-notebook pyspark spark sparksql sql
Last synced: 12 Jan 2025
https://github.com/isabeljohnson001/yelp-customer-reviews-sentiment-analysis-data-infra
Yelp_Customer_Reviews_Sentiment_Analysis
docker elasticsearch kafka kibana mongodb nosql-database python spark
Last synced: 12 Jan 2025
https://github.com/tim6her/kaggle_disaster_detection
My submission to the Advanced Data Science Capstone course by IBM, hosted on Coursera
kaggle-competition nlp-machine-learning python3 spark
Last synced: 20 Jan 2025