Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

GitHub: https://github.com/topics/spark
Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
Repo: https://github.com/apache/spark
Created by: Matei Zaharia
Released: May 26, 2014
Related Topics: scala, hadoop
Aliases: apache-spark
Last updated: 2025-02-11 00:28:31 UTC

https://github.com/abtinz/cloud-computing

cassandra cassandra-driver cloud-computing docker elasticsearch hadoop hdfs kubernetes redis spark

Last synced: 21 Jan 2025

https://github.com/danimonsalve/scala_spark

Scala application that uses Apache Spark to classify job postings by the programming languages they mention. The goal is to demonstrate different techniques for classifying and processing large volumes of data.

rdd scala spark

Last synced: 17 Jan 2025

https://github.com/vubacktracking/hdfs-stream-processing

Streaming data processing using Hadoop HDFS, Spark, Kafka, Minio, Elasticsearch

airflow elastic hadoop hdfs kafka kibana minio spark

Last synced: 11 Oct 2024

https://github.com/m-molaei/twitter-sentiment-analysis-using-apache-spark-

Sentiment analysis using deep learning models and FastText embedding on Apache Spark

apache-cassandra apache-spark big-data fasttext fasttext-embeddings mongodb pyspark rdd sentiment-analysis sentiment140-dataset spark

Last synced: 21 Jan 2025

https://github.com/kemalcanbora/foo_scientist

big-data json pyspark python sh shell spark

Last synced: 24 Jan 2025

https://github.com/morinian/pyspark

Studies with PySpark

spark

Last synced: 18 Jan 2025

https://github.com/adrianmarino/spark-examples

Spark install & examples

notebooks python spark

Last synced: 24 Jan 2025

https://github.com/vubacktracking/stream-data-processing

Streaming data processing pipeline using Spark, PostgreSQL, Debezium, Kafka, Minio, Delta Lake, Trino and DBeaver

dbeaver debezium delta-lake kafka spark spark-streaming stream-processing trino

Last synced: 17 Jan 2025

https://github.com/dohabanoui/spark-structured-streaming

Real-time analysis of hospital incident data using Apache Spark Streaming to track incidents by service and identify the top years with the most incidents.

docker spark spark-streaming spark-structured-streaming

Last synced: 19 Jan 2025

https://github.com/chen0040/vagrant-big-data

Vagrantfiles for development in big data

cassandra elasticsearc hdfs kafka mesos redis spark storm vagrantfile zookeeper

Last synced: 09 Feb 2025

https://github.com/ewertondrigues02/engenharia-de-dados

Various data engineering projects using major tools such as Airflow, Snowflake, dbt, Postgres, Looker Studio, and Power BI

airflow analise-exploratoria analytics aws-ec2 dados data dbt-cloud engenharia-de-dados looker-studio postgres pyspark python3 snowflake spark

Last synced: 19 Jan 2025

https://github.com/chukwuemekaaham/uber-gcp-etl-project

Data Engineering Zoomcamp Final Project

bigquery cloud-storage csv docker-compose gcp jupyter-notebook looker-studio mageai python spark spreadsheets terraform

Last synced: 10 Jan 2025

https://github.com/angeligareta/spark-hadoop-hbase-overview

First lab for Data-Intensive Computing course at KTH where we are introduced to Apache Spark MLlib and Spark SQL, Hadoop, and HBase.

apache-spark data-intensive hadoop hbase hbase-table id2221 kth scala spark spark-mllib spark-sql

Last synced: 22 Jan 2025

https://github.com/angeligareta/spark-kafka-cassandra-overview

Second lab for Data-Intensive Computing course at KTH where we use Apache Kafka, Spark, and Cassandra to practice stream processing.

apache-kafka apache-spark cassandra cassandra-server data-intensive id2221 kafka kafka-topic kth scala spark stream-processing

Last synced: 22 Jan 2025

https://github.com/chukwuemekaaham/data-engineering-zoomcamp

Datatalks Club Free Data Engineering Zoomcamp Project

bigquery dbt docker-compose duckdb gcp gcp-cloud-storage github-actions jupyter-notebook kafka linux looker-studio mageai pandas postgresql prefect python redpanda risingwave spark terraform

Last synced: 17 Jan 2025

https://github.com/angeligareta/machine-learning-spark

Assignment for Scalable Machine Learning which aims to study the basics of regression and classification in Spark.

apache-spark machine-learning scala spark spark-classification spark-ml spark-mllib spark-regression spark-scala

Last synced: 22 Jan 2025

https://github.com/tomwhite/single-cell-spark-demo

Experiments on Single Cell data from 10x Genomics using Apache Spark.

bioinformatics genomics single-cell spark

Last synced: 17 Jan 2025

https://github.com/anant/playbook

Anant Platform Playbook - Consists of principles, patterns, tools, a framework, and an approach to designing, building, and managing platforms.

approach cassandra confluent datastax framework kafka platform playbook spark

Last synced: 19 Jan 2025

https://github.com/anant/example-cassandra-spark-sql

Cassandra Data Operations with Spark SQL

cassandra data-operations docker etl spark spark-sql

Last synced: 19 Jan 2025

https://github.com/anant/example-sql-on-cassandra-with-open-source-notebooks

Files to follow along with the Open Source Notebooks and Cassandra Webinar (see README.md)

cassandra datastax datastax-studio jupyter jupyter-notebook nosql notebooks quix spark sql

Last synced: 19 Jan 2025

https://github.com/brooksian/censussipp

Reproducing Census SIPP Reports Using Apache Spark

spark sparksql zeppelin-notebook

Last synced: 19 Jan 2025

https://github.com/sebastianruizm/pyspark-graphframes

Data analysis with GraphFrames and PySpark

python spark sql

Last synced: 08 Jan 2025

https://github.com/bluegranite/azure-synapse-vcf-analysis

Sample code for analyzing VCF files (converted to Parquet) in Azure Databricks and Synapse.

azure azure-databricks azure-synapse bioinformatics computational-biology databricks genomics glow parquet spark synapse vcf

Last synced: 19 Jan 2025

https://github.com/adampaternostro/azure-spark-livy-application-insights-external-dependency

Use Spark with Livy along with Application Insights. Learn to host your external dependencies in data lake.

application-insights azure azure-data-lake hdinsight java livy spark spark2

Last synced: 31 Jan 2025

https://github.com/s8sg/spark-standalone-cluster

Spark Standalone Cluster With Zookeeper

docker docker-compose spark zookeeper

Last synced: 01 Feb 2025

https://github.com/coreyauger/ashley-madison-spark

Spark data analysis for the Ashley Madison dataset.

scala spark

Last synced: 16 Jan 2025

https://github.com/wadiebenabdouh/socialmedia-usage-pipeline

Data from Kaggle covering a wide range of users of different ages, genders, and interests.

apache-spark data-visualization jupyter-notebook kaggle pyspark python spark

Last synced: 16 Jan 2025

https://github.com/jimthompson5802/datascience_containers

Personal docker images for various data science software stacks

data-science docker h2oai jupyter-notebook kubernetes python rstudio-servers spark

Last synced: 29 Dec 2024

https://github.com/mohnoor94/learningspark

My journey to learn Spark using Scala <3

learning learning-by-doing scala spark sparkscala

Last synced: 22 Jan 2025

https://github.com/shreyas-gopalakrishna/datacenter-scale-computing

big-data docker flask hadoop kubernetes rabbitmq redis spark

Last synced: 20 Jan 2025

https://github.com/tallamjr/epfl-functional-scala

Materials and worked assignments for Functional Programming with Scala Specialization on Coursera

big-data scala spark

Last synced: 10 Feb 2025

https://github.com/tuancamtbtx/java-spark-example

Spark ETL Generic Processor

etl spark

Last synced: 02 Jan 2025

https://github.com/azlinrusnan/iris_pyspark_analysis

Iris Classification using PySpark

apache pyspark-mllib python r spark

Last synced: 31 Dec 2024

https://github.com/kingyiusuen/udacity-data-engineering-nanodegree

Projects for Udacity's Data Engineering Nanodegree

airflow aws aws-athena aws-glue aws-redshift aws-s3 cassandra data-engineering spark

Last synced: 23 Jan 2025

https://github.com/antonio-f/big-data-analysis-with-scala-and-spark

Coding assignments from the course "Big Data Analysis with Scala and Spark" (Coursera).

big-data bigdata coursera data-analysis scala spark

Last synced: 06 Feb 2025

https://github.com/cclient/spark-lda-example

lda spark

Last synced: 16 Jan 2025

https://github.com/bnvulpe/paperslab

The project aims to automate content classification and knowledge retrieval, as well as to perform analysis on the temporal and thematic impact on research over a time period. In addition, the possibility of performing network analysis to analyze communication in the community is contemplated for users.

api-extractor big-data big-data-and-ml big-data-infrastructure docker elasticsearch etl-pipeline information-retrieval knowledge-discovery mysql neo4j network-analysis spark temporal-analysis

Last synced: 09 Feb 2025

https://github.com/talmago/pyspark-loglikelihood

PySpark Loglikelihood Similarity Examples

mahout pyspark recommendation-engine spark

Last synced: 03 Feb 2025

https://github.com/thom-x/graalvm-java-docker

Example of creating a native image of a Java app with GraalVM, Gradle, and Docker

docker graalvm gradle java spark

Last synced: 25 Jan 2025

https://github.com/renardeinside/databricks-jobs-jsonnet

Example project with Databricks jobs and configuration management via jsonnet

databricks jsonnet spark

Last synced: 06 Feb 2025

https://github.com/benitomartin/ibm-advanced-ds-capstone

exploratory-data-analysis jupyter-notebook python random-forest spark

Last synced: 31 Dec 2024

https://github.com/benitomartin/de-hotel-reviews

Data Engineering Hotel Reviews

cicd data-engineering dbt gcp jupyter-notebook looker prefect python spark sql terraform

Last synced: 31 Dec 2024

https://github.com/mauriciovazquezm/spark_bigdata_architecture_project

Final project for the course 'Architecture for Large Data Volumes', taught in the Bachelor's program in Data Science at ITAM

data-stream-processing data-streaming pyspark python spark time-series

Last synced: 13 Jan 2025

https://github.com/ralgond/bigdata-example

Examples, details, and caveats for Hadoop, Hive, and Spark

bigdata hadoop hdfs hive map-reduce spark

Last synced: 09 Jan 2025

https://github.com/facaiy/spark-for-the-impatient

A collection of short code snippets for impatient readers who want to start using Spark right away.

spark spark-training tutorial

Last synced: 20 Jan 2025

https://github.com/imransilvake/semantic-partitioning

RDF Data (N-Triples) Partition and SPARQL Query Layer for SANSA-Stack using Scala and Spark.

big-data n-triples scala spark sparql

Last synced: 31 Jan 2025

https://github.com/imvision12/real-time-tracking

Real-time bus tracking using the MTA bus API

flask hadoop javascript leaflet python spark

Last synced: 08 Feb 2025

https://github.com/tadod12/airflow-spark-job

A workspace to experiment with Apache Spark and Airflow in a Docker environment

airflow docker rdbms spark

Last synced: 13 Jan 2025

https://github.com/pranavshashidhara/movie-recommendation-system

This project focuses on developing a recommendation system utilizing various learning techniques, including collaborative filtering, matrix factorization, and restricted Boltzmann machines (RBMs).

big-data recommendation-system spark

Last synced: 13 Jan 2025

https://github.com/declaredata/fuse_python

PySpark-compatible Python client for DeclareData Fuse Server: a blazing fast data processing engine and drop-in alternative to Spark clusters.

data-processing pyspark rust-lang spark

Last synced: 13 Jan 2025

https://github.com/firefly55lm/superconductors_critical_temperature_analysis

Academic project for Big Data Laboratory

chemistry docker machine-learning physics pyspark spark

Last synced: 25 Jan 2025

https://github.com/naramsim/dynamic-twitter-geographical-categorization

A map-reduce implementation for the categorization of Twitter tweets within dynamic geographical boundaries.

redis spark twitter

Last synced: 26 Dec 2024

https://github.com/ac-gomes/systemctl_spark_jupyter-notebook

systemctl for Spark and Jupyter-notebook

jupyter-notebook spark systemctl systemd

Last synced: 02 Jan 2025

https://github.com/euiyounghwang/spark_job_interface_service

spark_job_interface_service

fastapi spark spark-cluster spark-jobs

Last synced: 17 Jan 2025

https://github.com/fa3001/realtime_election_voting_system

Realtime Election Voting System.

docker kafka postgres spark streamlit

Last synced: 09 Feb 2025

https://github.com/code-help-tutor/spark-assignment

Spark assignment ghostwriting and programming tutoring, code help, CS tutor, WeChat: cstutorcs Email: [email protected]

spark

Last synced: 17 Jan 2025

https://github.com/matz1979/spark-etl-pipelines

My final big data project, built with Spark

bigdata datalake etl-pipeline python spark

Last synced: 10 Jan 2025

https://github.com/hienduyph/hienph.dev

My Notes

airflow big-data data-engineering spark

Last synced: 03 Jan 2025

https://github.com/tomwhite/sparklyr-mini-regression

machine-learning regression spark sparklyr-extension

Last synced: 17 Jan 2025

https://github.com/ineerav/tfidf-map-reduce

Running TF-IDF with Spark Streaming on Hillary Clinton's infamous leaked email dataset: https://www.kaggle.com/datasets/kaggle/hillary-clinton-emails

aws emr maven pig-latin shell spark spring-boot tf-idf

Last synced: 17 Jan 2025
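
For readers unfamiliar with the TF-IDF weighting named in the entry above, here is a minimal pure-Python sketch of the textbook definition (term frequency times the log of inverse document frequency). It does not use Spark and is not taken from the repository; the function name `tf_idf` and the sample documents are hypothetical, and libraries often apply a smoothed variant of the IDF term.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    documents: list of token lists (tokenization is assumed done already).
    Score = (term count / document length) * log(N / document frequency).
    """
    n_docs = len(documents)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical sample data: two tiny tokenized "emails".
sample_docs = [["spark", "email"], ["spark", "hive"]]
sample_scores = tf_idf(sample_docs)
```

A term that appears in every document (here `spark`) scores zero, which is the property that makes the weighting useful for surfacing distinctive words.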

https://github.com/ineerav/eda-spark-elasticsearch

Data analysis using PySpark, Spark Streaming, Apache Hive, an AWS Elastic MapReduce cluster, and Elasticsearch dashboard hosting with Google Cloud Storage service connectors

aws aws-glue cloudformation elasticsearch elasticsearch-client emr ethena gcp hive pig python spark spark-streaming

Last synced: 17 Jan 2025

https://github.com/manuel-lang/data-lake-with-spark

Project Data Lake as part of Udacity's Data Engineering Nanodegree

data-engineering data-lake etl-pipeline s3 spark udacity udacity-data-engineer-nanodegree

Last synced: 12 Jan 2025

https://github.com/geloodev/rpg-character-sheet-old

(OLD) RPG Character Sheet made with Java, Spark and Hibernate.

character-sheet hibernate-orm java rpg spark

Last synced: 03 Jan 2025

https://github.com/chimera-suite/thriftserver

Apache Thrift Server exposes a SparkSQL JDBC/ODBC endpoint.

jdbc spark sparksql sql thrift-server

Last synced: 03 Jan 2025

https://github.com/mightypixel/mightylab

A collection of small projects in the field of data science.

concept data-science machine-learning python spark study

Last synced: 23 Jan 2025

https://github.com/jedirhymetrix/cosc-6339-hw3

amazon-reviews attention-mechanism bert bidirectional-lstm big-data cla deep-learning distilbert-model fine-tuning-bert lstm natural-language-processing nlp pyspark python sentiment-classification spark spark-ml transfer-learning transformers word2vec

Last synced: 04 Jan 2025

https://github.com/diegoribeiro2/analise_de_transcoes_pix_para_deteccao_de_fraudes_com_pyspark-_big_data

Practical case study analyzing PIX transactions to detect fraud, from collecting and understanding the data to modeling a fraud detection algorithm.

jupyter-notebook python spark

Last synced: 04 Jan 2025

https://github.com/hfleitas/synapsedelta

Azure Synapse Analytics

delta-lake spark

Last synced: 09 Feb 2025

https://github.com/akaliutau/k8s-spark-operator

Harnessing the Spark Operator in a K8s cluster

docker helm-charts kuberentes spark spark-operator

Last synced: 11 Jan 2025

https://github.com/pierrekieffer/genericunsupervisedmachinelearning

Generic Clustering algorithm for Apache Spark deployment

kmeans machine-learning mllib silhouette spark

Last synced: 07 Feb 2025

https://github.com/lupusruber/rnmp_homework2

A recommendation system project that trains and evaluates Spark MLlib's ALS model on the MovieLens dataset. Includes Dockerized setup, hyperparameter tuning, and evaluation metrics (RMSE, Precision@K, Recall@K, NDCG) for performance insights.

docker mllib recommender-system spark

Last synced: 09 Feb 2025
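
The evaluation metrics named in the entry above (RMSE, Precision@K, Recall@K, NDCG) can be illustrated without Spark. The following is a minimal pure-Python sketch, assuming binary relevance (an item is either relevant or not) and `k >= 1`; the function names and sample data are hypothetical and are not taken from the repository.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(recommended, relevant, k):
    """Normalized discounted cumulative gain with binary gains."""
    # DCG: each hit at 0-based rank r contributes 1 / log2(r + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    # Ideal DCG: all relevant items (up to k) ranked first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical sample: ranked recommendations and the truly relevant set.
ranked = ["a", "b", "c", "d"]
liked = {"a", "c", "e"}
```

RMSE scores predicted ratings, while the other three score the ordering of a recommendation list, which is why a recommender is often evaluated on both kinds of metric.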

https://github.com/danieldacosta/etl-spark-stepfunctions

ETL pipeline using Spark on an EMR cluster and Step Functions for orchestration.

aws aws-step-functions etl spark

Last synced: 11 Jan 2025

https://github.com/danieldacosta/etl-spark-parallel-stepfunctions

Execute EMR Jobs in parallel

emr spark step-functions

Last synced: 11 Jan 2025

https://github.com/ophiase/big-data-project-ifeby310

Analysis website for the New York shared bike system (Citibikes 🚲️) dataset. Extract, load, transform (ELT) using PySpark with the Parquet format.

bigdata spark

Last synced: 19 Jan 2025

https://github.com/kometen/parsexml

Run Databricks XML-parser from command line.

databricks sbt scala spark xml-parser

Last synced: 11 Jan 2025

https://github.com/lawal-hash/dataeng

dbt docker docker-compose kafka marge postgresql spark terraform

Last synced: 11 Jan 2025

https://github.com/rongfengliang/spark-k8s-deploy

spark-k8s-deploy

big-data docker kubernetes spark

Last synced: 11 Jan 2025

https://github.com/jeet1995/spark-stock-trading-simulator

This project uses the Apache Spark framework to simulate possible profits and losses for a given portfolio and an initial investment value. The investment pattern on this portfolio is randomized with the help of Monte Carlo simulations.

monte-carlo scala spark

Last synced: 04 Jan 2025
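
The Monte Carlo idea described in the entry above can be sketched in a few lines of plain Python. This is an illustration only, not the repository's method: it assumes daily portfolio returns are independent draws from a normal distribution, and the function names, parameters, and sample values are all hypothetical.

```python
import random


def simulate_final_values(initial_value, days, n_trials,
                          mean_daily_return, daily_volatility, seed=0):
    """Run n_trials random walks and return the final portfolio values.

    Assumption: each day's return is drawn independently from
    Normal(mean_daily_return, daily_volatility).
    """
    rng = random.Random(seed)  # seeded so runs are reproducible
    finals = []
    for _ in range(n_trials):
        value = initial_value
        for _ in range(days):
            value *= 1.0 + rng.gauss(mean_daily_return, daily_volatility)
        finals.append(value)
    return finals


def summarize(finals, initial_value):
    """Mean profit plus rough 5th and 95th percentile final values."""
    ordered = sorted(finals)
    last = len(ordered) - 1
    return {
        "mean_profit": sum(ordered) / len(ordered) - initial_value,
        "p05": ordered[int(0.05 * last)],
        "p95": ordered[int(0.95 * last)],
    }


# Hypothetical inputs: 10,000 invested over 20 trading days, 500 trials.
trial_values = simulate_final_values(10000.0, 20, 500, 0.0005, 0.01, seed=42)
trial_summary = summarize(trial_values, 10000.0)
```

Each trial is independent of the others, which is what makes this kind of simulation a natural fit for a data-parallel engine such as Spark.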

https://github.com/sjtufl/entroanomaly

Mining anomalies using traffic feature distributions

anomaly entropy spark

Last synced: 18 Jan 2025

https://github.com/latiefdatavisionary/datasea-spark-itb-2025

data-science-competition datathon itb spark

Last synced: 24 Jan 2025

https://github.com/nicor88/docker-examples

Examples to build docker images

cratedb docker kafka mysql postgres spark

Last synced: 19 Jan 2025

https://github.com/yasarsultan/taxi-trip-analysis

The NYC Taxi Trip Batch Data Pipeline automates processing of large-scale trip data using Apache Spark and Airflow, integrating AWS S3 and Google BigQuery for storage and analytics. It features scalable, containerized workflows with robust data validation.

airflow aws-s3 bash-script batch-processing bigquery data-lake data-warehouse docker python3 spark

Last synced: 11 Jan 2025

https://github.com/dueyfinster/pluralsight

Course Examples from Pluralsight

java kafka kubernetes python spark

Last synced: 12 Jan 2025

https://github.com/dougdss89/wideworldadventure

This repository includes all files that compose the design and unification of the databases AdventureWorks and WideWorldAdventure project.

bigdata databricks datalake datawarehouse dbt deltalake duckdb elt etl etl-pipeline spark

Last synced: 30 Jan 2025

https://github.com/ynazymko12/goit-de-hw-03

Homework for Data Engineering course

spark

Last synced: 30 Jan 2025

https://github.com/s-yazhini/pyspark-and-sparksql

In Azure Databricks

azure-databricks cluster-analysis pyspark spark spark-sql

Last synced: 30 Jan 2025

https://github.com/purcellcjp/home_sales

This project demonstrated the usage of SparkSQL to read, query, cache, and analyze home sales data, providing insights into average prices based on various criteria.

big-data cache parquet spark spark-sql sql

Last synced: 30 Jan 2025

https://github.com/riccardorevalor/spark

Spark exercises

pyspark spark spark-rdd spark-sql

Last synced: 30 Jan 2025

https://github.com/abdelhaqs/pyspark_advanced_dataframe_concepts

This project provides a Docker-based setup to explore advanced PySpark DataFrame concepts using Jupyter notebooks. The environment includes all necessary dependencies, making it easy to get started with PySpark for data processing and analysis.

docker pyspark spark

Last synced: 30 Jan 2025

https://github.com/denisogr/kaggle-notebook-to-production

This is a study project. I get analytics/ML examples from Kaggle and use different technologies to re-implement them.

bigquery data-engineering gcp kaggle-competition kaggle-dataset python spark

Last synced: 12 Jan 2025

https://github.com/rupeshtr78/spark-streaming

Spark Streaming Big Data Hadoop

big-data bigdata cassandra hadoop hdfs hive kafka mongodb scala spark spark-streaming

Last synced: 12 Jan 2025

https://github.com/rupeshtr78/mqttspark

IOT Device MQTT Spark Streaming

cassandra gpio iot mqtt mqtt-broker mqtt-client raspberry-pi spark spark-streaming yarn

Last synced: 12 Jan 2025

https://github.com/rupeshtr78/machine_learning

Machine Learning TensorFlow Neural Networks Deep Learning

classification data-analysis deep-learning deep-neural-networks flink jupyter-notebook keras machine-learning machinelearning-python perceptron python3 spark tensorflow

Last synced: 12 Jan 2025

https://github.com/rupeshtr78/bigdata

elasticsearch hive hue kafka kafka-streams presto presto-cassandra-hive spark spark-hdfs-hive spark-streaming

Last synced: 12 Jan 2025

https://github.com/rupeshtr78/blog

Big Data Spark Hadoop Kafka Flink Spark Streaming

aws bigdata cassandra elasticsearch emr-cluster flink hadoop hive hue kafka mapreduce mongodb oozie spark sparkstreaming yarn

Last synced: 12 Jan 2025

https://github.com/margaretkhendre/home-sales-vs-big-data

In this repository, Google Colab is paired with SparkSQL to determine key metrics about home sales data. Spark is also used to create temporary views, partition data, and cache/uncache a temporary table in the process.

big-data googlecollab ipynb-jupyter-notebook pyspark spark sparksql sql