Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
![](https://explore-feed.github.com/topics/spark/spark.png)
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-02-15 00:25:38 UTC
https://github.com/ericlondon/spark-csv-to-elasticsearch
Spark CSV to Elasticsearch
apache csv docker elasticsearch export hadoop spark
Last synced: 12 Jan 2025
https://github.com/dmschauer/wap-pattern-iceberg-pyspark-aws-glue
This repository shows how to implement the Write-Audit-Publish (WAP) pattern using Apache Spark and Apache Iceberg. It's aimed at Data Engineers who want to get started quickly.
apache-iceberg apache-spark aws aws-glue iceberg pyspark spark
Last synced: 31 Dec 2024
https://github.com/teo-sl/us_flights_analysis
This repository contains a dashboard to visualize the US flights data and notebooks for some ML tasks on the same data
big-data classification dash dashboard flights machine-learning plotly regression spark usa
Last synced: 16 Jan 2025
https://github.com/teo-sl/denver-crimes-dash
This is a simple dashboard made with Dash and Plotly (for the frontend) and Apache Spark (for the backend)
Last synced: 16 Jan 2025
https://github.com/ankushkhanna/spark-common
Spark Commons, some hacks to simplify programming with Spark.
Last synced: 17 Jan 2025
https://github.com/maheshwarineeraj/simplestreamingruleengine
Simple Rule-Engine for streaming data
bigdata rule-engine spark spark-sql spark-streaming streaming
Last synced: 22 Jan 2025
https://github.com/luismanuelamengual/neogroup-sparks
Great Server framework with MVC oriented structure
Last synced: 22 Jan 2025
https://github.com/gabrieltempass/sparkify-churn-prediction
A model to predict churn for a music streaming company, with Spark running on an AWS EMR cluster.
Last synced: 06 Feb 2025
https://github.com/nwtgck/wikipedia-word2vec-playground-spark
A playground of word2vec from Wikipedia Dump with Spark
Last synced: 06 Feb 2025
https://github.com/grihabor/spark-metrics-otel-collector
Example Spark metrics configuration with the OpenTelemetry Collector
metrics opentelemetry opentelemetry-collector spark
Last synced: 13 Feb 2025
https://github.com/nwtgck/spark-wikipedia-dump-loader-example
An example of spark-wikipedia-dump-loader
example scala spark wikipedia-dump
Last synced: 06 Feb 2025
https://github.com/vitalibo/distributed-alarm-system
Simple distributed alarm system on top of Apache Spark
Last synced: 27 Dec 2024
https://github.com/xpcosmos/jaffle-shop
Modern Data Stack with dbt, PySpark, PostgreSQL and Docker
dbt docker docker-compose pyspark python spark
Last synced: 17 Jan 2025
https://github.com/vitalibo/aws-glue-java
Simple PoC that demonstrates the use of Java in AWS Glue ETL pipelines.
Last synced: 27 Dec 2024
https://github.com/lorenzobloise/motion_insights
Application for real-time big data analysis from a Body Sensor Network, developed using Spark in Scala and Kafka
angular chartjs kafka real-time scala spark spark-sql spark-streaming
Last synced: 17 Jan 2025
https://github.com/vigneshss-07/bigdata_technologies
This repo contains all technical knowledge and implementation of big data technologies.
big-data hadoop hadoop-hdfs hbase hive hive-metastore kafka mapreduce-python pyspark spark sparksql
Last synced: 16 Jan 2025
https://github.com/adrianmarino/recommendation-system-approaches
Recommendation system approaches
deep-learning keras modin movielens ray recommender-system spark tensorflow
Last synced: 24 Jan 2025
https://github.com/georgeerol/georgeerol.github.io
George Fouche Portfolio
airflow android-application aws cassandra deploy full-stack-web-development jpa-hibernate postgresql react robotics robotics-simulation spark spring-boot spring-mvc spring-security
Last synced: 13 Jan 2025
https://github.com/springworks/node-scale-reader
Application that reads value off a scale from a local file stream
Last synced: 22 Jan 2025
https://github.com/javadbahoosh/spark-streaming-multi-language-docker
Dockerized infrastructure and boilerplate code for consuming Kafka topics with Spark Streaming in Scala, Python, and Java, featuring Redis integration for result aggregation.
Last synced: 30 Jan 2025
https://github.com/hatamiarash7/kubernetes-spark
Deploy Apache Spark cluster in Kubernetes
apache apache-spark kubernetes spark
Last synced: 03 Feb 2025
https://github.com/bst-depractice/spark_play
Setup and practice pyspark transformations
Last synced: 22 Jan 2025
https://github.com/francesco-biscaccia-carrara/bigdata_projects
Assignment repository for the Big Data Computing course at the University of Padova for the academic year 2023-2024.
big-data k-center-problem map-reduce reservoir-sampling spark spark-streaming sticky-sampling
Last synced: 22 Jan 2025
https://github.com/non-neutralzero/spark-feature-engineering-toolkit
Snippets of spark/scala code used to do some handy feature engineering
data-engineering feature-engineering feature-extraction scala spark spark-sql
Last synced: 22 Jan 2025
https://github.com/dutrevis/spark-resources-metrics-plugin
Spark plugin to retrieve metrics from a variety of cluster resources
Last synced: 30 Dec 2024
https://github.com/erikgartner/prometheus-cc-extractor
This repository contains mapreduce extractors to preprocess and extract websites from the common crawl corpus.
big-data common-crawl data-extraction mapreduce spark
Last synced: 06 Feb 2025
https://github.com/hafizhhasyhari/Big-Data_AI_Streaming-Data-Visualization-2024
Big Data with Spark and Scala by hafizhhasyhari. Coursework from semester 3.
big-data data-analytic data-besar data-science scala spark
Last synced: 08 Feb 2025
https://github.com/safaa-p/machine-failure-prediction
Predicting machine failure using machine learning on a synthetic dataset of an existing milling machine, consisting of 10,000 data points
big-data classification clustering decision-tree machine-failure machine-learning neural-network pyspark-mllib spark spark-sql svm-classifier unbalanced-data
Last synced: 16 Jan 2025
https://github.com/roaajadaa/sparksearchengine
Build a small-scale spark-based search engine which searches in a list of documents to find those answering a user’s query.
bigdata indexing inverted-index mongodb scala search-engine spark
Last synced: 13 Feb 2025
https://github.com/jonny-binns/fase-practicals
Contains the practice work and code from lectures in the Formal Approaches to Software Engineering Module
Last synced: 18 Jan 2025
https://github.com/seilylook/spark_definition_guide_ch_3
Spark: The Definitive Guide - Chapter 3
Last synced: 16 Jan 2025
https://github.com/billxsheng/oubre-sentiment-analysis
Complete data platform that performs sentiment analysis on tweets. Built using Cassandra, Kafka, Spark, Node, and React.
cassandra etl-pipeline java kafka nodejs sentiment-analysis spark twitter-api
Last synced: 17 Jan 2025
https://github.com/coreyauger/spark-ashley-madison-ml
Spark ML hacking on the Ashley Madison dataset.
Last synced: 16 Jan 2025
https://github.com/dev88jerry/cs450
Bishop's University - CS450 Elements of Big Data
big-data data-science hadoop spark
Last synced: 08 Jan 2025
https://github.com/mustafahakkoz/preprocessing_w_spark
Preprocessing + Feature Extraction pipeline by SPARK and Neo4J
cypher feature-engineering mongodb neo4j nlp spark
Last synced: 28 Jan 2025
https://github.com/guysuphakit/etl_platform_nyc_taxi
airflow docker postgresql python spark
Last synced: 17 Jan 2025
https://github.com/brinthat/world-development-indicators
Exploring World Development Indicators: Identifying relationship between Health Indicators using Linear Regression & Classification of Income Group based on Health Indicators using Logistic Regression.
lasso-regularization linear-regression python ridge-regression-model scatter-plot spark spark-sql
Last synced: 17 Jan 2025
https://github.com/ssanthosh010303/collection-data-training
A collection of challenges exercised during data training program.
airflow apache azure azure-data-factory azure-databricks azure-logic-apps bigdata data hadoop spark
Last synced: 17 Jan 2025
https://github.com/majobasgall/bash_scripts_potpourri
A little bit of everything: cleanup Arch-based systems, extraction, system info, git config, project structuring, spark parameters, data backup, webcam control, and more!
archlinux bash-script data-science github-config manjaro-linux spark
Last synced: 08 Jan 2025
https://github.com/sharoonjoseph321/insurance_fraud_detection
Fraud detection using the K-Nearest Neighbors machine learning algorithm. Data exploration using PySpark and matplotlib.
analytics data data-science eda high-performance knn-algorithm knn-classification machine-learning matplotlib-pyplot pyspark python seaborn spark statistics
Last synced: 28 Jan 2025
https://github.com/yuhexiong/raw-sql-spark-submit-api-python-flask
apache-spark api backend flask python spark
Last synced: 28 Jan 2025
https://github.com/holdenk/luigi-rewrite-rug
WIP Proof of Concept Luigi Pipeline Rewriting for testing using rug
Last synced: 31 Jan 2025
https://github.com/asora6/java-8-features
Welcome to the Java 8 Features Repository! This repository highlights key features of Java 8, such as Lambda Expressions, Functional Interfaces, Method References, Constructor References, Stream API, and the Local Date and Time API. It includes notes and programs for practical understanding. Explore and enhance your knowledge of Java 8 features!
functional-interfaces gradle-plugin java java-date java7 java8-examples javascript jdk8 lambda-expressions lambda-functions nashorn optional spark streams-api
Last synced: 28 Jan 2025
https://github.com/rickymiura/slack-posts-eda
In this repository, I perform EDA on a large dataset of Slack posts using Apache Spark and AWS to efficiently uncover trends and insights at scale.
big-data distributed-computing spark
Last synced: 28 Jan 2025
https://github.com/juanmanuel-tirado/pyspark-tutorial
This is a collection of PySpark tutorials
jupyter-notebook machine-learning ml pyspark python spark tutorial
Last synced: 04 Feb 2025
https://github.com/sayamalt/pyspark-for-big-data-and-machine-learning
This is the material for Jose Portilla's Spark and Python for Big Data and ML course.
classification clustering decision-tree-classifier gbt-classification kmeans-clustering linear-regression linear-svc logistic-regression pyspark pyspark-machine-learning random-forest-classifier recommendation-systems regression spark spark-mllib spark-streaming spark-structured-streaming
Last synced: 17 Jan 2025
https://github.com/abinba/pe-analyzer
Preprocessor of PE (Portable Executable) files (dll, exe) using Spark.
pefile portable-executable postgres pyspark python spark
Last synced: 17 Jan 2025
https://github.com/simplexspatial/simplexspatial-data-distribution-analysis
Analysis of the data distribution of the OSM dataset.
Last synced: 15 Jan 2025
https://github.com/sai-mohan-b/spark-structured-streaming
This repo is for Structured Streaming and related projects
pyspark-notebook spark spark-streaming
Last synced: 09 Jan 2025
https://github.com/dina-hosny/data-engineering-capstone-project
Data Engineering Capstone Project - Udacity Data Engineering Expert Track.
analytics cassandra data-engineering data-pipelines data-science etl fwd spark udacity
Last synced: 13 Jan 2025
https://github.com/maxyermayank/spark-redis-python
Batch writes into Redis using Python and Spark
aws aws-memory-db bulk-loader memorydb redis redis-client redis-py spark
Last synced: 17 Jan 2025
https://github.com/librity/rtjvm_spark_streaming
Rock The JVM - Apache Spark Streaming
akka apache-spark docker kafka kafka-streams scala spark spark-streaming twitter-api
Last synced: 08 Jan 2025
https://github.com/mxagar/data_engineering_guide
Personal notes on the IBM Data Engineering Certificate as well as other sources focusing on AWS.
airflow aws data-lake data-modeling data-pipelines data-science no-sql spark sql warehouse
Last synced: 23 Dec 2024
https://github.com/sanogotech/docker-airflowsparkkafkadata-engineeringend-to-end
Docker Apache Airflow Data Engineering End-to-End Project — Spark, Kafka, Airflow, Docker, Cassandra, Python
airflow cassandra cassandra-database dataengineering docker kafka python spark
Last synced: 23 Jan 2025
https://github.com/johngodoi/scalasparkkafka
This code just loads data to kafka through apache spark and reads it back.
docker docker-compose kafka spark spark-kafka spark-sql spark-streaming
Last synced: 06 Feb 2025
https://github.com/zkan/data-wrangling-with-spark
Data Wrangling with Spark
data-engineering data-wrangling pyspark python spark
Last synced: 12 Feb 2025
https://github.com/zkan/dtc-data-engineering-zoomcamp
DataTalks.Club's Data Engineering Zoomcamp
airflow data-engineering dbt docker kafka spark
Last synced: 12 Feb 2025
https://github.com/tupol/spark-xkmeans
Extension to the standard K-Means implementation of Spark ML library
clustering kmeans kmeans-clustering library machine-learning scala spark
Last synced: 17 Jan 2025
https://github.com/cvinicius987/projetos-bigdata
Case studies involving Big Data and Data Engineering projects.
bigdata data data-engineering spark
Last synced: 12 Jan 2025
https://github.com/kajal-52/spark-with-scala-learning
This project covers examples about RDD, dataframe, and dataset API with Apache Spark using Scala.
Last synced: 28 Jan 2025
https://github.com/dirmeier/spark-travis
Testing Apache Spark using Travis.
Last synced: 17 Jan 2025
https://github.com/hrolive/patc-big-data-analytics-bsc
Introduction to the main concepts and technologies related to Big Data and Data Analytics and its applications to real projects.
analytics bias big-data data-analysis hadoop hpc machine-learning mapreduce nosql python spark spark-streaming visualization
Last synced: 04 Jan 2025
https://github.com/f-lab-edu/commerce-sessionization
Building a Spark-Airflow pipeline for sessionizing user behavior data
Last synced: 01 Feb 2025
https://github.com/chen0040/spark-tabular-analytics
Spark statistical inference framework for performing column pair-wise data analytics for large data table
anova chi-square-test confidence-intervals data-analysis hypothesis-testing spark statistical-inference tabular-data
Last synced: 09 Feb 2025
https://github.com/swarup4741/spark
A CLI to bootstrap a vanilla js project
cli javascript spark vanillajs
Last synced: 08 Jan 2025
https://github.com/samuelbarbosadev/justweb_technical_test
This is a technical test for the mid-level Python Developer position.
Last synced: 27 Jan 2025
https://github.com/timvisee/hhs-p7-spark-docker
🐳 Docker container for Spark at college (HHS).
college docker docker-container jupyter-notebook pyspark spark
Last synced: 15 Jan 2025
https://github.com/pixelbyaj/apache-spark
Start Apache Spark with Python - pyspark
apache-spark pyspark-python python spark winutils
Last synced: 17 Jan 2025
https://github.com/dexterposh/azurehdinsight
Repository housing the artifacts to deploy the Hadoop clusters on Azure for my learning.
azure hadoop hdinsight-cluster learning-by-doing spark
Last synced: 04 Jan 2025
https://github.com/chen0040/spark-ml-commons
Package providing common utilities for Spark ML
Last synced: 09 Feb 2025
https://github.com/robertdavidwest/blog-post-analysis
Analysing blog post data using Spark/Scala
Last synced: 15 Jan 2025
https://github.com/sebastianruizm/data-eng-coding-challenge
PoC Data Migration
airflow docker fastapi postgresql spark superset
Last synced: 08 Jan 2025
https://github.com/chimera-suite/test-spark-datatypes
Test OntopSpark's OBDA datatype conversions to be compliant with W3C standards
Last synced: 03 Jan 2025
https://github.com/ammahmoudi/mapreduce-examples
MapReduce examples using pure Scala and then using Spark
map-reduce mapreduce scala spark spark-mapreduce
Last synced: 15 Jan 2025
https://github.com/mineshmelvin/aws-glue-scala-accident-analysis
This guide outlines procedures for developing Apache Spark jobs in Scala for AWS Glue deployment. It covers setting up environment variables, installing IntelliJ IDEA with the Scala plugin, and creating a Scala Maven project that serves as a starting point for developers leveraging Spark in scalable Glue data processing applications.
Last synced: 09 Feb 2025
https://github.com/shuuji3/spark-ceph-connector
🌟Spark Ceph Connector: Implementation of Hadoop Filesystem API for Ceph
apache-hadoop apache-spark ceph hadoop spark
Last synced: 27 Jan 2025
https://github.com/jofaval/mnist
Computer Vision with Neural Networks for handwritten digits recognition in 1998
classification computer-vision data-science deep-learning deep-neural-networks google-colab keras matplotlib pyspark python pytorch seaborn spark tensorflow
Last synced: 04 Feb 2025
https://github.com/armahdavi/big_data_spark_building_iot_sensor_ieq_analytics_ml_thermal_comfort_murb_retrofit_social_housing
This repository summarizes my analytics, big data, and ML code work from a Multi-Unit Residential Building (MURB) retrofit project run back during my Ph.D.,
matplotlib-pyplot pandas polars psypy pyspark pythermalcomfort python seaborn sklearn spark sparksql stata
Last synced: 13 Feb 2025
https://github.com/captainirs/hadoop-yarn-k8s
A sandbox for running a Hadoop-YARN cluster on Kubernetes
Last synced: 11 Jan 2025
https://github.com/yosrak5/data-streaming
This project involves the development of a robust data engineering pipeline that orchestrates the seamless ingestion, processing, and storage of data.
airflow-dags apache cassandra docker etl kafka python spark
Last synced: 11 Dec 2024
https://github.com/iampavangandhi/sparkfoundationtask
Spark Foundation Basic UI Task
Last synced: 17 Jan 2025
https://github.com/ayresgneto/use-case-gcp-etl
ELT pipeline on GCP. Technologies used: PostgreSQL, GCP Storage, Airflow (local), PySpark (local), BigQuery
airflow big-data bigquery data data-engineering etl gcp pipeline postgresql programming-oriented-object pyspark python spark
Last synced: 21 Jan 2025