Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Apache Spark
![](https://explore-feed.github.com/topics/spark/spark.png)
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- GitHub: https://github.com/topics/spark
- Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark
- Repo: https://github.com/apache/spark
- Created by: Matei Zaharia
- Released: May 26, 2014
- Related Topics: scala, hadoop
- Aliases: apache-spark
- Last updated: 2025-02-15 00:25:38 UTC
https://github.com/brinthat/world-development-indicators
Exploring World Development Indicators: identifying relationships between health indicators using linear regression, and classifying income groups from health indicators using logistic regression.
lasso-regularization linear-regression python ridge-regression-model scatter-plot spark spark-sql
Last synced: 17 Jan 2025
https://github.com/guysuphakit/etl_platform_nyc_taxi
airflow docker postgresql python spark
Last synced: 17 Jan 2025
https://github.com/aidenfockens/aiden_neel_healthevents
Consumes Kafka health events and stores them in AWS RDS, then uses Flask to query the data and perform EDA
flask kafka kubernetes spark sql
Last synced: 13 Jan 2025
https://github.com/dexterposh/azurehdinsight
Repository housing the artifacts to deploy the Hadoop clusters on Azure for my learning.
azure hadoop hdinsight-cluster learning-by-doing spark
Last synced: 04 Jan 2025
https://github.com/mustafahakkoz/preprocessing_w_spark
Preprocessing and feature extraction pipeline using Spark and Neo4j
cypher feature-engineering mongodb neo4j nlp spark
Last synced: 28 Jan 2025
https://github.com/dev88jerry/cs450
Bishop's University - CS450 Elements of Big Data
big-data data-science hadoop spark
Last synced: 08 Jan 2025
https://github.com/coreyauger/spark-ashley-madison-ml
Spark ML hacking on the Ashley Madison dataset.
Last synced: 16 Jan 2025
https://github.com/maheshwarineeraj/quickcodeblocks
Collection of small reusable code blocks and automations
Last synced: 13 Jan 2025
https://github.com/robertdavidwest/blog-post-analysis
Analysing blog post data using Spark/Scala
Last synced: 15 Jan 2025
https://github.com/sebastianruizm/data-eng-coding-challenge
PoC Data Migration
airflow docker fastapi postgresql spark superset
Last synced: 08 Jan 2025
https://github.com/chimera-suite/test-spark-datatypes
Test OntopSpark's OBDA datatype conversions to be compliant with W3C standards
Last synced: 03 Jan 2025
https://github.com/ammahmoudi/mapreduce-examples
MapReduce examples using pure Scala and then using Spark
map-reduce mapreduce scala spark spark-mapreduce
Last synced: 15 Jan 2025
https://github.com/shuuji3/spark-ceph-connector
🌟Spark Ceph Connector: Implementation of Hadoop Filesystem API for Ceph
apache-hadoop apache-spark ceph hadoop spark
Last synced: 27 Jan 2025
https://github.com/jofaval/mnist
Computer Vision with Neural Networks for handwritten digits recognition in 1998
classification computer-vision data-science deep-learning deep-neural-networks google-colab keras matplotlib pyspark python pytorch seaborn spark tensorflow
Last synced: 04 Feb 2025
https://github.com/billxsheng/oubre-sentiment-analysis
Complete data platform that performs sentiment analysis on tweets. Built using Cassandra, Kafka, Spark, Node, and React.
cassandra etl-pipeline java kafka nodejs sentiment-analysis spark twitter-api
Last synced: 17 Jan 2025
https://github.com/seilylook/spark_definition_guide_ch_3
Spark: The Definitive Guide - Chapter 3
Last synced: 16 Jan 2025
https://github.com/jonny-binns/fase-practicals
Contains the practice work and code from lectures in the Formal Approaches to Software Engineering Module
Last synced: 18 Jan 2025
https://github.com/safaa-p/machine-failure-prediction
Predicting machine failure using machine learning on a synthetic dataset of an existing milling machine, consisting of 10,000 data points
big-data classification clustering decision-tree machine-failure machine-learning neural-network pyspark-mllib spark spark-sql svm-classifier unbalanced-data
Last synced: 16 Jan 2025
https://github.com/hafizhhasyhari/Big-Data_AI_Streaming-Data-Visualization-2024
Big Data with Spark and Scala, by hafizhhasyhari. Coursework from semester 3
big-data data-analytic data-besar data-science scala spark
Last synced: 08 Feb 2025
https://github.com/captainirs/hadoop-yarn-k8s
A sandbox for running a Hadoop-YARN cluster on Kubernetes
Last synced: 11 Jan 2025
https://github.com/dutrevis/spark-resources-metrics-plugin
Spark plugin to retrieve metrics from a variety of cluster resources
Last synced: 30 Dec 2024
https://github.com/non-neutralzero/spark-feature-engineering-toolkit
Snippets of spark/scala code used to do some handy feature engineering
data-engineering feature-engineering feature-extraction scala spark spark-sql
Last synced: 22 Jan 2025
https://github.com/francesco-biscaccia-carrara/bigdata_projects
Assignment repository for the Big Data Computing course at the University of Padova for the academic year 2023-2024.
big-data k-center-problem map-reduce reservoir-sampling spark spark-streaming sticky-sampling
Last synced: 22 Jan 2025
https://github.com/bst-depractice/spark_play
Setup and practice pyspark transformations
Last synced: 22 Jan 2025
https://github.com/flynn3103/loadhouse
Loading data into the Lakehouse using JSON configuration and utilities for ETL tasks.
Last synced: 13 Jan 2025
https://github.com/ayresgneto/use-case-gcp-etl
ELT pipeline on GCP. Technologies used: PostgreSQL, GCP Storage, Airflow (local), PySpark (local), BigQuery
airflow big-data bigquery data data-engineering etl gcp pipeline postgresql programming-oriented-object pyspark python spark
Last synced: 21 Jan 2025
https://github.com/springworks/node-scale-reader
Application that reads value off a scale from a local file stream
Last synced: 22 Jan 2025
https://github.com/matteofasulo/contact-center_databricks
Analysis of Contact Center data on Databricks
databricks datamining pandas rdbms spark
Last synced: 20 Jan 2025
https://github.com/ebonnal/annotweet
Sentiment Analysis project on tweets.
classification nlp nlp-machine-learning scala sentiment-analysis spark spark-ml tweets twitter
Last synced: 21 Jan 2025
https://github.com/deepcloudlabs/big-data-essentials
DCL-700: Big Data Essentials
hadoop-3 hadoop-mapreduce hdfs machine-learning spark spark-ml spark-sql spark-streaming
Last synced: 08 Jan 2025
https://github.com/deepcloudlabs/dcl700-2021-jun-21
DCL-700: Big Data Essentials
big-data hadoop hadoop-mapreduce hdfs hive spark spark-sql spark-streaming spark-streaming-kafka
Last synced: 08 Jan 2025
https://github.com/rurumimic/spark-cheatsheet
cheatsheet
cheatsheet hadoop hive sbt scala spark
Last synced: 03 Jan 2025
https://github.com/adrianmarino/recommendation-system-approaches
Recommendation system approaches
deep-learning keras modin movielens ray recommender-system spark tensorflow
Last synced: 24 Jan 2025
https://github.com/vigneshss-07/bigdata_technologies
This repo contains all technical knowledge and implementation of big data technologies.
big-data hadoop hadoop-hdfs hbase hive hive-metastore kafka mapreduce-python pyspark spark sparksql
Last synced: 16 Jan 2025
https://github.com/luismanuelamengual/neogroup-sparks
Great Server framework with MVC oriented structure
Last synced: 22 Jan 2025
https://github.com/maheshwarineeraj/simplestreamingruleengine
Simple rule engine for streaming data
bigdata rule-engine spark spark-sql spark-streaming streaming
Last synced: 22 Jan 2025
https://github.com/archie-cm/real_time_product_recommendations_with_machine_learning_on_gcp
This project demonstrates how to build a real-time product recommendation system using Pub/Sub Lite and Apache Spark with Dataproc
Last synced: 13 Jan 2025
https://github.com/ankushkhanna/spark-common
Spark Commons, some hacks to simplify programming with Spark.
Last synced: 17 Jan 2025
https://github.com/teo-sl/denver-crimes-dash
This is a simple dashboard made with Dash and Plotly (for the frontend) and Apache Spark (for the backend)
Last synced: 16 Jan 2025
https://github.com/teo-sl/us_flights_analysis
This repository contains a dashboard to visualize the US flights data and notebooks for some ML tasks on the same data
big-data classification dash dashboard flights machine-learning plotly regression spark usa
Last synced: 16 Jan 2025
https://github.com/ericlondon/spark-csv-to-elasticsearch
Spark CSV to Elasticsearch
apache csv docker elasticsearch export hadoop spark
Last synced: 12 Jan 2025
https://github.com/easonlai/log_analytics_with_databricks
Azure Databricks notebook sample to connect Blob Storage of Azure Log Analytics
azure azure-databricks azure-log-analytics azure-storage blob-storage data-analysis-python data-analytics data-wrangling databricks pyspark pyspark-notebook spark
Last synced: 08 Jan 2025
https://github.com/fsanaulla/terling
Linguistic text analysis for detecting terrorist danger.
Last synced: 17 Jan 2025
https://github.com/manuelmtzv/spark_flask_cluster
Docker configuration for setting up a cluster environment with Apache Spark and Flask.
devcontainer docker python spark
Last synced: 31 Jan 2025
https://github.com/afsalthaj/biggudeta-kyukyoku
This project was made for personal learning and to showcase an end-to-end solution in a Big Data environment.
Last synced: 08 Jan 2025
https://github.com/yash-chauhan-dev/etl_airbnb_listing
A scalable ETL pipeline for transforming raw Airbnb listings data into a structured format for price and availability analysis. Built using Spark, HDFS, PostgreSQL, Airflow, and Docker.
airflow docker docker-compose etl etl-pipeline hdfs local-development postgresql python spark sql
Last synced: 11 Feb 2025
https://github.com/colinkiama/snippets
Code snippets used by the Spark Community
code-snippets snippets snippets-collection snippets-library spark uwp
Last synced: 14 Jan 2025
https://github.com/dimitrov-s-dev/pyspark
PySpark
pyspark python3 spark spark-sql
Last synced: 16 Jan 2025
https://github.com/evertonsavio/spark-big-data-analitycs
Base code repository for technologies related to Big Data, such as Spark, Kafka, Storm, and others. Languages: Python, Java
Last synced: 02 Jan 2025
https://github.com/manojkarthick/rightfluencer
An interactive web application and dashboard that allows you to find the right influencers for your brand by analyzing their posts, images and videos.
facebook google-cloud instagram marketing spark twitter youtube
Last synced: 12 Nov 2024
https://github.com/benmizrahi/duckspark
duckspark - A DuckDB based distributed data processing engine
data-engineering distributed-systems golang spark
Last synced: 27 Dec 2024
https://github.com/memojja/basic-recommendation-engine
For learning Apache Spark and the MLlib library
Last synced: 21 Jan 2025
https://github.com/ndleah/stedi
Data Lakehouse solution for machine learning data
aws-athena aws-glue s3-bucket spark
Last synced: 12 Jan 2025
https://github.com/oracle-quickstart/oci-mapr
Terraform module to deploy MapR on Oracle Cloud Infrastructure (OCI)
cloud hadoop mapr oci oracle partner-led spark terraform
Last synced: 07 Nov 2024
https://github.com/milankinen/las-emr
AWS EMR parallelized SeCo Lexical Analysis Services for big data
aws big-data emr finnish nlp spark text-processing
Last synced: 19 Jan 2025
https://github.com/asmrcodez-yt/realtime-voting-dataengineer-spark-kafka
kafka postgres python spark spark-streaming streamlit
Last synced: 24 Jan 2025
https://github.com/librity/rtjvm_spark_tuning
Rock The JVM - Spark Performance Tuning with Scala
Last synced: 08 Jan 2025
https://github.com/okdp/jupyterlab-docker
OKDP JupyterLab Docker images
datascience docker jupyter jupyter-notebook jupyterhub jupyterlab k8s-spark python spark spark-kubernetes spark-python
Last synced: 13 Nov 2024
https://github.com/librity/rtjvm_spark_optimizations
Rock The JVM - Spark Optimizations with Scala
Last synced: 08 Jan 2025
https://github.com/piero24/big-data_hw_23-24
Exercises in Java and Spark for the Big Data Computing course at unipd
big-data clustering fft java mapreduce sampling spark streaming
Last synced: 08 Jan 2025
https://github.com/zsomborjoel/pyspark-basics
Teaching and learning the functionality of the Spark Python API on dataframes
Last synced: 11 Feb 2025
https://github.com/anras5/nyc-yellow-taxi
Processing data streams with Kafka + Spark
docker google-cloud kafka postgresql spark spark-streaming
Last synced: 21 Jan 2025
https://github.com/senior-sigan/coursera_scala_specialization
coursera coursera-data-science scala spark
Last synced: 17 Jan 2025
https://github.com/gabrielenizzoli/spark_engine
Build a complex spark execution plan by composing many different spark operations.
Last synced: 12 Feb 2025
https://github.com/hadarsharon/compars
DataFrame comparison done right, powered by Rust with polars (AKA the bear-agnostic 🐻 🐼 🐨 🐻‍❄️ DataFrame comparison library)
data-engineering data-profiling data-quality dataframe dataframes koalas pandas polars pyspark python rust spark
Last synced: 22 Jan 2025
https://github.com/banknatchapol/us-immigration-data-pipeline
Create a data pipeline for US Immigration data using Spark.
Last synced: 27 Jan 2025
https://github.com/chen0040/spark-ml-recommender
Package provides a Java implementation of a big-data recommender using Apache Spark
alternating-least-squares content-collaborative-filtering cosine-similarity jaccard-similarity pearson-correlation rdd recommender recommender-system spark spark-ml
Last synced: 09 Feb 2025
https://github.com/traunguyentvt/study_big_data_technology
Kafka, Spark Streaming, Spark SQL, Hive, Tableau
api hive json kafka spark spark-sql spark-streaming tableau
Last synced: 14 Jan 2025
https://github.com/iamdsc/bigdataanalytics
Using Spark with Python for analyzing Big Data.
big-data jupyter-notebook python spark
Last synced: 18 Jan 2025
https://github.com/sirnicholas1st/feedback_processor
This repository contains a simple Flask application that serves as a customer feedback form. The submitted data is sent to a Kafka topic. The Kafka consumer, implemented as a Spark application, processes the data and writes it to a Cassandra table for further analysis.
Last synced: 15 Jan 2025
https://github.com/yc1999/scalasparkinaction-peopleyoumightknow
Second-degree friends programming experiment
cloudcomputing peopleyoumightknow scala spark
Last synced: 21 Jan 2025
https://github.com/justinjjlee/simulation-discrete
Employing data transformations and simulations to answer random questions
analytics data data-science julia python simulation spark
Last synced: 28 Jan 2025
https://github.com/sumanthvrao/ipl-spark-analysis
Predict outcomes of IPL cricket matches for the year 2018 using the Spark MLlib framework.
decision-tree kmeans-clustering pyspark spark spark-mllib-library
Last synced: 08 Jan 2025
https://github.com/queraltsm/ada-spark-exercises
Compilation of Ada exercises in formal verification with SPARK 2014
Last synced: 28 Jan 2025
https://github.com/owengregson/sparkev-ui
A recreation of the Tesla UI, made in HTML, CSS, and JS for my EV project.
css3 html5 javascript spark spark-ev tesla tesla-clone tesla-ui tesla-ui-clone ui ui-design web
Last synced: 22 Jan 2025
https://github.com/josericodata/mscdataanalyticssecondsemesterassignmentone
Summary of Assignment One from the Second semester of the MSc in Data Analytics program. This repository contains the CA1 assignment guidelines from the college and my submission. To see all original commits and progress, please visit the original repository using the link below.
advanced-data-analysis big-data big-data-storage-and-processing cct-college cnn-keras data-science dropout-layers dublin hadoop ireland jose-maria-rico-leal jose-rico jupyter-notebook machine-learning msc mysql neural-network rdbms spark ubuntu-linux
Last synced: 17 Jan 2025
https://github.com/bryanbill/tracker
Wildlife animal tracking application
animals handlebars java postgresql spark
Last synced: 26 Dec 2024