Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/turnipdo/spark-standalone-cluster-setup

To facilitate the initial setup of Apache Spark, this repository provides a beginner-friendly, step-by-step guide on setting up a master node and two worker nodes.

python spark spark-cluster

Last synced: 24 Dec 2024

https://github.com/abdoufermat5/twitter-analysis

Twitter data analysis using PySpark

data-analysis pyspark spark twitter twitter-api

Last synced: 10 Jan 2025

https://github.com/flynn3103/loadhouse-toolkit

Loading data into the Lakehouse using JSON configuration and utilities for ETL tasks.

delta-lake spark

Last synced: 14 Feb 2025

https://github.com/pierrekieffer/hbaseconnector

Scala Spark HBase connector with filtered queries

hbase spark spark-hbase

Last synced: 07 Feb 2025

https://github.com/pierrekieffer/dataframesparkstreaming

Easy implementation of Apache Spark Streaming for dataframes

dataframe spark spark-streaming

Last synced: 07 Feb 2025

https://github.com/davila23/utn-dds

UTN - Systems Design

java junit maven spark

Last synced: 21 Jan 2025

https://github.com/piero24/big-data_hw_23-24

Exercises in Java and Spark for the Big Data Computing course at unipd

big-data clustering fft java mapreduce sampling spark streaming

Last synced: 08 Jan 2025

https://github.com/zsomborjoel/pyspark-basics

Teaching and learning the functionality of the Spark Python API on dataframes

basics dataframes spark

Last synced: 11 Feb 2025

https://github.com/anras5/nyc-yellow-taxi

Processing data streams with Kafka + Spark

docker google-cloud kafka postgresql spark spark-streaming

Last synced: 21 Jan 2025

https://github.com/alexcombessie/ensae_scala-spark

Several mini-projects in Scala/Spark for the "Computer Science for the analysis of Big Data" course at ENSAE ParisTech

ensae-paristech scala spark

Last synced: 24 Dec 2024

https://github.com/banknatchapol/us-immigration-data-pipeline

Create a data pipeline for US immigration data using Spark.

data-pipeline spark

Last synced: 27 Jan 2025

https://github.com/justinjjlee/simulation-discrete

Employing data transformations and simulations to answer random questions

analytics data data-science julia python simulation spark

Last synced: 28 Jan 2025

https://github.com/sumanthvrao/ipl-spark-analysis

Predict outcomes of IPL Cricket Matches for the year 2018 using Spark MLLib framework.

decision-tree kmeans-clustering pyspark spark spark-mllib-library

Last synced: 08 Jan 2025

https://github.com/msampathkumar/scalaprojects

Scala Projects - From basic Scala learning tutorials to Big Data (Apache Spark) projects

apache-spark scala spark

Last synced: 01 Jan 2025

https://github.com/bryanbill/tracker

Wildlife animal tracking application

animals handlebars java postgresql spark

Last synced: 26 Dec 2024

https://github.com/uselessscat/spark-kafka-example

Data pipeline using nifi, kafka and spark

docker hadoop kafka nifi scala spark

Last synced: 14 Jan 2025

https://github.com/omr5221/esbi_spark

Use Spark with Scala to Pivot data

scala spark

Last synced: 27 Jan 2025

https://github.com/chucheng92/structuredstreaming

Structured Streaming Demo

spark streaming

Last synced: 01 Feb 2025

https://github.com/ccao-data/service-spark-iasworld

Service for extracting tables from the CCAO system-of-record and uploading them to the Data Department's data warehouse

etl iasworld spark

Last synced: 14 Feb 2025

https://github.com/hussein-awala/stream-applications

A repository containing examples of stream processing applications using Spark Structured Streaming, Kafka Streams, and some other tools like Apache Hudi...

hudi kafka kafka-connect kafka-streams spark spark-streaming

Last synced: 01 Feb 2025

https://github.com/vitalibo/aws-glue-java

Simple PoC that demonstrates the use of Java in AWS Glue ETL pipelines.

aws glue spark

Last synced: 27 Dec 2024

https://github.com/casschow98/spotify_insights_project

Welcome to the Spotify Insights Data Pipeline Project where I analyze data from my Spotify listening history ~

airflow big-query data-analytics data-engineering docker etl pandas pyspark python song-analysis spark spotify-api terraform

Last synced: 14 Feb 2025

https://github.com/aravind2060/hr_data_analysis_with_spark_structured_api

This assignment helps students learn how to use filtering, grouping, and aggregation operations.

apache-spark docker docker-compose filtering spark

Last synced: 21 Jan 2025

https://github.com/samiksha-khare/crypto-real-time-analysis-using-kafka

This project showcases the process of streaming real-time cryptocurrency data using Kafka, storing the data in a MongoDB database, and visualizing the price trends over time with Python libraries like Matplotlib.

api cryptocurrency docker kafka kafka-consumer kafka-producer kafka-topic matplotlib mongodb nosql python real-time spark streamlit visualization zookeeper

Last synced: 21 Jan 2025

https://github.com/edwin-huber/sparkandlivyonaksspot

Repo to support POC Documentation for the use of Livy, AKS and Azure Spot Instances

azure kubernetes spark

Last synced: 14 Jan 2025

https://github.com/vitalibo/distributed-alarm-system

Simple distributed alarm system on top of Apache Spark

aws azure spark

Last synced: 27 Dec 2024

https://github.com/riolaf05/spark-elasticsearch-recommendation

Recommendation system using Alternating Least Squares (ALS) and cosine similarity on PySpark and Elasticsearch

collaborative-filtering docker elasticsearch machine-learning pyspark recommendation-system spark

Last synced: 21 Jan 2025

https://github.com/feliciamarlove/streaming-with-scala-and-spark

Related to Handling Fast Data with Apache Spark SQL and Streaming course on Pluralsight https://app.pluralsight.com/library/courses/apache-spark-sql-fast-data-handling-streaming/exercise-files

data-engineering hive parquet scala spark streaming

Last synced: 11 Feb 2025

https://github.com/shink/spark-ml-algorithm-docker

Spark ML algorithms on docker

docker-image spark spark-ml

Last synced: 01 Feb 2025

https://github.com/mauroslucios/pysparkwithpython

https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession

docker jupyter-notebook linux notebook pyspark python spark visual-studio-code

Last synced: 14 Feb 2025

https://github.com/thanhvie/de_udacity_capstone

Project: This is the capstone project of my nanodegree data engineering program at Udacity

spark

Last synced: 14 Jan 2025

https://github.com/holgerbrandl/spark_image_labeling

Image labeling experiments using apache spark

benchmarking hpc scala spark

Last synced: 21 Jan 2025

https://github.com/jakipatryk/spark-persistent-homology

(WIP, not fast enough for any production usage yet) Library for persistent homology computations in Apache Spark.

persistent-homology spark tda topological-data-analysis

Last synced: 30 Dec 2024

https://github.com/n370/spark-quickstart

A barebones seed repository for starting Spark Java micro framework projects.

java micro-framework spark sparkjava sparkjava-framework

Last synced: 21 Jan 2025

https://github.com/bakansm/vithsd

Vietnamese hate speech detection with real-time data from streaming platforms such as YouTube, Facebook, and TikTok.

kafka machine-learning nlp real-time-data spark streaming-data

Last synced: 14 Jan 2025

https://github.com/plave0/pp

Programming paradigms course material.

functional-programming haskell prolog python scala spark

Last synced: 07 Jan 2025

https://github.com/karimosman89/iot-predictive-maintenance

This repository will simulate an IoT-based predictive maintenance system designed to monitor industrial equipment through sensors. It will include data ingestion, processing, and machine learning components to predict potential failures, optimizing maintenance schedules and reducing downtime.

api cloud-platform dashboard data-collection data-processing deployment iot-platform predictive-analytics pressure-sensor real-time-sensor sensors spark temperature-sensor vibration

Last synced: 05 Jan 2025

https://github.com/manoharvit/ecommerce-dive-deep-sales-analysis

In this project, we developed an ETL pipeline using Apache Airflow to process delivery data and track delayed shipments. The pipeline downloads data from an AWS S3 bucket, cleans it using Spark/Spark SQL to identify missing delivery deadlines, and uploads the cleaned dataset back to S3. This ensures efficient delivery performance tracking.

airflow airflow-dags ecommerce elt pyspark s3 s3-bucket spark sql

Last synced: 14 Feb 2025

https://github.com/phelipe-sempreboni/programming

Repository for programming languages of various types.

css html java javascript python spark vba

Last synced: 14 Feb 2025

https://github.com/joekakone/get-started-with-pyspark

PySpark Tutorials for Beginners

apache-spark pyspark spark

Last synced: 14 Jan 2025

https://github.com/abdelmajidlh/spark_practices

Practical use cases for Apache Spark with Scala.

apach-spark apache machine-learning scala spark

Last synced: 27 Jan 2025

https://github.com/oguzhanfatihkucuk/data-analytics-project-kafka-spark

The data in this project was collected in a database using Apache Kafka and processed with Apache Spark Streaming. The project aims to create a forecasting model and analyze sales forecasts per customer.

big-data data data-visualization hadoop kafka ml mlpipeline plt pyhton spark

Last synced: 25 Dec 2024

https://github.com/asolimando/map-spark

Playing around with Map datatype in Spark

mapdatatype spark

Last synced: 01 Feb 2025

https://github.com/thanaraklee/pyspark-big-data-rdd-operations

This project illustrates Apache Spark RDD operations, from creation and transformation to actions and results, enhancing users' understanding of distributed data processing.

big-data pyspark python rdds spark

Last synced: 25 Dec 2024

https://github.com/leo-the-nardo/combopurifier

Data Pipeline from AWS SQS/S3 to Kubernetes w/ Spark using Airflow, EKS & Data Lakehouse

airflow argocd aws-glue-catalog aws-lake-formation aws-s3 aws-sqs data-lake delta-lake eks minio spark terraform

Last synced: 05 Jan 2025

https://github.com/bousettayounes/real-time-processing-of-users-data

Developing a data pipeline to stream user data from a user generator API, apply necessary transformations, and seamlessly insert the processed data into a storage system

airflow cassandra dataengineering datastreaming docker kafka postgresql spark streaming

Last synced: 05 Jan 2025

https://github.com/aelesbao/twitter-analyser

Analyses streams of tweets using Kafka and Apache Spark

kafka scala spark spark-streaming

Last synced: 05 Jan 2025

https://github.com/ranimbenmbarek/airplane-crash-data-streaming

The Airplane Crash Data Analysis & Visualization project uses Kafka and Spark for streaming analysis of airplane crashes from 1908 to 2009, enabling real-time insights into trends and patterns in aviation safety.

hbase hdfs kafka kaggle powerbi spark

Last synced: 03 Feb 2025

https://github.com/ranimbenmbarek/airplane-crash-batch-analysis

The Airplane Crash Data Analysis & Visualization project utilizes Spark for batch processing of historical airplane crashes from 1908 to 2009. This approach enables the analysis of trends in aviation safety by examining patterns in crash occurrences and fatalities within the dataset.

hbase hdfs kafka kaggle spark

Last synced: 03 Feb 2025

https://github.com/flynn3103/is405.m21-bigdata

Implement Streaming on Big Data using Kafka, HDFS, Spark, ..

hdfs kafka spark

Last synced: 21 Jan 2025

https://github.com/29dch/myscalacodesaboutbigdata

My Scala learning code with bigdata

actor akka kafka scala spark spark-sql spark-streaming

Last synced: 10 Jan 2025

https://github.com/varunu28/chicago-crime-data-analysis

An analysis of Chicago crime data using Apache Spark

scala spark

Last synced: 01 Jan 2025

https://github.com/yannbolliger/g5-parallel-sgd-spark

Group 5 project for Systems for Data Science course @ EPFL, 2019.

scala sgd sgd-svm spark

Last synced: 01 Jan 2025

https://github.com/shrikantnaidu/apache-spark

Spark and Spark ML notebooks

spark spark-ml spark-sql udacity

Last synced: 10 Jan 2025

https://github.com/ilieschibane/projet-iot-cloud-bigdata

Implementation of a pipeline for predicting Parkinson's disease using IoT, cloud, and big data tools

big-data cassandra cloud flask hadoop-hdfs iot kafka machine-learning mongodb mqtt python rest-api sickit-learn spark

Last synced: 21 Jan 2025

https://github.com/diiblo/us_election_project

This project analyzes data from the 2024 US elections with Spark, PostgreSQL, and Jupyter, in a master-worker architecture deployed via Docker.

docker jupiter-notebook postgresql-database pyspark spark

Last synced: 02 Feb 2025

https://github.com/twseptian/apache-pyspark-programming

Big Data Python programming using Apache Spark and PySpark

apache pyspark pyspark-mllib pyspark-notebook pyspark-tutorial spark

Last synced: 26 Dec 2024

https://github.com/amoghkori/working-with-apache-spark-mllib

Implemented Apache Spark MLLib to analyze a large car dataset, predict car selling prices, and gain insights into the car market.

amazon-web-services data-analysis data-visualization exploratory-data-analysis linear-regression machine-learning model-selection pyspark python random-forest sagemaker spark

Last synced: 23 Jan 2025

https://github.com/sabaudian/amd_market_basket_analysis

Algorithms for Massive Datasets (AMD) -- Market-baskets analysis project

frequent-itemsets mapreduce market-basket-analysis massive-datasets pyspark python python-3 spark

Last synced: 01 Jan 2025

https://github.com/sandrain/spider2-snapshot-anon

A spark script for processing (large-scale) file system snapshot data.

filesystem magpie spark workload-analysis

Last synced: 10 Jan 2025

https://github.com/giordano-lucas/movie-recommender

Movie Recommendation System (KNN) in Scala/Spark

recommander-system scala spark

Last synced: 02 Jan 2025

https://github.com/zemuldo/spark-tutorial

Jupyter notebooks for my Spark tutorials.

spark spark-sql

Last synced: 27 Jan 2025

https://github.com/amirhnajafiz-university/s7cc03

Third project of Cloud Computing course.

big-data hadoop hadoop-hdfs mapreduce python python3 spark

Last synced: 26 Dec 2024

https://github.com/plandes/docker-spark-service

Extends the docker-spark, which creates a Spark cluster in a docker image by making it a service.

docker docker-spark dockerfile spark

Last synced: 02 Jan 2025

https://github.com/naramsim/dynamic-twitter-geographical-categorization

A map-reduce implementation for the categorization of Twitter tweets within dynamic geographical boundaries.

redis spark twitter

Last synced: 26 Dec 2024

https://github.com/akhich551995/data-streaming-project-airflow-kafka-spark-t-cassandra-docker

Building a real-time data streaming pipeline, covering each phase from data ingestion to processing and finally storage. The project uses a stack of tools and technologies including Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra, all containerized using Docker.

airflow airflow-dags cassandra docker kafka postgresql python spark zookeeper

Last synced: 21 Jan 2025

https://github.com/saadkh1/real-time_sales_data_pipeline_kafa_spark_cassandra_redash

This repository implements a real-time sales data pipeline leveraging Apache Kafka, Apache Spark, Apache Cassandra, and Redash. It facilitates the efficient ingestion, processing, storage, and visualization of sales data streams.

cassandra fastapi kafka redash spark

Last synced: 21 Jan 2025

https://github.com/yandex-cloud-examples/yc-data-proc-spark-pyspark

Running and managing Spark and PySpark applications in the Yandex Data Proc service.

data-proc pyspark spark yandex-cloud yandexcloud

Last synced: 29 Dec 2024

https://github.com/sowrabh-m/data_processing_using_spark_flink

This project demonstrates data cleaning and processing with Apache Spark and Apache Flink, both locally and on AWS EMR.

aws aws-emr aws-s3 emr-cluster flink flink-stream-processing spark spark-flink spark-streaming

Last synced: 14 Feb 2025

https://github.com/jcguidry/flight-ml-preprocess-gcp

Continuous flight event data processing using Spark Streaming and Delta Lake storage, deployed on a GCP Dataproc cluster.

dataproc deltalake gcp spark spark-streaming

Last synced: 28 Dec 2024

https://github.com/tolgakmbl/sparkwithscalapractices

Using Spark with Scala to read from MongoDB and print documents to the console

mongodb scala spark

Last synced: 08 Feb 2025

https://github.com/rmodi6/theory-of-database-systems

Homework files for CSE532 - Theory of Database Systems

database-queries hadoop ibm-db2 jdbc map-reduce spark spatial-database sql xpath xquery

Last synced: 11 Jan 2025

https://github.com/ac-gomes/systemctl_spark_jupyter-notebook

systemctl for Spark and Jupyter-notebook

jupyter-notebook spark systemctl systemd

Last synced: 02 Jan 2025

https://github.com/coherent-partners/spark-service-promotion

Promotion of Spark services from one tenant to another.

ci-cd coherent promotion service spark

Last synced: 25 Dec 2024

https://github.com/cwienberg/spark-async-map

Helper library for running blocking IO operations in Spark jobs more efficiently

scala spark

Last synced: 02 Feb 2025

https://github.com/oscarfmdc/spark-flight-delay

A model capable of predicting the arrival delay time of a commercial flight, given a set of parameters known at time of take-off.

scala spark

Last synced: 01 Feb 2025

https://github.com/elaaatif/jpeg-and-jpeg2000-compression-on-multi-node-cluster-using-hadoop-and-spark

Big Data technologies can be leveraged for efficient, distributed image compression using JPEG2000 (Spark) and JPEG (MapReduce).

cluster hadoop hadoop-cluster hadoop-hdfs hadoop-mapreduce image-compression spark

Last synced: 08 Feb 2025

https://github.com/pandabear-neil/microsoft_fabric_gems

Code snippets, designs, and thoughts on the Microsoft Fabric platform

data-engineering data-factory data-science data-warehouse microsoft-fabric spark

Last synced: 08 Feb 2025

https://github.com/asolimando/trap2017spark

Analysis of TRAP2017 dataset using Spark

artificial-intelligence graph-analysis machine-learning spark

Last synced: 01 Feb 2025