Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/dmschauer/wap-pattern-iceberg-pyspark-aws-glue

This repository shows how to implement the Write-Audit-Publish (WAP) pattern using Apache Spark and Apache Iceberg. It's aimed at Data Engineers who want to get started quickly.

apache-iceberg apache-spark aws aws-glue iceberg pyspark spark

Last synced: 31 Dec 2024

https://github.com/teo-sl/us_flights_analysis

This repository contains a dashboard to visualize the US flights data and notebooks for some ML tasks on the same data

big-data classification dash dashboard flights machine-learning plotly regression spark usa

Last synced: 16 Jan 2025

https://github.com/teo-sl/denver-crimes-dash

This is a simple dashboard made with Dash and Plotly (for the frontend) and Apache Spark (for the backend)

dash denver plotly spark

Last synced: 16 Jan 2025

https://github.com/ankushkhanna/spark-common

Spark Commons, some hacks to simplify programming with Spark.

spark transfomer

Last synced: 17 Jan 2025

https://github.com/alexdyysp/sparkscala

learning notebook

spark

Last synced: 19 Jan 2025

https://github.com/luismanuelamengual/neogroup-sparks

Great Server framework with MVC oriented structure

framework mvc-pattern spark

Last synced: 22 Jan 2025

https://github.com/gabrieltempass/sparkify-churn-prediction

A model to predict churn for a music streaming company, with Spark running on an AWS EMR cluster.

aws big-data spark

Last synced: 06 Feb 2025

https://github.com/nwtgck/wikipedia-word2vec-playground-spark

A playground of word2vec from Wikipedia Dump with Spark

scala spark word2vec

Last synced: 06 Feb 2025

https://github.com/moriyoshi/dummydf

Emulates PySpark DataFrame API by Pandas

dummy pandas pyspark spark testing

Last synced: 22 Jan 2025

https://github.com/grihabor/spark-metrics-otel-collector

Example spark metrics configuration with opentelemetry collector

metrics opentelemetry opentelemetry-collector spark

Last synced: 13 Feb 2025

https://github.com/nwtgck/spark-wikipedia-dump-loader-example

An example of spark-wikipedia-dump-loader

example scala spark wikipedia-dump

Last synced: 06 Feb 2025

https://github.com/vitalibo/distributed-alarm-system

Simple distributed alarm system on top of Apache Spark

aws azure spark

Last synced: 27 Dec 2024

https://github.com/datawaver/emre-airflow

Use Airflow to create and run Spark Jobs with an EMRE Spark cluster

airflow aws aws-emr docker spark

Last synced: 17 Jan 2025

https://github.com/xpcosmos/jaffle-shop

Modern Data Stack with DBT, PySpark, PostgreSQL and Docker

dbt docker docker-compose pyspark python spark

Last synced: 17 Jan 2025

https://github.com/alchemine/realtime-trend-pipeline

Data pipeline for collecting and analyzing real-time trending search terms

airflow docker hadoop hive kafka python selenium spark

Last synced: 16 Jan 2025

https://github.com/vitalibo/aws-glue-java

Simple PoC that demonstrates the use of Java in AWS Glue ETL pipelines.

aws glue spark

Last synced: 27 Dec 2024

https://github.com/lorenzobloise/motion_insights

Application for real-time big data analysis from a Body Sensor Network, developed using Spark in Scala and Kafka

angular chartjs kafka real-time scala spark spark-sql spark-streaming

Last synced: 17 Jan 2025

https://github.com/vigneshss-07/bigdata_technologies

This repo contains all technical knowledge and implementation of big data technologies.

big-data hadoop hadoop-hdfs hbase hive hive-metastore kafka mapreduce-python pyspark spark sparksql

Last synced: 16 Jan 2025

https://github.com/springworks/node-scale-reader

Application that reads value off a scale from a local file stream

spark

Last synced: 22 Jan 2025

https://github.com/javadbahoosh/spark-streaming-multi-language-docker

Dockerized infrastructure and boilerplate code for consuming Kafka topics with Spark Streaming in Scala, Python, and Java, featuring Redis integration for result aggregation.

docker kafka spark

Last synced: 30 Jan 2025

https://github.com/hatamiarash7/kubernetes-spark

Deploy Apache Spark cluster in Kubernetes

apache apache-spark kubernetes spark

Last synced: 03 Feb 2025

https://github.com/bst-depractice/spark_play

Setup and practice pyspark transformations

spark

Last synced: 22 Jan 2025

https://github.com/francesco-biscaccia-carrara/bigdata_projects

Assignment repository for the Big Data Computing course at the University of Padova for the academic year 2023-2024.

big-data k-center-problem map-reduce reservoir-sampling spark spark-streaming sticky-sampling

Last synced: 22 Jan 2025

https://github.com/non-neutralzero/spark-feature-engineering-toolkit

Snippets of spark/scala code used to do some handy feature engineering

data-engineering feature-engineering feature-extraction scala spark spark-sql

Last synced: 22 Jan 2025

https://github.com/dutrevis/spark-resources-metrics-plugin

Spark plugin to retrieve metrics from a variety of cluster resources

scala spark

Last synced: 30 Dec 2024

https://github.com/erikgartner/prometheus-cc-extractor

This repository contains mapreduce extractors to preprocess and extract websites from the common crawl corpus.

big-data common-crawl data-extraction mapreduce spark

Last synced: 06 Feb 2025

https://github.com/hafizhhasyhari/Big-Data_AI_Streaming-Data-Visualization-2024

Big Data with Spark and Scala, by hafizhhasyhari. Coursework from semester 3.

big-data data-analytic data-besar data-science scala spark

Last synced: 08 Feb 2025

https://github.com/safaa-p/machine-failure-prediction

Predicting Machine failure using Machine learning on a synthetic dataset of an existing milling machine consisting of 10,000 data points

big-data classification clustering decision-tree machine-failure machine-learning neural-network pyspark-mllib spark spark-sql svm-classifier unbalanced-data

Last synced: 16 Jan 2025

https://github.com/lucivpav/bachelors-thesis

Source code of my Bachelor's thesis that was made at CTU FIT.

ctu-fit dnbc fit-ctu hmm mle scala spark thesis

Last synced: 02 Feb 2025

https://github.com/roaajadaa/sparksearchengine

A small-scale Spark-based search engine that searches a list of documents to find those answering a user's query.

bigdata indexing inverted-index mongodb scala search-engine spark

Last synced: 13 Feb 2025

https://github.com/jonny-binns/fase-practicals

Contains the practice work and code from lectures in the Formal Approaches to Software Engineering Module

ada ada-spark spark

Last synced: 18 Jan 2025

https://github.com/seilylook/spark_definition_guide_ch_3

Spark: The Definitive Guide - Chapter 3

spark structured-streaming

Last synced: 16 Jan 2025

https://github.com/billxsheng/oubre-sentiment-analysis

Complete data platform that performs sentiment analysis on tweets. Built using Cassandra, Kafka, Spark, Node, and React.

cassandra etl-pipeline java kafka nodejs sentiment-analysis spark twitter-api

Last synced: 17 Jan 2025

https://github.com/coreyauger/spark-ashley-madison-ml

Spark ML hacking on the Ashley Madison dataset.

scala spark

Last synced: 16 Jan 2025

https://github.com/dev88jerry/cs450

Bishop's University - CS450 Elements of Big Data

big-data data-science hadoop spark

Last synced: 08 Jan 2025

https://github.com/spektom/data-formats-samples

Spark-based sample generator for different data formats

avro json orc parquet spark

Last synced: 20 Jan 2025

https://github.com/mustafahakkoz/preprocessing_w_spark

Preprocessing + Feature Extraction pipeline with Spark and Neo4j

cypher feature-engineering mongodb neo4j nlp spark

Last synced: 28 Jan 2025

https://github.com/brinthat/world-development-indicators

Exploring World Development Indicators: Identifying relationship between Health Indicators using Linear Regression & Classification of Income Group based on Health Indicators using Logistic Regression.

lasso-regularization linear-regression python ridge-regression-model scatter-plot spark spark-sql

Last synced: 17 Jan 2025

https://github.com/ssanthosh010303/collection-data-training

A collection of challenges completed during a data training program.

airflow apache azure azure-data-factory azure-databricks azure-logic-apps bigdata data hadoop spark

Last synced: 17 Jan 2025

https://github.com/majobasgall/bash_scripts_potpourri

A little bit of everything: cleanup Arch-based systems, extraction, system info, git config, project structuring, spark parameters, data backup, webcam control, and more!

archlinux bash-script data-science github-config manjaro-linux spark

Last synced: 08 Jan 2025

https://github.com/sharoonjoseph321/insurance_fraud_detection

Fraud detection using the k-nearest neighbors (KNN) machine learning algorithm. Data exploration using PySpark and matplotlib.

analytics data data-science eda high-performance knn-algorithm knn-classification machine-learning matplotlib-pyplot pyspark python seaborn spark statistics

Last synced: 28 Jan 2025

https://github.com/holdenk/luigi-rewrite-rug

WIP Proof of Concept Luigi Pipeline Rewriting for testing using rug

atomist luigi rug spark

Last synced: 31 Jan 2025

https://github.com/asora6/java-8-features

Welcome to the Java 8 Features Repository! This repository highlights key features of Java 8, such as Lambda Expressions, Functional Interfaces, Method References, Constructor References, Stream API, and the Local Date and Time API. It includes notes and programs for practical understanding. Explore and enhance your knowledge of Java 8 features!

functional-interfaces gradle-plugin java java-date java7 java8-examples javascript jdk8 lambda-expressions lambda-functions nashorn optional spark streams-api

Last synced: 28 Jan 2025

https://github.com/rickymiura/slack-posts-eda

In this repository, I perform EDA on a large dataset of Slack posts using Apache Spark and AWS to efficiently uncover trends and insights at scale.

big-data distributed-computing spark

Last synced: 28 Jan 2025

https://github.com/abinba/pe-analyzer

Preprocessor of PE (Portable Executable) files (dll, exe) using Spark.

pefile portable-executable postgres pyspark python spark

Last synced: 17 Jan 2025

https://github.com/simplexspatial/simplexspatial-data-distribution-analysis

Analysis of the data distribution of the OSM dataset.

osm4scala scala spark

Last synced: 15 Jan 2025

https://github.com/sai-mohan-b/spark-structured-streaming

This repo is for Structured Streaming and related projects

pyspark-notebook spark spark-streaming

Last synced: 09 Jan 2025

https://github.com/dnyfzr/reactor

⚡ Spark data processing tools

devops python spark sql

Last synced: 08 Jan 2025

https://github.com/dina-hosny/data-engineering-capstone-project

Data Engineering Capstone Project - Udacity Data Engineering Expert Track.

analytics cassandra data-engineering data-pipelines data-science etl fwd spark udacity

Last synced: 13 Jan 2025

https://github.com/maxyermayank/spark-redis-python

Batch writes into Redis using Python and Spark

aws aws-memory-db bulk-loader memorydb redis redis-client redis-py spark

Last synced: 17 Jan 2025

https://github.com/mxagar/data_engineering_guide

Personal notes on the IBM Data Engineering Certificate as well as other sources focusing on AWS.

airflow aws data-lake data-modeling data-pipelines data-science no-sql spark sql warehouse

Last synced: 23 Dec 2024

https://github.com/sanogotech/docker-airflowsparkkafkadata-engineeringend-to-end

Docker Apache Airflow Data Engineering End-to-End Project — Spark, Kafka, Airflow, Docker, Cassandra, Python

airflow cassandra cassandra-database dataengineering docker kafka python spark

Last synced: 23 Jan 2025

https://github.com/johngodoi/scalasparkkafka

This code just loads data to kafka through apache spark and reads it back.

docker docker-compose kafka spark spark-kafka spark-sql spark-streaming

Last synced: 06 Feb 2025

https://github.com/santiagortiiz/platzi-aws-bigdata

Platzi. School of Amazon Web Services. Big Data in AWS.

apache aws aws-glue big-data etl pipelines platzi redshift spark zeppelin

Last synced: 08 Jan 2025

https://github.com/zkan/dtc-data-engineering-zoomcamp

DataTalks.Club's Data Engineering Zoomcamp

airflow data-engineering dbt docker kafka spark

Last synced: 12 Feb 2025

https://github.com/tupol/spark-xkmeans

Extension to the standard K-Means implementation of Spark ML library

clustering kmeans kmeans-clustering library machine-learning scala spark

Last synced: 17 Jan 2025

https://github.com/cvinicius987/projetos-bigdata

Case studies involving Big Data and Data Engineering projects.

bigdata data data-engineering spark

Last synced: 12 Jan 2025

https://github.com/kajal-52/spark-with-scala-learning

This project covers examples about RDD, dataframe, and dataset API with Apache Spark using Scala.

scala spark spark-ml

Last synced: 28 Jan 2025

https://github.com/dirmeier/spark-travis

Testing Apache Spark using Travis.

apache-spark spark travis

Last synced: 17 Jan 2025

https://github.com/andrearettaroli/simulated-transactions-big-data

The goal of this notebook is to analyze and extract useful information from the Kaggle simulated-transactions dataset

emr notebook scala spark tableau

Last synced: 04 Jan 2025

https://github.com/hrolive/patc-big-data-analytics-bsc

Introduction to the main concepts and technologies related to Big Data and Data Analytics and its applications to real projects.

analytics bias big-data data-analysis hadoop hpc machine-learning mapreduce nosql python spark spark-streaming visualization

Last synced: 04 Jan 2025

https://github.com/baptvit/big_data

My courses and activities in Big Data

big-data hadoop hbase hive kafka mapreduce oozie pig python3 scala spark zookeeper

Last synced: 15 Jan 2025

https://github.com/f-lab-edu/commerce-sessionization

Building a Spark-Airflow pipeline for sessionizing user behavior data

airflow scala spark

Last synced: 01 Feb 2025

https://github.com/chen0040/spark-tabular-analytics

Spark statistical inference framework for performing column pair-wise data analytics for large data table

anova chi-square-test confidence-intervals data-analysis hypothesis-testing spark statistical-inference tabular-data

Last synced: 09 Feb 2025

https://github.com/swarup4741/spark

A CLI to bootstrap a vanilla js project

cli javascript spark vanillajs

Last synced: 08 Jan 2025

https://github.com/samuelbarbosadev/justweb_technical_test

This is a technical test for the Mid-Level Python Developer position.

django python spark spark-sql

Last synced: 27 Jan 2025

https://github.com/timvisee/hhs-p7-spark-docker

🐳 Docker container for Spark at college (HHS).

college docker docker-container jupyter-notebook pyspark spark

Last synced: 15 Jan 2025

https://github.com/pixelbyaj/apache-spark

Start Apache Spark with Python - pyspark

apache-spark pyspark-python python spark winutils

Last synced: 17 Jan 2025

https://github.com/dexterposh/azurehdinsight

Repository housing the artifacts to deploy the Hadoop clusters on Azure for my learning.

azure hadoop hdinsight-cluster learning-by-doing spark

Last synced: 04 Jan 2025

https://github.com/chen0040/spark-ml-commons

Package provides common utility for spark ml

spark spark-ml

Last synced: 09 Feb 2025

https://github.com/robertdavidwest/blog-post-analysis

Analysing blog post data using Spark/Scala

scala spark

Last synced: 15 Jan 2025

https://github.com/sebastianruizm/cca175-exam-preparation

Backup of my preparation for Cloudera's CCA175 exam

hdfs mysql python spark sqoop

Last synced: 08 Jan 2025

https://github.com/chimera-suite/test-spark-datatypes

Test OntopSpark's OBDA datatype conversions to be compliant with W3C standards

obda spark test

Last synced: 03 Jan 2025

https://github.com/ammahmoudi/mapreduce-examples

MapReduce examples using pure Scala and then using Spark

map-reduce mapreduce scala spark spark-mapreduce

Last synced: 15 Jan 2025

https://github.com/mineshmelvin/aws-glue-scala-accident-analysis

This guide outlines procedures for developing Apache Spark jobs in Scala for AWS Glue deployment. It covers setting up environment variables, installing IntelliJ IDEA with the Scala plugin, and creating a Scala Maven project that serves as a starting point for developers using Spark in Glue for scalable data processing applications.

aws glue mysql s3 scala spark

Last synced: 09 Feb 2025

https://github.com/shuuji3/spark-ceph-connector

🌟Spark Ceph Connector: Implementation of Hadoop Filesystem API for Ceph

apache-hadoop apache-spark ceph hadoop spark

Last synced: 27 Jan 2025

https://github.com/jofaval/mnist

Computer Vision with Neural Networks for handwritten digits recognition in 1998

classification computer-vision data-science deep-learning deep-neural-networks google-colab keras matplotlib pyspark python pytorch seaborn spark tensorflow

Last synced: 04 Feb 2025

https://github.com/armahdavi/big_data_spark_building_iot_sensor_ieq_analytics_ml_thermal_comfort_murb_retrofit_social_housing

This repository summarizes my analytics, big data, and ML code work from a Multi-Unit Residential Building (MURB) retrofit project run back during my Ph.D.,

matplotlib-pyplot pandas polars psypy pyspark pythermalcomfort python seaborn sklearn spark sparksql stata

Last synced: 13 Feb 2025

https://github.com/captainirs/hadoop-yarn-k8s

A sandbox for running a Hadoop-YARN cluster on Kubernetes

hadoop kubernetes spark yarn

Last synced: 11 Jan 2025

https://github.com/yosrak5/data-streaming

This project involves the development of a robust data engineering pipeline that orchestrates the seamless ingestion, processing, and storage of data.

airflow-dags apache cassandra docker etl kafka python spark

Last synced: 11 Dec 2024

https://github.com/iampavangandhi/sparkfoundationtask

Spark Foundation Basic UI Task

css3 html5 spark

Last synced: 17 Jan 2025

https://github.com/offthetab/vkapi-ml-dataharvester

Pipeline to harvest data via VK API for ML analysis with hadoop and spark

hadoop hdfs hive linux mariadb python requests spark sqoop

Last synced: 30 Dec 2024

https://github.com/ayresgneto/use-case-gcp-etl

ELT pipeline on GCP. Technologies used: PostgreSQL, GCP Storage, Airflow (local), PySpark (local), BigQuery

airflow big-data bigquery data data-engineering etl gcp pipeline postgresql programming-oriented-object pyspark python spark

Last synced: 21 Jan 2025