Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/melezhik/sparrowdo-spark

Quick Spark Installer for CentOS and Docker

centos spark sparrowdo

Last synced: 23 Dec 2024

https://github.com/ev2900/emr_studio_stock_price_demo

Demo EMR Studio notebook using PySpark to explore Stock Price Data

aws emr emr-studio spark

Last synced: 23 Dec 2024

https://github.com/ev2900/glue_spark_history_server

Host a Docker container for the Spark history server / Spark UI of AWS Glue jobs

aws glue spark spark-history-server spark-ui

Last synced: 23 Dec 2024

https://github.com/manuparra/clustering-openstack

Make a dynamic and customizable cluster with OpenStack

cluster deployment hadoop openstack openstack-command script slave-nodes spark

Last synced: 27 Dec 2024

https://github.com/thdaraujo/cheat

A handful of cheatsheets and programming tips.

bash cheat-sheets cheatsheet dms hadoop postgresql spark sqoop

Last synced: 24 Jan 2025

https://github.com/tianzhipeng-git/wdsdatasource

WdsDataSource is a Spark data source implementation that allows reading and writing data in WebDataset format

spark webdataset-format

Last synced: 21 Jan 2025

https://github.com/chrispyl/learning-latent-representations-for-nitrogen-response-rate-prediction

Implementation for the paper 'Learning latent representations for operational nitrogen response rate prediction'

neural-networks python spark

Last synced: 17 Jan 2025

https://github.com/hwywl/bigdata

Big data learning code: Spark, Hive, Storm, HBase

big-data flume hbase hdfs hive mr spark storm zook

Last synced: 08 Jan 2025

https://github.com/iversonson/spark-lite-document-translator

This project aims to provide a fast and efficient document translation solution using Spark Lite's machine learning APIs

spark translation

Last synced: 17 Jan 2025

https://github.com/positlabs/spark-picker-animations

Animated Native UI Picker Icons in Spark AR

augmented-reality instagram spark spark-ar

Last synced: 02 Feb 2025

https://github.com/nthaihoc/segmentation-customer-hadoop-spark-mlops-icta-2024

An automated machine-learning-based customer segmentation model with RFM analysis (ICTA 2024 conference)

dbscan-clustering-algorithm dvc-pipeline feature-engineering hadoop k-means-clustering machine-learning mlops-workflow spark

Last synced: 21 Jan 2025

https://github.com/oracle-quickstart/oci-hortonworks

Terraform module to deploy Hortonworks on Oracle Cloud Infrastructure (OCI)

cloud hadoop hdf hdp hortonworks oci oracle partner-led spark terraform

Last synced: 07 Nov 2024

https://github.com/vicnesterenko/apache-spark-labs

Base programs with datasets

apache-spark kpi-fict kpi-ua spark

Last synced: 10 Jan 2025

https://github.com/rockfordwei/anagram

Anagram Solution Servers in Different Languages/Frameworks

anagram hdfs java javascript php python server spark swift

Last synced: 12 Jan 2025

https://github.com/pierrekieffer/sparkstreaming_kafkaconsumer

Kafka consumer example based on Spark Streaming, with messages formatted into a Spark DataFrame

kafka kafka-consumer scala spark spark-streaming

Last synced: 07 Feb 2025

https://github.com/darenr/spark-pca

Dimensionality reduction, scatter, hexbin, and KDE plots

pca python spark

Last synced: 05 Feb 2025

https://github.com/hsm207/demo-spark-weaviate

How to set up a dev environment to work with Spark and Weaviate

big-data etl kafka python spark weaviate

Last synced: 14 Jan 2025

https://github.com/silvanheller/parquet-demo

Parquet demo project for the Workshop in the Course DIS. Benchmarks Parquet versus ORC, JSON and CSV

benchmark orc parquet r scala spark university-project

Last synced: 27 Jan 2025

https://github.com/kuro337/scalamono

Scala Monorepo Tooling for Kafka, Opensearch, Spark, Redpanda, Hadoop - and Lang Reference.

data database duckdb hadoop kafka redpanda sdala spark

Last synced: 14 Jan 2025

https://github.com/dunnkers/pyspark-bucketmap

Easily group pyspark data into buckets and map them to different values.

bucketizer categorizer pyspark pyspark-mllib python python3 spark

Last synced: 29 Jan 2025

https://github.com/vietdoo/sg-property-hub

SG Property Hub is a comprehensive platform for managing and analyzing property data.

airflow celery-redis crawler etl etl-pipeline fastapi minio mongodb nextjs postgresql s3 spark webscraping

Last synced: 07 Feb 2025

https://github.com/ronaldkanyepi/log-realtime-analysis

A scalable architecture for real-time log processing and visualization. Built with a Kafka-Spark ETL pipeline, DynamoDB for storing aggregate real-time metrics, and Python Dash for interactive dashboards. Designed for high-throughput log ingestion, real-time monitoring, and long-term storage.

dash docker docker-compose docker-container dynamodb etl etl-pipeline hdfs kafka kafka-consumer kafka-producer kafka-streams kafka-topic logs python realtime spark spark-streaming streaming visualization

Last synced: 25 Dec 2024

https://github.com/zncdatadev/spark-k8s-operator

Operator for Apache Spark-on-Kubernetes of the Kubernetes Data Stack

k8s kubernetes spark

Last synced: 19 Nov 2024

https://github.com/mjngxwnj/olympics_data_project

A personal project that builds an end-to-end data pipeline using the 2024 Olympics data.

airflow docker hadoop python snowflake spark superset

Last synced: 09 Feb 2025

https://github.com/hpgrahsl/gab2016streamanalytics

Repository with materials for my Session at Global Azure Bootcamp 2016

azure bootcamp spark storm streamanalytics

Last synced: 08 Jan 2025

https://github.com/izeigerman/twinkle

A collection of helpers and utilities for Apache Spark

apache-spark scala spark

Last synced: 08 Feb 2025

https://github.com/fsanaulla/spark-http-rdd

RDD primitive for fetching data from an HTTP source

scala spark

Last synced: 12 Oct 2024

https://github.com/fanqingsong/machine_learning_system_on_spark

A simple machine learning system demo (clustering and prediction on the iris data) for ML study. Based on the machine_learning_system repo, it adds a new process for serving the ML model with Celery and Spark.

celery django machine-learning reactjs spark

Last synced: 21 Dec 2024

https://github.com/ferranbt/sparkanywhere

Run Apache Spark multicloud and serverless

kubernetes serverless spark

Last synced: 01 Jan 2025

https://github.com/exasol/spark-connector-common-java

Common library for Exasol Apache Spark based connectors

apache-spark exasol exasol-integration spark streaming

Last synced: 09 Feb 2025

https://github.com/782e616c6d/covid-d.a

Academic project, using Apache Spark for ETL and Data Studio for data analysis.

academic analytics automation cluster covid-19 data database etl python spark sql

Last synced: 26 Jan 2025

https://github.com/tuancamtbtx/spark-build-tool

A tool for generating Spark jobs

java k8s spark

Last synced: 11 Oct 2024

https://github.com/giuliosmall/twitter-trending-topics-pipeline

This project demonstrates trending topic detection using Apache Spark and MinIO. It processes Twitter JSON data with PySpark, leveraging distributed data processing and cloud storage. The entire project is containerized with Docker for easy deployment across architectures.

docker minio nlp pyspark pytest spacy spark streamlit

Last synced: 05 Feb 2025

https://github.com/alimarzouk/paris-aq

ELTL pipeline to monitor air quality in the Paris Île-de-France area

airflow airquality big-data bigquery dataengineering gcs spark

Last synced: 22 Jan 2025

https://github.com/vermicida/data-lake

Data Lake: the code for project #4 of Udacity's Data Engineer Nanodegree Program

aws-s3 data-engineering data-lake etl-pipeline python spark

Last synced: 26 Dec 2024

https://github.com/arun-george-zachariah/twitteranalytics

Web application that visualizes interesting analytic Spark SQL queries executed on tweets about five well-known brands: Adidas, Nike, Puma, Skechers, and Reebok.

analytics distributed-computing docker spark twitter

Last synced: 26 Dec 2024

https://github.com/same-ou/spark-hdfs-ml

Spark and HDFS cluster using Docker and Docker Compose

hdfs ml spark

Last synced: 25 Dec 2024

https://github.com/fbraza/data-processing-scala-spark

Scala code that uses Spark to process a log data file. The full procedure for running the application is in the README.md file.

scala spark

Last synced: 26 Jan 2025

https://github.com/manojpawar94/spark-scala-examples

Sample programs implemented with Apache Spark. The programs are built on the concepts of Spark RDD and Spark SQL DataFrame.

apache-spark spark spark-rdd spark-sql

Last synced: 13 Jan 2025

https://github.com/crazybber/go-jupyter

Exploring big data with Spark in JupyterLab

bigdata jupyter-notebook jupyterlab rdd spark

Last synced: 28 Jan 2025

https://github.com/tuancamtbtx/etl-spark-k8s

ETL With Apache Spark Deployed on K8s

apache k8s spark spark-sql spark-streaming

Last synced: 02 Jan 2025

https://github.com/tuancamtbtx/python-spark-example

Spark template to submit to cluster

python spark

Last synced: 02 Jan 2025

https://github.com/tuancamtbtx/bigdata-spark-processing

Spark Batch Process

spark

Last synced: 02 Jan 2025

https://github.com/peteprattis/road-safety-database-with-jdbc-and-spark-rdd

A JDBC application that runs queries in pgAdmin to simulate the functionality of the UK Ministry of Transport's database, using Apache Spark RDD for query implementation.

computer-science index java jdbc jdbc-database partitions pgadmin postgresql program query spark spark-sql sparkjava sql student

Last synced: 18 Jan 2025

https://github.com/peteprattis/insurance-company-database-with-jdbc-and-spark-rdd

A JDBC application that runs queries in pgAdmin to simulate the functionality of an insurance company's database, using Apache Spark RDD for query implementation.

computer-science java jdbc jdbc-database partitioning partitions postgresql program query spark spark-sql sparkjava sql student

Last synced: 18 Jan 2025

https://github.com/javaidiqbal11/arabic-tweets-sentiment-analysis-using-spark

Sentiment analysis of an Arabic Twitter dataset using Apache Spark.

apache-spark arabic-nlp arabic-tweets flask pyhton3 sentiment-analysis spark twitter-api

Last synced: 03 Jan 2025

https://github.com/hexnn/balm

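A big data PaaS component adapter built on the Spring Boot stack. It adapts to DolphinScheduler, Hadoop, Spark, Hive, Impala, HBase, Kafka, StarRocks, ClickHouse, Neo4j, Redis, and ElasticSearch in one step, is operated through a standard REST interface and SQL statements, is simple to use, and is convenient for secondary development and rapid integration.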

clickhouse datax dolphinscheduler elasticsearch hadoop hbase hive impala kafka maxcompute neo4j phoenix presto spark starrocks

Last synced: 21 Dec 2024

https://github.com/fpopic/hf-interview-challenge

(Interview) Mixing Data Engineering & Data Science with PySpark

data-engineering data-science pyspark python recipes spark

Last synced: 10 Jan 2025

https://github.com/scrapcodes/kafkaproducer

Benchmarks to measure latency using Spark and Kafka.

benchmark kafka spark

Last synced: 03 Jan 2025

https://github.com/abdelmajidlh/spark_ml_weather

Scala and Spark learning project: predicting tomorrow's rain from historical data

pom scala spark spark-ml spark-sql

Last synced: 27 Jan 2025

https://github.com/f-lab-edu/league-of-legends-data-solution

Benchmarking 'League of Legends', this project provides a data solution that keeps data flowing smoothly in real time through an API that generates player behavior events.

airflow dataengineering spark

Last synced: 11 Oct 2024

https://github.com/rishav273/spark-cluster-multi-node-setup

Quickly set up and simulate a multi-node Spark cluster using Docker and docker-compose.

docker docker-compose pyspark python3 spark

Last synced: 11 Oct 2024

https://github.com/scrapcodes/spark-templates

One-stop shop for Apache Spark starter samples.

apache samples spark

Last synced: 03 Jan 2025

https://github.com/cleberzumba/data-analysis-with-apache-spark-and-databricks

San Francisco Fire Calls: a Spark application on Databricks that uses PySpark and SQL for common data analytics patterns and operations on a San Francisco Fire Department Calls dataset.

databricks pyspark spark sql

Last synced: 16 Nov 2024

https://github.com/sankamuk/aws-kinesis-redshift-sparkstream

Spark Structured Streaming from AWS Kinesis and Redshift

aws kinesis pyspark redshift spark structured-streaming terraform

Last synced: 13 Jan 2025

https://github.com/chimera-suite/spark-sidecar-setup

The sidecar setup container executes SparkSQL scripts against an Apache Spark instance.

docker setup sidecar-container spark sparksql

Last synced: 03 Jan 2025

https://github.com/chimera-suite/use-case

A step-by-step tutorial that showcases the capabilities of Chimera

chimera jena-fuseki knowledge-graph ontology pizza spark sparql-query

Last synced: 03 Jan 2025

https://github.com/nikoshet/pyspark-movie-similarities

Using Spark In Python For Movie Similarities With Jaccard Index

jaccard-index movie-similarities pyspark spark

Last synced: 03 Jan 2025

https://github.com/nikoshet/spark-mlp

Multilayer Perceptron Implementation Using Spark

hdfs machine-learning mapreduce multilayer-perceptron pyspark python spark

Last synced: 03 Jan 2025

https://github.com/lajwithsingh/magelocaldatapipeline

A compact project showcasing local data lake setup using Docker, Mage, Spark, MinIO, Iceberg, and StarRocks. Ideal for learning modern data engineering practices.

docker iceberg mage minio spark starrocks

Last synced: 29 Dec 2024

https://github.com/andreoss/spark-tabs

Custom tabs for Spark UI

scala spark

Last synced: 10 Feb 2025

https://github.com/pierrekieffer/genericsupervisedmachinelearning

Generic supervised machine learning application

machine-learning spark

Last synced: 07 Feb 2025

https://github.com/akaliutau/spark-recipes

A collection of data processing solutions built on top of Spark

java spark

Last synced: 11 Jan 2025

https://github.com/marcorfilacarreras/matemaquest

A simple API to get information about the "Pruebas Canguro" exams

api docker github-actions java math mathematics spark

Last synced: 13 Jan 2025

https://github.com/azlinrusnan/movielens_data_analysis_with_mongodb_and_cassandra

This project presents an analysis of the MovieLens 100k dataset using Apache Spark integrated with MongoDB and Cassandra. The dataset includes user information, movie ratings, and movie details, providing a comprehensive basis for exploring user preferences and movie popularity.

cassandra ml-100k mongodb python spark

Last synced: 17 Jan 2025

https://github.com/drsnowbird/nlp-deeplearning-projects

NLP Deep Learning Projects (Warning - Not ready for public consumption yet!)

chatbot deep-learning mallet nlp python3 rasa-core rasa-nlu spark tensorflow

Last synced: 13 Jan 2025

https://github.com/shayartt/streaming-orders

Project to stream real-time orders and apply ETL pipelines and analytics using Databricks, Kafka, and AWS

databricks etl kafka python spark spark-streaming

Last synced: 12 Oct 2024

https://github.com/divithraju/divith-raju-data-mining

This project focuses on customer segmentation using data mining techniques, specifically K-Means clustering, to classify customers into distinct groups based on their purchasing behaviors. The goal is to analyze customer data and segment them into clusters for targeted marketing strategies and better customer relationship management.

algorthims analytics apache business client connector data dataarchitecture database dataengineering datamining datascience hadoop k-means-clustering mysql project project-repository pyspark python3 spark

Last synced: 17 Jan 2025

https://github.com/tsovak/spark-demo

The Spark REST API with Spring Boot and MongoDB

docker-compose mongodb rest-api spark sparkjava sparkrest spring-boot

Last synced: 08 Feb 2025

https://github.com/azurespheredev/microsoftfabric-exploratorium

A comprehensive educational resource hub dedicated to mastering Microsoft Fabric, offering in-depth tutorials, real-world use cases, and hands-on guides for seamless end-to-end analytics

analytics data-science data-transformation lakehouse microsoft-fabric one-lake powerbi real-time-analytics spark warehouse

Last synced: 11 Jan 2025

https://github.com/bishalpaudel/sparkhbaseloganalyzer

Spark- and HBase-based Apache access log analyzer

big-data cloudera hbase scala spark

Last synced: 06 Jan 2025

https://github.com/vasnake/artefacts-2019_2023

Collection of some interesting pieces of my projects. Spark, Scala, Python, sh

catalyst etl ml scala spark udaf udf

Last synced: 17 Jan 2025

https://github.com/stabrise/scaledp-tutorials

Tutorials for ScaleDP library. ScaleDP is an Open-Source Library for Processing Documents in Apache Spark.

ner nlp ocr ocr-python pdf spark

Last synced: 30 Jan 2025

https://github.com/tallamjr/jetspark

Spark cluster on Jetson TX2 mini-project

gpu nvidia spark tx2-jetpack

Last synced: 10 Feb 2025

https://github.com/aldantanneo/bigints

WIP constant time bigint implementation in SPARK

ada bigint cryptography formal-verification spark

Last synced: 30 Jan 2025

https://github.com/codelytv/spark-kafka_rabbitmq_sqs-course

Course examples for integrating Spark with queue systems

apache-spark aws-sqs kafka rabbitmq spark

Last synced: 30 Jan 2025

https://github.com/mahi97/internship-elk-loganalysis

Report on the development and deployment of an ELK Stack for MCI BI software and servers to perform real-time log analysis

elasticsearch kafka kibana latex logstash mesos redis spark

Last synced: 05 Feb 2025

https://github.com/alexott/cyber-spark-data-connectors

Cybersecurity-related custom data connectors for Spark

cybersecurity databricks pyspark spark

Last synced: 30 Jan 2025

https://github.com/hungreeee/reddit-realtime-streaming-pipeline

End-to-end real-time pipeline for comments processing of any subreddit for sentiment analysis.

cassandra docker-compose kafka praw-reddit real-time reddit-api spark

Last synced: 12 Jan 2025

https://github.com/librity/rtjvm_spark_essentials

Rock The JVM - Apache Spark Essentials

apache-spark big-data docker scala spark spark-sql

Last synced: 08 Jan 2025

https://github.com/worst001/note_bigdata

A collection of various big data materials, notes, and manuals

bigdata cdh datawarehouse development flink flume guide hadoop hbase hive learning markdown mkdocs note notebook spark

Last synced: 12 Jan 2025

https://github.com/20cent16/airflow-spark

A ready-to-use setup for running Airflow with Spark ;-)

airflow spark

Last synced: 11 Oct 2024

https://github.com/rupeshtr78/awsiot

AWS IoT integration using EMR, Spark, and Kinesis

aws aws-emr dynamodb iot kinesis spark spark-streaming

Last synced: 12 Jan 2025

https://github.com/inf0rmatiker/model-service

A service providing federated model training for spatially-segregated data.

python spark

Last synced: 08 Jan 2025

https://github.com/rupeshtr78/aws-emr

Spark Job on Amazon EMR cluster

aws cluster emr-cluster mapreduce mapredue scala spark

Last synced: 12 Jan 2025

https://github.com/sebastianruizm/pyspark-graphframes

Data analysis with GraphFrames and PySpark

python spark sql

Last synced: 08 Jan 2025