Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/multivacplatform/multivac-wikipedia

Reusable code, libraries, and scripts for processing Wikipedia page views with Apache Spark.

data-frame multivac-wikipedia spark spark-sql wikipedia

Last synced: 12 Jan 2025

https://github.com/multivacplatform/multivac-ml

Pre-trained ML models for Apache Spark

machine-learning nlp spark spark-ml

Last synced: 12 Jan 2025

https://github.com/kemalcanbora/ba_bigdata_docker

Docker containers provide a way to package applications with everything needed to run them, including base operating system images, databases, libraries, and binaries.

bigdata hadoop hue kafka spark

Last synced: 24 Jan 2025

https://github.com/gnaneshkunal/scala-hadoop

Hadoop programming using Scala

big-data bigdata hadoop scala spark sql

Last synced: 09 Feb 2025

https://github.com/782e616c6d/covid-d.a

Academic project, using Apache Spark for ETL and Data Studio for data analysis.

academic analytics automation cluster covid-19 data database etl python spark sql

Last synced: 26 Jan 2025

https://github.com/ltossian/bike-sales-data-metrics

Processing, storage, analysis, and visualization of a large CSV file and real-time bike sales data.

fastapi grafana hadoop kafka postgresql python spark

Last synced: 11 Oct 2024

https://github.com/georgegkonis/spark-decentralized-query-processing

Project for the academic course "Decentralized Data Technologies"

big-data decentralized-data jupyter python query-optimization spark

Last synced: 19 Dec 2024

https://github.com/chucheng92/sparkstreamingkafka

Spark Streaming logs to Kafka.

kafka spark spark-streaming streaming

Last synced: 01 Feb 2025

https://github.com/michelderu/cassandra-csv-analytics

How to leverage Astra, DSE and Spark for analytics on large CSV files.

astra cassandra spark

Last synced: 20 Jan 2025

https://github.com/declaredata/fuse_python

PySpark-compatible Python client for DeclareData Fuse Server: a blazing fast data processing engine and drop-in alternative to Spark clusters.

data-processing pyspark rust-lang spark

Last synced: 13 Jan 2025

https://github.com/pranavshashidhara/movie-recommendation-system

This project focuses on developing a recommendation system utilizing various learning techniques, including collaborative filtering, matrix factorization, and restricted Boltzmann machines (RBMs).

big-data recommendation-system spark

Last synced: 13 Jan 2025

https://github.com/tadod12/airflow-spark-job

A workspace to experiment with Apache Spark and Airflow in a Docker environment

airflow docker rdbms spark

Last synced: 13 Jan 2025

https://github.com/mauriciovazquezm/spark_bigdata_architecture_project

Final project for the course 'Architecture for Large Data Volumes', taught in the Bachelor's program in Data Science at ITAM

data-stream-processing data-streaming pyspark python spark time-series

Last synced: 13 Jan 2025

https://github.com/vubacktracking/hdfs-stream-processing

Streaming data processing using Hadoop HDFS, Spark, Kafka, Minio, Elasticsearch

airflow elastic hadoop hdfs kafka kibana minio spark

Last synced: 11 Oct 2024

https://github.com/20cent16/airflow-spark

A ready-to-use setup for running Airflow with Spark ;-)

airflow spark

Last synced: 11 Oct 2024

https://github.com/tsovak/spark-demo

The Spark REST API with Spring Boot and MongoDB

docker-compose mongodb rest-api spark sparkjava sparkrest spring-boot

Last synced: 08 Feb 2025

https://github.com/pierrekieffer/genericsupervisedmachinelearning

Generic supervised machine learning application

machine-learning spark

Last synced: 07 Feb 2025

https://github.com/sankamuk/aws-kinesis-redshift-sparkstream

Spark Structured Streaming from AWS Kinesis and Redshift

aws kinesis pyspark redshift spark structured-streaming terraform

Last synced: 13 Jan 2025

https://github.com/manojpawar94/spark-scala-examples

Sample programs implemented with Apache Spark, built on the concepts of Spark RDDs and Spark SQL DataFrames.

apache-spark spark spark-rdd spark-sql

Last synced: 13 Jan 2025

https://github.com/fbraza/data-processing-scala-spark

A repository containing Scala code that uses Spark to process a log data file. The full procedure for running the application is described in the README.md file.

scala spark

Last synced: 26 Jan 2025

https://github.com/same-ou/spark-hdfs-ml

Spark and HDFS cluster using Docker and Docker Compose

hdfs ml spark

Last synced: 25 Dec 2024

https://github.com/exasol/spark-connector-common-java

Common library for Exasol Apache Spark based connectors

apache-spark exasol exasol-integration spark streaming

Last synced: 09 Feb 2025

https://github.com/mjngxwnj/olympics_data_project

A personal project that builds an end-to-end data pipeline using the 2024 Olympics data.

airflow docker hadoop python snowflake spark superset

Last synced: 09 Feb 2025

https://github.com/thdaraujo/cheat

A handful of cheatsheets and programming tips.

bash cheat-sheets cheatsheet dms hadoop postgresql spark sqoop

Last synced: 24 Jan 2025

https://github.com/manuparra/clustering-openstack

Make a dynamic and customizable cluster with OpenStack

cluster deployment hadoop openstack openstack-command script slave-nodes spark

Last synced: 27 Dec 2024

https://github.com/williamliu52/twitter-sc

Trending sports highlights from Twitter

nodejs python react reactjs scala spark twitter

Last synced: 23 Oct 2024

https://github.com/vitalibo/distributed-heatmap-service

Simple distributed heatmap service on top of Apache HBase

aws hbase hbase-coprocessor heatmap spark spark-sql spring-boot

Last synced: 27 Dec 2024

https://github.com/ishaansathaye/csc369-introdistributedcomputing

Cal Poly Fall 2024 CSC 369 Intro to Distributed Computing

distributed-computing hadoop java map-reduce scala spark

Last synced: 09 Feb 2025

https://github.com/soumyadipta2020/sparkr_test

Sample code for Spark using the R programming language

r r-coding r-programming r-programming-language spark sparkr

Last synced: 05 Jan 2025

https://github.com/beiyuouo/mi-store-log-analysis

👨‍🦽 Pseudo Xiaomi Mall: big data e-commerce log analysis

flask full-stack java kafka python spark

Last synced: 02 Feb 2025

https://github.com/iamhatesz/dend-covid19

Capstone project from Udacity's Data Engineer Nanodegree program.

airflow aws redshift spark udacity udacity-data-engineer-nanodegree udacity-nanodegree

Last synced: 13 Jan 2025

https://github.com/peregin/iot-spark-rest

Example with Kafka, Spark, and a REST service

iot kafka rest scalatra spark streaming

Last synced: 13 Jan 2025

https://github.com/dodat-12/airflow-spark-job

A workspace to experiment with Apache Spark and Airflow in a Docker environment

airflow docker rdbms spark

Last synced: 20 Dec 2024

https://github.com/tonyz0x0/parallel-ml

An implementation of parallel machine learning algorithms using Spark

machine-learning python spark

Last synced: 02 Feb 2025

https://github.com/ev2900/iceberg_emr_athena

Resources from a virtual tech talk / workshop: Set Up and Use Apache Iceberg Tables on Your Data Lake

apache-iceberg athena aws emr spark

Last synced: 05 Nov 2024

https://github.com/ev2900/emr_studio_deployment

Example Jupyter notebook for EMR Studio

aws emr emr-studio spark

Last synced: 05 Nov 2024

https://github.com/e2fyi/databricks-utils

`databricks-utils` is a Python package that provides several utility classes and functions that improve ease of use in Databricks notebooks.

aws databricks jupyter-notebooks notebook pyspark s3 spark vega vega-lite

Last synced: 16 Jan 2025

https://github.com/bytemedirk/pyspark3-docker

PySpark3 Docker container for testing & development. With OpenJDK, Spark 3.1.2, and Hadoop 2.7.

aws docker docker-image python spark

Last synced: 13 Jan 2025

https://github.com/ezeparziale/big-data-cluster

🐘 Big data cluster

big-data bigdata hadoop hdfs hive spark zookeeper

Last synced: 20 Jan 2025

https://github.com/facaiy/spark-for-the-impatient

A collection of short code snippets for impatient readers who want to start using Spark right away.

spark spark-training tutorial

Last synced: 20 Jan 2025

https://github.com/bnvulpe/paperslab

The project aims to automate content classification and knowledge retrieval, and to analyze the temporal and thematic impact of research over a given time period. It also considers offering users network analysis of communication within the community.

api-extractor big-data big-data-and-ml big-data-infrastructure docker elasticsearch etl-pipeline information-retrieval knowledge-discovery mysql neo4j network-analysis spark temporal-analysis

Last synced: 09 Feb 2025

https://github.com/tuancamtbtx/java-spark-example

Spark ETL Generic Processor

etl spark

Last synced: 02 Jan 2025

https://github.com/tallamjr/epfl-functional-scala

Materials and worked assignments for Functional Programming with Scala Specialization on Coursera

big-data scala spark

Last synced: 10 Feb 2025

https://github.com/tallamjr/jetspark

Spark cluster on Jetson TX2 mini-project

gpu nvidia spark tx2-jetpack

Last synced: 10 Feb 2025

https://github.com/bishalpaudel/sparkhbaseloganalyzer

Spark- and HBase-based Apache access log analyzer

big-data cloudera hbase scala spark

Last synced: 06 Jan 2025

https://github.com/shayartt/streaming-orders

Project to stream real-time orders and apply ETL pipelines and analytics using Databricks, Kafka, and AWS

databricks etl kafka python spark spark-streaming

Last synced: 12 Oct 2024

https://github.com/drsnowbird/nlp-deeplearning-projects

NLP Deep Learning Projects (Warning - Not ready for public consumption yet!)

chatbot deep-learning mallet nlp python3 rasa-core rasa-nlu spark tensorflow

Last synced: 13 Jan 2025

https://github.com/marcorfilacarreras/matemaquest

A simple API for retrieving information about the "Pruebas Canguro" exams

api docker github-actions java math mathematics spark

Last synced: 13 Jan 2025

https://github.com/andreoss/spark-tabs

Custom tabs for Spark UI

scala spark

Last synced: 10 Feb 2025

https://github.com/rishav273/spark-cluster-multi-node-setup

Quickly set up and simulate a multi-node Spark cluster using Docker and docker-compose.

docker docker-compose pyspark python3 spark

Last synced: 11 Oct 2024

https://github.com/f-lab-edu/league-of-legends-data-solution

Benchmarked against 'League of Legends', this project provides a data solution that keeps data flowing smoothly in real time through an API that generates player behavior events.

airflow dataengineering spark

Last synced: 11 Oct 2024

https://github.com/tuancamtbtx/spark-build-tool

A tool for generating Spark jobs

java k8s spark

Last synced: 11 Oct 2024

https://github.com/fsanaulla/spark-http-rdd

RDD primitive for fetching data from an HTTP source

scala spark

Last synced: 12 Oct 2024

https://github.com/positlabs/spark-picker-animations

Animated Native UI Picker Icons in Spark AR

augmented-reality instagram spark spark-ar

Last synced: 02 Feb 2025

https://github.com/najuzilu/dl-spark

Building a Data Lake with Spark

aws-emr aws-s3 data-engineering data-lake etl-pipeline spark

Last synced: 26 Jan 2025

https://github.com/pedropark99/spark_map

Easily apply a function over multiple columns of a Spark DataFrame

pyspark python spark

Last synced: 28 Nov 2024

https://github.com/nicklitwinow/hse-python-capstone-project

This project is a comprehensive data engineering and analytics solution built using modern technologies such as Airflow, Spark, PostgreSQL, MySQL, Kafka, and Docker. It orchestrates data ingestion, processing, replication, streaming, and analytics across multiple containers.

airflow analytics dataengineering docker etl kafka mysql postgresql python spark streaming

Last synced: 03 Feb 2025

https://github.com/nkdwon/crud-spark

A CRUD application written in Java with PostgreSQL integration and the Spark framework, developed in the Eclipse environment

eclipse-ide git java maven pgadmin4 postgresql spark

Last synced: 06 Jan 2025

https://github.com/s8sg/spark-standalone-cluster

Spark Standalone Cluster With Zookeeper

docker docker-compose spark zookeeper

Last synced: 01 Feb 2025

https://github.com/lajwithsingh/magelocaldatapipeline

A compact project showcasing local data lake setup using Docker, Mage, Spark, MinIO, Iceberg, and StarRocks. Ideal for learning modern data engineering practices.

docker iceberg mage minio spark starrocks

Last synced: 29 Dec 2024

https://github.com/zncdatadev/spark-k8s-operator

Operator for Apache Spark-on-Kubernetes of the Kubernetes Data Stack

k8s kubernetes spark

Last synced: 19 Nov 2024

https://github.com/starhe/balm

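A big data PaaS component adapter built on the Spring Boot stack. It adapts DolphinScheduler, Hadoop, Spark, Hive, Impala, HBase, Kafka, StarRocks, ClickHouse, and Neo4j with one click and is operated through a standard REST interface. Simple and easy to use, and convenient for secondary development and integration.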

clickhouse dolphinscheduler hadoop hbase hive impala kafka neo4j spark spring starrocks

Last synced: 21 Dec 2024

https://github.com/mukjepscarlet/bilibili-predict-recommend

[Big data course assignment] Bilibili assistant: video recommendation + popularity prediction

bilibili flask hadoop html javascript prediction pyspark python recommendation spark

Last synced: 18 Jan 2025

https://github.com/kirbs-/svm

Apache Spark Version Manager

apache-spark spark

Last synced: 27 Jan 2025

https://github.com/jimthompson5802/datascience_containers

Personal docker images for various data science software stacks

data-science docker h2oai jupyter-notebook kubernetes python rstudio-servers spark

Last synced: 29 Dec 2024

https://github.com/abdelmajidlh/spark_ml_weather

Scala and Spark learning project: predicting tomorrow's rain from historical data

pom scala spark spark-ml spark-sql

Last synced: 27 Jan 2025

https://github.com/hexnn/balm

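A big data PaaS component adapter built on the Spring Boot stack. It adapts DolphinScheduler, Hadoop, Spark, Hive, Impala, HBase, Kafka, StarRocks, ClickHouse, Neo4j, Redis, and ElasticSearch with one click and is operated through a standard REST interface and SQL statements. Simple and easy to use, and convenient for secondary development and rapid integration.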

clickhouse datax dolphinscheduler elasticsearch hadoop hbase hive impala kafka maxcompute neo4j phoenix presto spark starrocks

Last synced: 21 Dec 2024

https://github.com/fanqingsong/machine_learning_system_on_spark

A simple machine learning system demo (clustering and prediction on the iris data) for ML study. Based on the machine_learning_system repo, it adds a new process for the ML model service using Celery and Spark.

celery django machine-learning reactjs spark

Last synced: 21 Dec 2024

https://github.com/silvanheller/parquet-demo

Parquet demo project for the Workshop in the Course DIS. Benchmarks Parquet versus ORC, JSON and CSV

benchmark orc parquet r scala spark university-project

Last synced: 27 Jan 2025

https://github.com/nthaihoc/segmentation-customer-hadoop-spark-mlops-icta-2024

An automatic machine learning based customer segmentation model with RFM analysis at ICTA conference 2024

dbscan-clustering-algorithm dvc-pipeline feature-engineering hadoop k-means-clustering machine-learning mlops-workflow spark

Last synced: 21 Jan 2025

https://github.com/tianzhipeng-git/wdsdatasource

WdsDataSource is a Spark data source implementation that allows reading and writing data in WebDataset format

spark webdataset-format

Last synced: 21 Jan 2025

https://github.com/tianzonglin/bigeyes

A distributed graph computing platform that enables simple visual analysis of large-scale relational data.

canvas distributed-computing graph-drawing spark websocket

Last synced: 30 Dec 2024

https://github.com/azavea/docker-spark

Base Docker image for Spark.

docker openjdk spark

Last synced: 07 Jan 2025

https://github.com/tomfran/lastfm-users-analysis

Last.fm user data collection and analysis using Spark

gcp lastfm spark

Last synced: 06 Jan 2025

https://github.com/omr5221/kafka-account-fraud-detector

Learning about Kafka and Spark through a project built on top of an existing project

kafka python spark superset

Last synced: 27 Jan 2025

https://github.com/hpgrahsl/gab2016streamanalytics

Repository with materials for my Session at Global Azure Bootcamp 2016

azure bootcamp spark storm streamanalytics

Last synced: 08 Jan 2025

https://github.com/kuro337/scalamono

Scala Monorepo Tooling for Kafka, Opensearch, Spark, Redpanda, Hadoop - and Lang Reference.

data database duckdb hadoop kafka redpanda sdala spark

Last synced: 14 Jan 2025

https://github.com/librity/rtjvm_spark_essentials

Rock The JVM - Apache Spark Essentials

apache-spark big-data docker scala spark spark-sql

Last synced: 08 Jan 2025

https://github.com/oracle-quickstart/oci-hortonworks

Terraform module to deploy Hortonworks on Oracle Cloud Infrastructure (OCI)

cloud hadoop hdf hdp hortonworks oci oracle partner-led spark terraform

Last synced: 07 Nov 2024

https://github.com/sebastianhaeni/spark-zeppelin-docker

Docker files to run Spark and Zeppelin

docker spark zeppelin

Last synced: 14 Jan 2025

https://github.com/cbhihe/mesos_on_docker

Benchmark of CPU- and I/O-intensive operations for Mesos on Docker with Spark

benchmarking docker mapreduce mesos spark

Last synced: 02 Jan 2025

https://github.com/tuancamtbtx/bigdata-sdk

A set of data connectors for big data

elasticsearch hadoop spark

Last synced: 02 Jan 2025