Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/saadsalmanakram/data-processing

This repo covers the key frameworks, libraries, and tools used for data processing.

big-data pandas polars spark

Last synced: 14 Feb 2025

https://github.com/fpopic/gg-interview-challenge

(Interview) GG Interview Challenge in Scala/Spark

apache-spark json logstash parsing regex scala spark sparksql

Last synced: 10 Jan 2025

https://github.com/jldbc/big-data

Coursework from Big Data (CS3390) -- Machine Learning tasks performed using Hadoop, MapReduce, and Spark

big-data hadoop pagerank recommender-system spark

Last synced: 04 Jan 2025

https://github.com/mangalaman93/dspark

Run Spark in Docker containers

big-data containers docker microservices spark

Last synced: 18 Jan 2025

https://github.com/mtpatter/bilao

Jupyter notebooks for filtering Kafka data with Spark Streaming.

avro docker jupyter-notebook kafka spark spark-streaming

Last synced: 12 Jan 2025

https://github.com/multivacplatform/multivac-fakenews

Detecting users and communities that propagate fake news on Twitter using Apache Spark

deep-learning fakenews machine-learning spark twitter

Last synced: 12 Jan 2025

https://github.com/multivacplatform/multivac-wikipedia

Wonderful reusable code, libraries, and scripts for processing Wikipedia page views with Apache Spark.

data-frame multivac-wikipedia spark spark-sql wikipedia

Last synced: 12 Jan 2025

https://github.com/multivacplatform/multivac-ml

Pre-trained ML models for Apache Spark

machine-learning nlp spark spark-ml

Last synced: 12 Jan 2025

https://github.com/kemalcanbora/ba_bigdata_docker

Docker containers provide a way to package applications with everything needed to run them, including base operating system images, databases, libraries, and binaries.

bigdata hadoop hue kafka spark

Last synced: 24 Jan 2025

https://github.com/kruglov-dmitry/yelp_data

End-to-end example of how to read big (well, comparatively) data from Kafka and write it into Cassandra using Spark Structured Streaming. Uses the Yelp dataset for illustration.

cassandra kafka spark streaming yelp-dataset

Last synced: 19 Jan 2025

https://github.com/kwartile/spark-benchmark

Spark Benchmark suite to evaluate cluster configuration and compare the performance with other big data frameworks.

apache-spark benchmark benchmarking-suite cdh cloudera-hadoop hadoop hive impala performance scala spark

Last synced: 08 Feb 2025

https://github.com/aveek-saha/cricket-score-predictor

A big data application to predict the outcome of a T20 cricket match.

big-data big-data-analytics clustering pyspark spark spark-mllib

Last synced: 15 Feb 2025

https://github.com/peteprattis/road-safety-database-with-jdbc-and-spark-rdd

A JDBC application that runs queries in pgAdmin to simulate the functionality of the UK Ministry of Transport's database, using Apache Spark RDDs to implement the queries.

computer-science index java jdbc jdbc-database partitions pgadmin postgresql program query spark spark-sql sparkjava sql student

Last synced: 18 Jan 2025

https://github.com/peteprattis/insurance-company-database-with-jdbc-and-spark-rdd

A JDBC application that runs queries in pgAdmin to simulate the functionality of an insurance company's database, using Apache Spark RDDs to implement the queries.

computer-science java jdbc jdbc-database partitioning partitions postgresql program query spark spark-sql sparkjava sql student

Last synced: 18 Jan 2025

https://github.com/javaidiqbal11/arabic-tweets-sentiment-analysis-using-spark

Sentiment analysis of an Arabic Twitter dataset using Apache Spark.

apache-spark arabic-nlp arabic-tweets flask pyhton3 sentiment-analysis spark twitter-api

Last synced: 03 Jan 2025

https://github.com/darule0/sparkdiff

A rudimentary command line utility for contrasting Apache Spark event logs.

apache-spark compare-files diff difference diffing spark spark-sql spark-streaming sparksql

Last synced: 06 Feb 2025

https://github.com/iversonson/spark-lite-document-translator

This project aims to provide a fast and efficient document translation solution using Spark Lite's machine learning APIs

spark translation

Last synced: 17 Jan 2025

https://github.com/opt-nc/opt-temps-attente-agences-camel

Pull data from opt-temps-attente-agences-api and store it in various systems

camel datascience dataviz glia innovation kafka opensearch relation-client spark

Last synced: 12 Dec 2024

https://github.com/fpopic/hf-interview-challenge

(Interview) Mixing Data Engineering & Data Science with PySpark

data-engineering data-science pyspark python recipes spark

Last synced: 10 Jan 2025

https://github.com/soumyadipta2020/sparkr_test

Sample code for Spark using the R programming language

r r-coding r-programming r-programming-language spark sparkr

Last synced: 05 Jan 2025

https://github.com/pedropark99/spark_map

Easily apply a function over multiple columns of a Spark DataFrame

pyspark python spark

Last synced: 28 Nov 2024

https://github.com/scrapcodes/kafkaproducer

Benchmarks to measure latency using Spark and Kafka.

benchmark kafka spark

Last synced: 03 Jan 2025

https://github.com/najuzilu/dl-spark

Building a Data Lake with Spark

aws-emr aws-s3 data-engineering data-lake etl-pipeline spark

Last synced: 26 Jan 2025

https://github.com/luisfalva/ophelia

Ophelian On Mars! More than a simple framework.

dask dataframe ophelia ophelia-spark rdd spark spark-ml spark-mllib spark-streaming

Last synced: 17 Dec 2024

https://github.com/viyadb/viyadb-spark

Data processing and ingestion backend for ViyaDB based on Spark Streaming

spark spark-streaming spark-streaming-kafka viyadb

Last synced: 08 Feb 2025

https://github.com/scrapcodes/spark-templates

One-stop shop for Apache Spark starter samples.

apache samples spark

Last synced: 03 Jan 2025

https://github.com/tianzonglin/bigeyes

A distributed graph computing platform that enables simple visual analysis of large-scale relational data.

canvas distributed-computing graph-drawing spark websocket

Last synced: 20 Feb 2025

https://github.com/bomada/sparkify

This project is the final Capstone project of the Udacity Data Scientist Nanodegree program. The aim is to learn how to manipulate realistic datasets with Spark to engineer relevant features for predicting churn. The input data comes from the fictional music streaming service Sparkify (similar to Spotify and Pandora).

churn ml music portfolio python spark streaming

Last synced: 09 Feb 2025

https://github.com/mxagar/spark_big_data_guide

This repository contains my personal guide on Spark and topics related to Big Data.

big-data hadoop machine-learning spark

Last synced: 15 Feb 2025

https://github.com/fiware/tutorials.big-data-spark

FIWARE 306: Real-time Processing of Context Data using Apache Spark

apache-spark big-data-analytics fiware fiware-cosmos orion-spark-connector spark tutorial

Last synced: 17 Nov 2024

https://github.com/ltossian/bike-sales-data-metrics

Processing, storage, analysis, and visualization of a large CSV file and of real-time bike sales data.

fastapi grafana hadoop kafka postgresql python spark

Last synced: 11 Feb 2025

https://github.com/chimera-suite/spark-sidecar-setup

The sidecar setup container executes SparkSQL scripts against an Apache Spark instance.

docker setup sidecar-container spark sparksql

Last synced: 03 Jan 2025

https://github.com/chimera-suite/use-case

A step-by-step tutorial that showcases the capabilities of Chimera

chimera jena-fuseki knowledge-graph ontology pizza spark sparql-query

Last synced: 03 Jan 2025

https://github.com/georgegkonis/spark-decentralized-query-processing

Project for the academic course "Decentralized Data Technologies"

big-data decentralized-data jupyter python query-optimization spark

Last synced: 12 Feb 2025

https://github.com/tianzhipeng-git/wdsdatasource

WdsDataSource is a Spark data source implementation that allows reading and writing data in WebDataset format

spark webdataset-format

Last synced: 21 Jan 2025

https://github.com/nikoshet/pyspark-movie-similarities

Using Spark In Python For Movie Similarities With Jaccard Index

jaccard-index movie-similarities pyspark spark

Last synced: 03 Jan 2025

https://github.com/nikoshet/spark-mlp

Multilayer Perceptron Implementation Using Spark

hdfs machine-learning mapreduce multilayer-perceptron pyspark python spark

Last synced: 03 Jan 2025

https://github.com/snexus/streaming-playground

Exploring streaming design patterns with Kafka and Spark Structured Streaming

kafka kafka-producer python spark spark-streaming

Last synced: 23 Jan 2025

https://github.com/ronaldkanyepi/log-realtime-analysis

A scalable architecture for real-time log processing and visualization. Built with a Kafka-Spark ETL pipeline, DynamoDB for storing aggregate real-time metrics, and Python Dash for interactive dashboards. Designed for high-throughput log ingestion, real-time monitoring, and long-term storage.

dash docker docker-compose docker-container dynamodb etl etl-pipeline hdfs kafka kafka-consumer kafka-producer kafka-streams kafka-topic logs python realtime spark spark-streaming streaming visualization

Last synced: 16 Feb 2025

https://github.com/ishaansathaye/csc369-introdistributedcomputing

Cal Poly Fall 2024 CSC 369 Intro to Distributed Computing

distributed-computing hadoop java map-reduce scala spark

Last synced: 09 Feb 2025

https://github.com/nthaihoc/segmentation-customer-hadoop-spark-mlops-icta-2024

An automatic machine-learning-based customer segmentation model with RFM analysis (ICTA conference 2024)

dbscan-clustering-algorithm dvc-pipeline feature-engineering hadoop k-means-clustering machine-learning mlops-workflow spark

Last synced: 21 Jan 2025

https://github.com/vitalibo/distributed-heatmap-service

Simple distributed heatmap service on top of Apache HBase

aws hbase hbase-coprocessor heatmap spark spark-sql spring-boot

Last synced: 18 Feb 2025

https://github.com/akaliutau/spark-recipes

Contains a collection of data processing solutions built on top of Spark

java spark

Last synced: 11 Jan 2025

https://github.com/positlabs/spark-picker-animations

Animated Native UI Picker Icons in Spark AR

augmented-reality instagram spark spark-ar

Last synced: 02 Feb 2025

https://github.com/adelin-info/tp_datacloud

Architecture and development of large-scale distributed systems

hadoop java map-reduce scala spark yarn zookeeper

Last synced: 30 Jan 2025

https://github.com/williamliu52/twitter-sc

Trending sports highlights from Twitter

nodejs python react reactjs scala spark twitter

Last synced: 23 Oct 2024

https://github.com/silvanheller/parquet-demo

Parquet demo project for the Workshop in the Course DIS. Benchmarks Parquet versus ORC, JSON and CSV

benchmark orc parquet r scala spark university-project

Last synced: 27 Jan 2025

https://github.com/coreyauger/ashley-madison-spark

Spark data analysis for the Ashley Madison dataset.

scala spark

Last synced: 16 Jan 2025

https://github.com/tuancamtbtx/bigdata-sdk

Some data connectors for big data

elasticsearch hadoop spark

Last synced: 02 Jan 2025

https://github.com/rishav273/spark-cluster-multi-node-setup

Quickly set up and simulate a multi-node Spark cluster using Docker and docker-compose.

docker docker-compose pyspark python3 spark

Last synced: 14 Feb 2025

https://github.com/andreoss/spark-tabs

Custom tabs for Spark UI

scala spark

Last synced: 10 Feb 2025

https://github.com/cbhihe/mesos_on_docker

Benchmark of CPU- and I/O-intensive operations for Mesos on Docker with Spark

benchmarking docker mapreduce mesos spark

Last synced: 02 Jan 2025

https://github.com/sebastianhaeni/spark-zeppelin-docker

Docker files to run Spark and Zeppelin

docker spark zeppelin

Last synced: 14 Jan 2025

https://github.com/hexnn/balm

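A big data PaaS component adapter built on the Spring Boot stack. It adapts DolphinScheduler, Hadoop, Spark, Hive, Impala, HBase, Kafka, StarRocks, ClickHouse, Neo4j, Redis, and ElasticSearch in one step and is operated through standard REST interfaces and SQL statements. It is simple to use and convenient for secondary development and rapid integration.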

clickhouse datax dolphinscheduler elasticsearch hadoop hbase hive impala kafka maxcompute neo4j phoenix presto spark starrocks

Last synced: 13 Feb 2025

https://github.com/manuparra/clustering-openstack

Make a dynamic and customizable cluster with OpenStack

cluster deployment hadoop openstack openstack-command script slave-nodes spark

Last synced: 18 Feb 2025

https://github.com/azurespheredev/microsoftfabric-exploratorium

A comprehensive educational resource hub dedicated to mastering Microsoft Fabric, offering in-depth tutorials, real-world use cases, and hands-on guides for seamless end-to-end analytics

analytics data-science data-transformation lakehouse microsoft-fabric one-lake powerbi real-time-analytics spark warehouse

Last synced: 11 Jan 2025

https://github.com/aamend/spark-archetype

A Maven archetype is a convenient way to create fully fledged Spark libraries at minimal cost

devops maven spark

Last synced: 29 Jan 2025

https://github.com/azlinrusnan/iris_pyspark_analysis

Iris Classification using PySpark

apache pyspark-mllib python r spark

Last synced: 20 Feb 2025

https://github.com/marcorfilacarreras/matemaquest

A simple API to get information about the "Pruebas Canguro" exams

api docker github-actions java math mathematics spark

Last synced: 13 Jan 2025

https://github.com/stabrise/scaledp-tutorials

Tutorials for ScaleDP library. ScaleDP is an Open-Source Library for Processing Documents in Apache Spark.

ner nlp ocr ocr-python pdf spark

Last synced: 30 Jan 2025

https://github.com/abdelmajidlh/spark_ml_weather

Scala and Spark learning project: predicting tomorrow's rain from historical data

pom scala spark spark-ml spark-sql

Last synced: 27 Jan 2025

https://github.com/aldantanneo/bigints

WIP constant time bigint implementation in SPARK

ada bigint cryptography formal-verification spark

Last synced: 30 Jan 2025

https://github.com/codelytv/spark-kafka_rabbitmq_sqs-course

Course examples for integrating Spark with queue systems

apache-spark aws-sqs kafka rabbitmq spark

Last synced: 30 Jan 2025

https://github.com/alexott/cyber-spark-data-connectors

Cybersecurity-related custom data connectors for Spark

cybersecurity databricks pyspark spark

Last synced: 30 Jan 2025

https://github.com/hungreeee/reddit-realtime-streaming-pipeline

End-to-end real-time pipeline for comments processing of any subreddit for sentiment analysis.

cassandra docker-compose kafka praw-reddit real-time reddit-api spark

Last synced: 12 Jan 2025

https://github.com/worst001/note_bigdata

A collection of big-data-related materials, notes, and manuals

bigdata cdh datawarehouse development flink flume guide hadoop hbase hive learning markdown mkdocs note notebook spark

Last synced: 12 Jan 2025

https://github.com/antonio-f/big-data-analysis-with-scala-and-spark

Coding assignments from the course "Big Data Analysis with Scala and Spark" (Coursera).

big-data bigdata coursera data-analysis scala spark

Last synced: 06 Feb 2025

https://github.com/kuro337/scalamono

Scala Monorepo Tooling for Kafka, Opensearch, Spark, Redpanda, Hadoop - and Lang Reference.

data database duckdb hadoop kafka redpanda sdala spark

Last synced: 14 Jan 2025

https://github.com/782e616c6d/covid-d.a

Academic project, using Apache Spark for ETL and Data Studio for data analysis.

academic analytics automation cluster covid-19 data database etl python spark sql

Last synced: 26 Jan 2025

https://github.com/thdaraujo/cheat

A handful of cheatsheets and programming tips.

bash cheat-sheets cheatsheet dms hadoop postgresql spark sqoop

Last synced: 24 Jan 2025

https://github.com/hpgrahsl/gab2016streamanalytics

Repository with materials for my Session at Global Azure Bootcamp 2016

azure bootcamp spark storm streamanalytics

Last synced: 08 Jan 2025

https://github.com/rupeshtr78/awsiot

AWS IoT Integration Using EMR, Spark, and Kinesis

aws aws-emr dynamodb iot kinesis spark spark-streaming

Last synced: 12 Jan 2025

https://github.com/talmago/pyspark-loglikelihood

PySpark Loglikelihood Similarity Examples

mahout pyspark recommendation-engine spark

Last synced: 03 Feb 2025

https://github.com/drsnowbird/nlp-deeplearning-projects

NLP Deep Learning Projects (Warning - Not ready for public consumption yet!)

chatbot deep-learning mallet nlp python3 rasa-core rasa-nlu spark tensorflow

Last synced: 13 Jan 2025

https://github.com/rupeshtr78/aws-emr

Spark Job on Amazon EMR cluster

aws cluster emr-cluster mapreduce mapredue scala spark

Last synced: 12 Jan 2025

https://github.com/mounirbs/spark-connect

Spark Connect: a docker-compose solution that sets up a Spark cluster with the Spark Connect feature enabled. Can be used for local development.

apache apache-spark docker docker-compose pyspark python spark

Last synced: 15 Feb 2025

https://github.com/harborzeng/gangsutils

A pack of useful tools for Scala Spark projects

scala spark

Last synced: 29 Jan 2025

https://github.com/tuancamtbtx/spark-build-tool

A tool for generating Spark jobs

java k8s spark

Last synced: 13 Feb 2025

https://github.com/kirbs-/svm

Apache Spark Version Manager

apache-spark spark

Last synced: 27 Jan 2025

https://github.com/michelderu/cassandra-csv-analytics

How to leverage Astra, DSE and Spark for analytics on large CSV files.

astra cassandra spark

Last synced: 20 Jan 2025

https://github.com/starhe/balm

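A big data PaaS component adapter built on the Spring Boot stack. It adapts DolphinScheduler, Hadoop, Spark, Hive, Impala, HBase, Kafka, StarRocks, ClickHouse, and Neo4j in one step and is operated through standard REST interfaces. It is simple to use and convenient for secondary development and integration.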

clickhouse dolphinscheduler hadoop hbase hive impala kafka neo4j spark spring starrocks

Last synced: 13 Feb 2025