Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/ltossian/bike-sales-data-metrics

Processing, storage, analysis, and visualization of a large CSV file and of real-time bike sales data.

fastapi grafana hadoop kafka postgresql python spark

Last synced: 11 Feb 2025

https://github.com/hwywl/bigdata

Big data learning code: Spark, Hive, Storm, HBase

big-data flume hbase hdfs hive mr spark storm zook

Last synced: 08 Jan 2025

https://github.com/iversonson/spark-lite-document-translator

This project aims to provide a fast and efficient document translation solution using Spark Lite's machine learning APIs

spark translation

Last synced: 17 Jan 2025

https://github.com/crazybber/go-jupyter

Exploring big data with Spark in JupyterLab

bigdata jupyter-notebook jupyterlab rdd spark

Last synced: 28 Jan 2025

https://github.com/azlinrusnan/movielens_data_analysis_with_mongodb_and_cassandra

This project presents an analysis of the MovieLens 100k dataset using Apache Spark integrated with MongoDB and Cassandra. The dataset includes user information, movie ratings, and movie details, providing a comprehensive basis for exploring user preferences and movie popularity.

cassandra ml-100k mongodb python spark

Last synced: 17 Jan 2025

https://github.com/divithraju/divith-raju-data-mining

This project focuses on customer segmentation using data mining techniques, specifically K-Means clustering, to classify customers into distinct groups based on their purchasing behaviors. The goal is to analyze customer data and segment them into clusters for targeted marketing strategies and better customer relationship management.

algorthims analytics apache business client connector data dataarchitecture database dataengineering datamining datascience hadoop k-means-clustering mysql project project-repository pyspark python3 spark

Last synced: 17 Jan 2025

https://github.com/vasnake/artefacts-2019_2023

Collection of some interesting pieces of my projects. Spark, Scala, Python, sh

catalyst etl ml scala spark udaf udf

Last synced: 17 Jan 2025

https://github.com/inf0rmatiker/model-service

A service providing federated model training for spatially-segregated data.

python spark

Last synced: 08 Jan 2025

https://github.com/sebastianruizm/pyspark-graphframes

Data analysis with GraphFrames and PySpark

python spark sql

Last synced: 08 Jan 2025

https://github.com/zoltan-nz/learning-spark

Playing with Apache Spark

apache-spark java map-reduce spark

Last synced: 22 Jan 2025

https://github.com/cleberzumba/data-analysis-with-apache-spark-and-databricks

San Francisco Fire Calls. Creating a Spark application on Databricks using PySpark and SQL for common data analytics patterns and operations on a San Francisco Fire Department Calls dataset.

databricks pyspark spark sql

Last synced: 16 Nov 2024

https://github.com/adrianmarino/spark-examples

Spark install & examples

notebooks python spark

Last synced: 24 Jan 2025

https://github.com/mohnoor94/learningspark

My journey to learn Spark using Scala <3

learning learning-by-doing scala spark sparkscala

Last synced: 22 Jan 2025

https://github.com/ejw-data/google-colab-etl-amazon-reviews

Using Spark and Amazon RDS to clean and summarize Amazon reviews to determine the usefulness of product feedback

amazon-rds spark

Last synced: 22 Jan 2025

https://github.com/sandeepkundalwal/network-load-analysis-using-apache-spark

[CS561: MapReduce & BigData] Streaming Service using Apache Spark

big-data css html java javascript mapreduce spark

Last synced: 02 Feb 2025

https://github.com/tupol/spark-utils-demos

Demos for the tupol/spark-utils project together with a storyline

configuration demo framework scala spark

Last synced: 17 Jan 2025

https://github.com/tupol/spark-apps.seed.g8

Create Spark applications projects based on the spark-utils library.

application scala spark template

Last synced: 17 Jan 2025

https://github.com/amthorn/qutex

A basic Queue Management System, interactable via several mediums, that resembles a mutex.

ava bot bots cisco cisco-spark cisco-spark-bot mutex queue queuebot queues qutex spark thorn webex webex-teams

Last synced: 13 Nov 2024

https://github.com/samuele-lolli/steam-recommendation-system

A basic recommendation system built with Scala and Spark

mapreduce scala spark

Last synced: 04 Feb 2025

https://github.com/pprattis/road-safety-database-with-jdbc-and-spark-rdd

A JDBC application that runs queries in pgAdmin to simulate the functionality of the UK Ministry of Transport's database using Apache Spark RDD for query implementation.

computer-science index java jdbc jdbc-database partitions pgadmin postgresql program query spark spark-sql sparkjava sql student

Last synced: 04 Feb 2025

https://github.com/pprattis/insurance-company-database-with-jdbc-and-spark-rdd

A JDBC application that runs queries in pgAdmin to simulate the functionality of an insurance company's database using Apache Spark RDD for query implementation.

computer-science java jdbc jdbc-database partitioning partitions postgresql program query spark spark-sql sparkjava sql student

Last synced: 04 Feb 2025

https://github.com/rockfordwei/anagram

Anagram Solution Servers in Different Languages/Frameworks

anagram hdfs java javascript php python server spark swift

Last synced: 12 Jan 2025

https://github.com/alimarzouk/paris-aq

ELTL pipeline to monitor air quality in the Paris Île-de-France area

airflow airquality big-data bigquery dataengineering gcs spark

Last synced: 22 Jan 2025

https://github.com/angeligareta/spark-hadoop-hbase-overview

First lab for Data-Intensive Computing course at KTH where we are introduced to Apache Spark MLlib and Spark SQL, Hadoop, and HBase.

apache-spark data-intensive hadoop hbase hbase-table id2221 kth scala spark spark-mllib spark-sql

Last synced: 22 Jan 2025

https://github.com/angeligareta/spark-kafka-cassandra-overview

Second lab for Data-Intensive Computing course at KTH where we use Apache Kafka, Spark, and Cassandra to practice stream processing.

apache-kafka apache-spark cassandra cassandra-server data-intensive id2221 kafka kafka-topic kth scala spark stream-processing

Last synced: 22 Jan 2025

https://github.com/angeligareta/machine-learning-spark

Assignment for Scalable Machine Learning which aims to study the basics of regression and classification in Spark.

apache-spark machine-learning scala spark spark-classification spark-ml spark-mllib spark-regression spark-scala

Last synced: 22 Jan 2025

https://github.com/tomwhite/single-cell-spark-demo

Experiments on Single Cell data from 10x Genomics using Apache Spark.

bioinformatics genomics single-cell spark

Last synced: 17 Jan 2025

https://github.com/georgegkonis/spark-decentralized-query-processing

Project for the academic course "Decentralized Data Technologies"

big-data decentralized-data jupyter python query-optimization spark

Last synced: 12 Feb 2025

https://github.com/coreyauger/ashley-madison-spark

Spark data analysis for the Ashley Madison dataset.

scala spark

Last synced: 16 Jan 2025

https://github.com/wadiebenabdouh/socialmedia-usage-pipeline

Data from Kaggle, containing a wide range of users of different ages, genders, and interests.

apache-spark data-visualization jupyter-notebook kaggle pyspark python spark

Last synced: 16 Jan 2025

https://github.com/azlinrusnan/iris_pyspark_analysis

Iris Classification using PySpark

apache pyspark-mllib python r spark

Last synced: 31 Dec 2024

https://github.com/ralgond/bigdata-example

Examples, details, and caveats for Hadoop, Hive, and Spark

bigdata hadoop hdfs hive map-reduce spark

Last synced: 09 Jan 2025

https://github.com/kampi/particle-mqtt

MQTT client implementation for TCP-supporting devices (i.e. Argon, Photon) from Particle IoT.

cpp mqtt particle-argon particle-iot particle-swarm-optimization spark

Last synced: 21 Jan 2025

https://github.com/fiware/tutorials.big-data-spark

FIWARE 306: Real-time Processing of Context Data using Apache Spark

apache-spark big-data-analytics fiware fiware-cosmos orion-spark-connector spark tutorial

Last synced: 17 Nov 2024

https://github.com/aamend/spark-archetype

A Maven archetype is a convenient way to create fully fledged Spark libraries at minimal cost

devops maven spark

Last synced: 29 Jan 2025

https://github.com/harborzeng/gangsutils

A pack of useful tools for Scala Spark projects

scala spark

Last synced: 29 Jan 2025

https://github.com/snexus/streaming-playground

Exploring streaming design patterns with Kafka and Spark Structured Streaming

kafka kafka-producer python spark spark-streaming

Last synced: 23 Jan 2025

https://github.com/brooksian/twittersentimentsparkcorenlp

Twitter Sentiment Analysis Using Spark CoreNLP

nlp-machine-learning spark sparksql zeppelin-notebook

Last synced: 18 Nov 2024

https://github.com/ngone51/spark-read

A personal record of reading through the Spark (v2.4) source code.

source-code spark study

Last synced: 18 Nov 2024

https://github.com/darenr/spark-pca

Dimensionality reduction; scatter, hexbin, and KDE plots

pca python spark

Last synced: 05 Feb 2025

https://github.com/dunnkers/pyspark-bucketmap

Easily group pyspark data into buckets and map them to different values.

bucketizer categorizer pyspark pyspark-mllib python python3 spark

Last synced: 29 Jan 2025

https://github.com/izeigerman/twinkle

A collection of helpers and utilities for Apache Spark

apache-spark scala spark

Last synced: 08 Feb 2025

https://github.com/giuliosmall/twitter-trending-topics-pipeline

This project demonstrates trending topic detection using Apache Spark and MinIO. It processes Twitter JSON data with PySpark, leveraging distributed data processing and cloud storage. The entire project is containerized with Docker for easy deployment across architectures.

docker minio nlp pyspark pytest spacy spark streamlit

Last synced: 05 Feb 2025

https://github.com/annettaqi/spam-detection

Using stochastic gradient descent to classify emails as spam or ham

spark stochastic-gradient-descent

Last synced: 13 Feb 2025

https://github.com/binwenwu/oge-computation-ogc

A computing project corresponding to an OGC style API

geotrellis scala spark

Last synced: 13 Feb 2025

https://github.com/briansterle/cluster-fastcopy

Copy data between HDFS clusters blazingly fast

bigdata distcp hadoop hdfs spark yarn

Last synced: 13 Feb 2025

https://github.com/mahi97/internship-elk-loganalysis

Report on the development and deployment of an ELK Stack for MCI BI software and servers to perform real-time log analysis

elasticsearch kafka kibana latex logstash mesos redis spark

Last synced: 05 Feb 2025

https://github.com/danimonsalve/scala_spark

Scala application that uses Apache Spark to classify job postings according to the programming languages they mention. The goal is to demonstrate different classification and data processing techniques on large volumes of data.

rdd scala spark

Last synced: 17 Jan 2025

https://github.com/vubacktracking/stream-data-processing

Streaming data processing pipeline using Spark, PostgreSQL, Debezium, Kafka, Minio, Delta Lake, Trino and DBeaver

dbeaver debezium delta-lake kafka spark spark-streaming stream-processing trino

Last synced: 17 Jan 2025

https://github.com/antonio-f/big-data-analysis-with-scala-and-spark

Coding assignments from the course "Big Data Analysis with Scala and Spark" (Coursera).

big-data bigdata coursera data-analysis scala spark

Last synced: 06 Feb 2025

https://github.com/talmago/pyspark-loglikelihood

PySpark Loglikelihood Similarity Examples

mahout pyspark recommendation-engine spark

Last synced: 03 Feb 2025

https://github.com/tuancamtbtx/spark-build-tool

A tool for generating Spark jobs

java k8s spark

Last synced: 13 Feb 2025

https://github.com/renardeinside/databricks-jobs-jsonnet

Example project with Databricks jobs and configuration management via jsonnet

databricks jsonnet spark

Last synced: 06 Feb 2025

https://github.com/shayartt/streaming-orders

Project to stream real-time orders and apply some ETL pipelines & analytics using Databricks, Kafka, AWS

databricks etl kafka python spark spark-streaming

Last synced: 13 Feb 2025

https://github.com/imvision12/real-time-tracking

Real-time bus tracking using the MTA bus API

flask hadoop javascript leaflet python spark

Last synced: 08 Feb 2025

https://github.com/wtsi-hgi/hgi-cloud

Terraform and Ansible codebase to provision clusters (e.g. Hail/Spark) at Sanger

ansible hail iac openstack packer spark terraform

Last synced: 28 Nov 2024

https://github.com/cn-docker/spark-master

Spark Master Docker Image

docker-image spark spark-master

Last synced: 27 Jan 2025

https://github.com/nwtgck/spark-wikipedia-dump-loader

Wikipedia Dump Loader for Spark

scala spark wikipedia-dump

Last synced: 06 Feb 2025

https://github.com/nhsdigital/mps_diagnostics

Interpretable metadata for the results of NHS England record linkage

data-linkage data-science nhs-digital nhs-england pyspark record-linkage spark

Last synced: 23 Dec 2024

https://github.com/starhe/balm

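A big data PaaS component adapter built on the Spring Boot family of components. It adapts with one click to DolphinScheduler, Hadoop, Spark, Hive, Impala, HBase, Kafka, StarRocks, ClickHouse, and Neo4j, is operated through a standard REST interface, and is simple to use and convenient for secondary development and integration.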

clickhouse dolphinscheduler hadoop hbase hive impala kafka neo4j spark spring starrocks

Last synced: 13 Feb 2025

https://github.com/zkan/machine-learning-with-spark-and-zeppelin

Machine Learning with Apache Spark & Zeppelin

pyspark python spark zeppelin

Last synced: 12 Feb 2025

https://github.com/wgierke/distributed_data_analytics

Solutions for the hands-on sessions of the course "Distributed Data Analytics" at Hasso-Plattner-Institute using Akka and Spark.

akka data-analytics distributed inclusion-dependency spark

Last synced: 09 Feb 2025

https://github.com/jimthompson5802/datascience_containers

Personal docker images for various data science software stacks

data-science docker h2oai jupyter-notebook kubernetes python rstudio-servers spark

Last synced: 13 Feb 2025

https://github.com/darule0/yarndiff

A rudimentary command line utility for contrasting Apache Yarn container logs.

diff difference diffing hadoop hadoop-mapreduce hive log4j mapreduce pig spark yarn yarn2

Last synced: 23 Dec 2024

https://github.com/casassg/thesis

Undergraduate final thesis: Big Data Analytics on Container Orchestrated Systems

casassg-thesis cassandra docker kubernetes latex spark thesis zeppelin

Last synced: 17 Dec 2024

https://github.com/darule0/sparkdiff

A rudimentary command line utility for contrasting Apache Spark event logs.

apache-spark compare-files diff difference diffing spark spark-sql spark-streaming sparksql

Last synced: 06 Feb 2025

https://github.com/opt-nc/opt-temps-attente-agences-camel

Pull data from opt-temps-attente-agences-api and store it in various systems

camel datascience dataviz glia innovation kafka opensearch relation-client spark

Last synced: 12 Dec 2024

https://github.com/luisfalva/ophelia

Ophelian On Mars! More than a simple framework.

dask dataframe ophelia ophelia-spark rdd spark spark-ml spark-mllib spark-streaming

Last synced: 17 Dec 2024

https://github.com/bomada/sparkify

This project is the final Capstone project of the Udacity Data Scientist Nanodegree program. The aim is to learn how to manipulate realistic datasets with Spark to engineer relevant features for predicting churn. Input data is related to the fictional music streaming service Sparkify (similar to Spotify and Pandora).

churn ml music portfolio python spark streaming

Last synced: 09 Feb 2025

https://github.com/mxagar/spark_big_data_guide

This repository contains my personal guide on Spark and topics related to Big Data.

big-data hadoop machine-learning spark

Last synced: 23 Dec 2024

https://github.com/neo4j-field/end-to-end-fraud-demo

An example of how to load the data backing Zach's awesome Fraud Demo

graph-algorithms neo4j spark

Last synced: 23 Dec 2024

https://github.com/adelin-info/tp_datacloud

Architecture and development of large-scale distributed systems

hadoop java map-reduce scala spark yarn zookeeper

Last synced: 30 Jan 2025

https://github.com/melezhik/sparrowdo-spark

Quick Spark Installer for CentOS and Docker

centos spark sparrowdo

Last synced: 23 Dec 2024

https://github.com/ev2900/emr_studio_stock_price_demo

Demo EMR Studio notebook using PySpark to explore Stock Price Data

aws emr emr-studio spark

Last synced: 23 Dec 2024

https://github.com/ev2900/glue_spark_history_server

Host a Docker container for the Spark history server / Spark UI of AWS Glue jobs

aws glue spark spark-history-server spark-ui

Last synced: 23 Dec 2024

https://github.com/chrispyl/learning-latent-representations-for-nitrogen-response-rate-prediction

Implementation for the paper 'Learning latent representations for operational nitrogen response rate prediction'

neural-networks python spark

Last synced: 17 Jan 2025

https://github.com/vicnesterenko/apache-spark-labs

Base programs with datasets

apache-spark kpi-fict kpi-ua spark

Last synced: 10 Jan 2025

https://github.com/pierrekieffer/sparkstreaming_kafkaconsumer

Kafka consumer example based on Spark Streaming, with messages formatted into a Spark DataFrame

kafka kafka-consumer scala spark spark-streaming

Last synced: 07 Feb 2025