Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/guiferviz/tuberia

Data engineering meets software engineering

data data-engineering expectations pipeline python spark

Last synced: 20 Dec 2024

https://github.com/abronte/pysparkgateway

Connect to remote Spark clusters seamlessly.

apache-spark bigdata pyspark python spark

Last synced: 28 Oct 2024

https://github.com/puharesource/simplemavenrepository

A simple self-hosted Maven repository solution, written in Kotlin using the SparkJava framework.

kotlin maven repository spark sparkjava

Last synced: 01 Jan 2025

https://github.com/dustin-decker/lognom

Simple script for processing streaming data from Redis using Apache Spark

elasticsearch kafka redis spark

Last synced: 29 Jan 2025

https://github.com/yandex-cloud/yc-delta

Delta Lake for Yandex Data Processing

delta delta-lake deltalake spark yandex-cloud

Last synced: 11 Nov 2024

https://github.com/spycsh/runspec

An Android streaming running app with a backend based on Kafka, Spark, and MongoDB

android iot kafka leafletjs mongodb restlet spark stompwebsocket ubiquitous-computing

Last synced: 05 Dec 2024

https://github.com/anicolaspp/mapr-data-gen

Data generator for MapR Data Platform

data mapr mapr-db mapr-es mapr-streams maprdb parquet scala spark

Last synced: 16 Jan 2025

https://github.com/curycu/sparkstudy

Example code for Spark SQL data wrangling

scala spark

Last synced: 05 Nov 2024

https://github.com/tomwhite/disq-original

A library for manipulating bioinformatics sequencing formats in Apache Spark.

bioinformatics genomics ngs sequencing spark

Last synced: 10 Feb 2025

https://github.com/multivacplatform/multivac-kaggle-titanic

Simple example of the Kaggle Titanic competition using Spark 2.2

kaggle-competition machine-learning scala spark

Last synced: 12 Jan 2025

https://github.com/highoncarbs/lumberjack

:pick: Search and analyse your logs efficiently with Lumberjack

analysis flask log logging python spark web-dashboard

Last synced: 14 Oct 2024

https://github.com/anicolaspp/maprdbconnector

An independent MapR-DB Connector for Apache Spark that fully utilizes MapR-DB secondary indexes

database-connector mapr mapr-db maprdb-spark ojai scala spark

Last synced: 16 Nov 2024

https://github.com/hdfgroup/hdf5-spark-connector

HDF5 Connector for Apache Spark

apache hdf5 spark

Last synced: 19 Dec 2024

https://github.com/enoy19/keyboard-light-composer-mc-connector

Minecraft Forge Mod to access stats in Minecraft within the Keyboard Light Composer (https://github.com/enoy19/keyboard-light-composer)

composer forge g910 keyboard light logitech minecraft mod orion rgb spark spectrum

Last synced: 17 Jan 2025

https://github.com/aromoh/keras-distributed-streaming

Distributed Keras model for making predictions of sentiment from Spanish sentences in stream context using Spark Streaming and Apache Kafka

cnn-keras kafka keras keras-tensorflow pyspark-notebook sentiment-analysis spark spark-streaming

Last synced: 03 Feb 2025

https://github.com/dllllb/ml-pipelines-tutorial

SciKit-Learn vs Apache Spark pipelines

machine-learning scikit-learn spark

Last synced: 19 Jan 2025

https://github.com/logikal-io/mindlab

Data science toolbox

data jupyterlab python spark

Last synced: 12 Oct 2024

https://github.com/andreoss/etoile

ETL on Apache Spark

etl spark

Last synced: 30 Oct 2024

https://github.com/chaokunyang/athena

A task scheduler for Spark, Flink, MapReduce, Java, Python, and Bash

flink hadoop mapreduce spark task-manager task-scheduler

Last synced: 19 Nov 2024

https://github.com/smsraj2001/stream-batch-processing-kafka-spark

A project that simulates real-time queries with Kafka and performs stream and batch processing of the simulated queries with Spark. It follows the lambda architecture, with Kafka as the publisher and Spark as the subscriber.

batch-processing kafka kafka-topics lambda-architecture mysql-database no-api pub-sub pyspark python3 realtime spark streaming ubuntu2204 zookeeper

Last synced: 01 Jan 2025

https://github.com/imlegend19/vidspark

VidSpark is a prototype video CMS backend system powered by Spark and Elasticsearch

celery elasticsearch python redis scala spark

Last synced: 14 Jan 2025

https://github.com/exacaster/markdown_frames

Parses Markdown tables into PySpark/pandas DataFrames

pyspark pytest spark

Last synced: 11 Nov 2024

https://github.com/adrigrillo/nycsparktaxi

Apache Spark application to get the top ten frequent routes and profitable areas

big-data nyc parquet-files python spark taxi

Last synced: 10 Feb 2025

https://github.com/joeyism/commonly-used-pyspark-commands

A list of commonly used pyspark commands

common frequent pyspark python spark

Last synced: 29 Dec 2024

https://github.com/denuvosoftwaresolutions/fighting-bots-at-scale

Fighting Bots at Scale: Identifying Bottlenecks & Best Practice

anti-cheat botting spark

Last synced: 25 Dec 2024

https://github.com/jlgarridol/tfm-fis-if

Big Data Architecture of queues for real time video processing

big-data docker kafka parkinsons-disease spark streaming streaming-video

Last synced: 13 Jan 2025

https://github.com/sebastianruizm/spark-kafka-cassandra

Demo Spark Structured Streaming + Apache Kafka + Apache Cassandra

cassandra docker kafka spark structured-streaming

Last synced: 11 Nov 2024

https://github.com/orvillex/bigdata

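A collection of tutorials for recent versions of various big data components, including but not limited to MapReduce, HBase, Spark, and other mainstream technologies. Coverage of currently popular real-time computing frameworks such as Flink will follow.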

bigdata hbase java mapreduce spark

Last synced: 25 Jan 2025

https://github.com/fiqryq/spark-minimal-gray

🥰 Simple Instagram filter using Spark AR Studio by Facebook.

facebook filter spark

Last synced: 26 Jan 2025

https://github.com/jacopodl/spark

Low level network library :satellite: :zap:

c low-level network network-programming networking raw raw-data raw-sockets spark

Last synced: 31 Jan 2025

https://github.com/timvisee/hhs-p7-movie-recommendation-engine

:movie_camera: Big data project for college (HHS) period 7

algorithm hadoop recommendation-engine spark

Last synced: 15 Jan 2025

https://github.com/sircamp/spark-pspectrum

P-spectrum embedding and sequence relaxation for NLP in Spark

big-data machine-learning nlp nlp-machine-learning sequence-relaxation spark spark-ml spectrum

Last synced: 20 Jan 2025

https://github.com/s8sg/spark-py-submit

A Python library to submit Spark jobs to a YARN cluster on different distributions (currently CDH, HDP)

cdh hdfs hdp python-library spark spark-clusters spark-job

Last synced: 01 Feb 2025

https://github.com/ging/fiware-cosmos

The Cosmos Generic Enabler enables easier big data analysis over context, integrated with some of the most popular big data platforms.

analysis big-data fiware fiware-cosmos flink processing real-time-analytics spark streaming-engine

Last synced: 01 Nov 2024

https://github.com/neo4j-field/bigquery-connector

Bi-directional connectivity between Google BigQuery and Neo4j AuraDS

arrow-flight bigquery neo4j protobuf python spark

Last synced: 23 Dec 2024

https://github.com/omarhimada/floyo-ml-scala

Distributed ML for eCommerce platforms (recommendations, churn prediction, segmentation) written in Scala, using Spark MLlib, Elasticsearch and AWS SDK

aws ml scala spark

Last synced: 09 Feb 2025

https://github.com/mrcolorr/supreme-pancake

Big Data Management project: The collection of data from a network of sensors was simulated (kafka), which then had to be processed (spark) and stored (cassandraDB) in a distributed and efficient way.

big-data bigdata cassandra cassandra-cluster cassandra-database cloud cloud-computing distributed-computing distributed-database distributed-storage distributed-systems hdfs kafka maven maven-pom spark zerotier zerotier-network zerotier-one

Last synced: 13 Nov 2024

https://github.com/wlongxiang/pyspark_docker

Run a PySpark cluster with Docker on your local laptop

docker docker-compose pyspark pyspark-docker pyspark-tutorial spark

Last synced: 17 Dec 2024

https://github.com/ahmetfurkandemir/hepsiburada-data-engineering-project

Hepsiburada Data Engineering Project

docker kafka pyspark spark

Last synced: 17 Jan 2025

https://github.com/chen0040/spark-opt-moea

Distributed Multi-Objective Evolutionary Computation Framework for Spark

moea multi-objective-optimization nsga-ii spark

Last synced: 09 Feb 2025

https://github.com/radeity/spark-proxy

Push-based calculation for Spark applications

distributed-computing spark volunteer-computing

Last synced: 01 Feb 2025

https://github.com/lgautier/pragmatic-polyglot-data-analysis

Docker container for off-the-shelf jupyter notebook + Python + R + Spark/pyspark + LLVM

docker-container jupyter-notebook python r spark

Last synced: 10 Nov 2024

https://github.com/bonigarcia/spark-examples

Collection of Spark examples using Python

cassandra influxdb kafka python spark spark-streaming

Last synced: 08 Feb 2025

https://github.com/bryanyang0528/docker-cdh-spark

CDH with Spark 2.2

cdh cloudera docker spark

Last synced: 12 Jan 2025

https://github.com/majobasgall/smote-mr

SMOTE-MR: A distributed Synthetic Minority Oversampling Technique (SMOTE) for Big Data which applies a MapReduce-based approach. SMOTE-MR is categorized as an `approximated/ non exact` solution. There is also an `exact` solution called SMOTE-BD written by the author (see: https://github.com/majobasgall/smote-bd)

big-data imbalanced-data machile-learning scala smote spark

Last synced: 08 Jan 2025

https://github.com/mobiletelesystems/spark-dialect-extension

Package extending the default dialect capabilities for Spark.

etl etl-components plugin-system spark

Last synced: 11 Oct 2024

https://github.com/conema/spark-terraform

This project creates a Hadoop and Spark cluster on Amazon AWS with Terraform

aws cluster hadoop hadoop-cluster hcl spark spark-clusters terraform

Last synced: 20 Nov 2024

https://github.com/harshoza36/movielens_pyspark

MovieLens dataset analysis using Hadoop and PySpark

big-data-analytics hadoop movielens movielens-data-analysis pyspark spark spark-sql

Last synced: 10 Jan 2025

https://github.com/felipekunzler/spark-twitter-analysis

Analyse a Twitter dataset with Spark and visualize the results on a React dashboard.

java reactjs scala spark

Last synced: 30 Oct 2024

https://github.com/ashton-sidhu/sysmon-extract

Extract logs based on events from Sysmon. Comes as a package, CLI, and UI.

data-science dataengineering infosec spark streamlit sysmon threat-intelligence threathunting

Last synced: 09 Nov 2024

https://github.com/longshilin/spark-wordcount

Spark word count example | built with Eclipse + Maven + Scala Project + Spark

helloworld maven scala scala-programming spark wordcount

Last synced: 10 Nov 2024

https://github.com/vivek-bombatkar/dataworkssummit2018_spark_ml

Hands-on introduction to basic machine learning techniques with Apache Spark ML using the cloud.

apache-spark linear-regression machine-learning spark workshop

Last synced: 08 Nov 2024

https://github.com/michabirklbauer/mahout_docker

Running Apache Mahout in Docker.

apache docker dockerfile hadoop mahout maven spark

Last synced: 04 Jan 2025

https://github.com/pedropark99/introd-pyspark

An open and introductory book for the Python API of Apache Spark (pyspark)

book pyspark python spark

Last synced: 14 Oct 2024

https://github.com/superruzafa/scala-spark-big-data

My solutions to Coursera's Big Data Analysis with Scala and Spark course

big-data coursera scala spark

Last synced: 30 Dec 2024

https://github.com/kruglov-dmitry/yelp_data

End-to-end example of how to read big (well, comparatively) data from Kafka and write it into Cassandra using Spark Structured Streaming. Uses the Yelp dataset for illustration purposes.

cassandra kafka spark streaming yelp-dataset

Last synced: 19 Jan 2025

https://github.com/maxinexiong/degrees-of-separation-with-breadth-first-search

This project uses PySpark RDDs and the breadth-first search (BFS) algorithm to find the shortest path and degrees of separation between two given Marvel superheroes, based on their appearances together in the same comic books, so users can discover connections between their favourite superheroes in the Marvel universe.

apache-spark bfs-algorithm breadth-first-search degrees-of-separation marvel-characters pyspark python spark spark-rdd

Last synced: 21 Dec 2024

https://github.com/thanaraklee/dataflow-with-gcp

This project demonstrates the workflow of a Data Engineer. It utilizes the Google Cloud Platform and Google Colab as the main tools.

airflow apache-spark data-engineering etl pandas spark

Last synced: 25 Dec 2024

https://github.com/timvw/adobe-analytics-datafeed-datasource

Apache Spark data source for Adobe Analytics Data Feed

adobe-analytics clickstream python scala spark

Last synced: 08 Nov 2024

https://github.com/triandicAnt/TwitterSentimentAnalytics

Basic Twitter Sentiment Analytics using Apache Spark Streaming APIs and Python by processing live tweets from Twitter.

machine-learning python sentimental-analysis spark twitter twitter-api twitter-sentiment-analytics

Last synced: 23 Oct 2024

https://github.com/dimajix/docker-spark

Repository for building Docker containers for Spark

cluster docker hadoop spark

Last synced: 05 Jan 2025

https://github.com/renardeinside/terrametria

Source code for a 3D population density map of Germany, with ETL and app logic on top of the Databricks Platform.

databricks deckgl python react spark

Last synced: 03 Dec 2024

https://github.com/akarce/udacity-data-pipeline-with-airflow

Udacity Data Engineering Nanodegree Program, Data Pipeline with Airflow project using MinIO and PostgreSQL.

airflow minio postgresql pyspark spark

Last synced: 12 Oct 2024

https://github.com/ugurcanerdogan/machine-learning-with-spark

BBM469*ASG3 - Machine Learning with Spark

apache-spark data-science machine-learning spark

Last synced: 19 Dec 2024

https://github.com/boazmohar/pysparkutils

A collection of utilities for handling pySpark's SparkContext

pyspark python spark

Last synced: 09 Feb 2025

https://github.com/mcddhub/mcdd-big-data-study

Study project for big data (Hadoop, Zookeeper, Kafka, Flink, Spark)

big-data data-processing docker flink hadoop kafka spark zookeeper

Last synced: 09 Feb 2025

https://github.com/omar-besbes/football-big-data

This is a comprehensive solution for real-time football analytics, leveraging Apache Spark running on YARN for both streaming and batch processing, Hadoop HDFS for distributed storage, Kafka for real-time data ingestion, RethinkDB for live data updates, and Next.js for data visualization, as well as a custom-built search engine.

batch-processing hadoop kafka nextjs rethinkdb spark streaming t3-stack yarn

Last synced: 20 Jan 2025

https://github.com/nhviet03/is405_bigdata_mapreduce_knn

A MapReduce-Based k-Nearest Neighbor Approach for Big Data Classification on Apache Spark

knn-classification mapreduce pyspark spark

Last synced: 17 Jan 2025

https://github.com/afsalthaj/supaku-sukara

Functional Programming, Functional Programming Exercise Solutions in Scala & Spark

functional-programming functor language monad parallelism scala shapeless spark typeclasses

Last synced: 08 Jan 2025

https://github.com/inbravo/spark-movie-lens

Various examples of analytics using Apache Spark

apache-spark scala spark

Last synced: 02 Feb 2025

https://github.com/jinsyin/datalink

⚡ Data integration | DataLink is a lightweight data integration framework built on top of DataX, Spark and Flink

batch big-data bigdata cdc data data-collection data-exchange data-integration data-pipeline data-synchronization datalink etl flink flink-cdc framework integration pipeline spark streaming

Last synced: 15 Nov 2024

https://github.com/alfex4936/spark-studies

Studying Apache Spark in Python

apache pyspark python spark

Last synced: 27 Jan 2025

https://github.com/alexioannides/py-readme-snippets

This repository contains snippets of writing (in Markdown) on various topics relating to various flavours of Python development projects.

python python-library spark

Last synced: 17 Jan 2025

https://github.com/exacaster/delta-fetch

HTTP API on Delta Lake tables

big-data delta-lake parquet s3 spark

Last synced: 11 Nov 2024

https://github.com/kadnan/vagrant-spark2

Vagrant Box with Python 3.6.1, Apache Spark 2.1.1 with Scala 2.11.8 and PySpark (2.1.1).

pyspark python3 spark vagrant vagrant-boxes

Last synced: 20 Jan 2025

https://github.com/burhanahmed1/big-data-analytics

Practice tasks in the Python programming language using Hadoop, MRJob, and PySpark for Big Data Analytics.

apache-spark hadoop hadoop-mapreduce jupyter-notebook mrjob pyspark python spark spark-sql sparksql

Last synced: 11 Oct 2024

https://github.com/ichowdhury01/match

A social networking platform that allows users to find friends with similar interests in their area.

geolocation-api jdbc maven mysql pbkdf2 spark

Last synced: 06 Feb 2025

https://github.com/policratus/sparkmage

🐘 A tool for blazing fast analysis and clustering of similar images using 🐘 Hadoop and ⚡ Spark.

big-data computer-vision hadoop image-processing spark

Last synced: 02 Nov 2024

https://github.com/jaehyeon-kim/emr-local-dev

Spark Local Development Environment Using Docker (and vscode)

aws docker emr spark vscode

Last synced: 30 Oct 2024

https://github.com/anskarl/auxlib-spark-nlp

NLP utilities for Apache Spark

nlp opennlp scala spark

Last synced: 19 Dec 2024

https://github.com/mgrojo/adasearch

Custom search engine for the Ada programming language

ada custom-search-google search-engine spark spark-ada

Last synced: 27 Oct 2024

https://github.com/kwartile/spark-benchmark

Spark Benchmark suite to evaluate cluster configuration and compare the performance with other big data frameworks.

apache-spark benchmark benchmarking-suite cdh cloudera-hadoop hadoop hive impala performance scala spark

Last synced: 08 Feb 2025