Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/build-on-aws/ci-cd-serverless-spark

Sample CI/CD pipeline for using GitHub Actions with Amazon EMR Serverless Spark.

amazon-emr apache-spark aws github-actions serverless spark

Last synced: 26 Dec 2024

https://github.com/udao-moo/udao-spark-optimizer

A Spark Optimizer for Adaptive, Fine-Grained Parameter Tuning

knobs-tuning modeling multi-objective-optimization optimization spark sparksql

Last synced: 11 Oct 2024

https://github.com/kimtth/pyspark-tika-text-extraction

🚴‍♂️⛷Data Lake, Performance tuning for text extraction from a huge amount of files.

apache-spark apache-tika data-pipeline datalake multithreading pyspark spark tika-python

Last synced: 25 Dec 2024

https://github.com/garystafford/dataproc-workflow-templates

Demonstration of Google Cloud Dataproc Workflow Templates

dataproc gcp google-cloud-platform hadoop pyspark spark

Last synced: 06 Dec 2024

https://github.com/ren294/covid-data-process

This project integrates real-time data processing and analytics using Apache NiFi, Kafka, Spark, Hive, and AWS services for comprehensive COVID-19 data insights.

airflow aws aws-ec2 aws-quicksight big-data big-data-analytics covid19-data docker docker-compose hadoop-hdfs hdfs hive kafka nifi pipeline redpanda spark spark-sql spark-streaming sparksql

Last synced: 11 Oct 2024

https://github.com/ryanchao2012/sparktw.codefight.web

A LeetCode-style website for the sparktw hackathon competition.

codefight django hackathon leetcode spark

Last synced: 13 Nov 2024

https://github.com/radanalyticsio/workshop-notebook

Basic Jupyter notebook for learning Spark and OpenShift

containers data-science jupyter openshift spark

Last synced: 05 Nov 2024

https://github.com/iaja/scalaLDAvis

Scala-Spark port of https://github.com/bmabey/pyLDAvis for Apache Spark LDA Topic Modelling Visualisation

apache lda machine-learning scala spark visulization

Last synced: 13 Nov 2024

https://github.com/setl-framework/setl-template

A simple template to start a project with SETL

etl framework scala setl spark template

Last synced: 13 Nov 2024

https://github.com/vitalibo/spark-aws-orchestration

Deployment/Orchestration of Apache Spark applications on Amazon EMR.

aws cloudformation emr spark step-functions

Last synced: 07 Nov 2024

https://github.com/djamelinfo/randomwalk-timeseriesgenerator-on-spark

This is a generator, where a random number is drawn from a Gaussian distribution N(0,1), then at each time point a new number is drawn from this distribution and added to the value of the last number.

data-mining indexing java random randomwalk scala spark time-series

Last synced: 17 Dec 2024
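
The generator described in the randomwalk-timeseriesgenerator-on-spark entry above can be sketched in a few lines. This is a minimal plain-Python illustration of the stated rule only (start from one N(0,1) draw, then add a fresh N(0,1) draw at each time point); it is not the repository's Scala/Spark code, and the function name `random_walk` and its `seed` parameter are assumptions made for this example.

```python
import random
from typing import List, Optional


def random_walk(n_points: int, seed: Optional[int] = None) -> List[float]:
    """Return a random-walk series with n_points values.

    The first value is a single draw from N(0, 1). Each later value is the
    previous value plus a new draw from the same distribution.
    (Hypothetical helper for illustration; not taken from the repository.)
    """
    if n_points <= 0:
        return []
    rng = random.Random(seed)  # local generator, so a fixed seed is reproducible
    series = [rng.gauss(0.0, 1.0)]
    for _ in range(n_points - 1):
        series.append(series[-1] + rng.gauss(0.0, 1.0))
    return series


if __name__ == "__main__":
    # Print a short series; the values depend on the seed.
    print(random_walk(5, seed=42))
```

Passing the same `seed` twice yields the same series, which is convenient when generated data has to be compared across runs.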

https://github.com/fatmali/pyspark-openmrs-etl

Apache Spark notebook to perform ETL processes on OpenMRS data

debezium docker kafka openmrs python spark

Last synced: 28 Oct 2024

https://github.com/aureliusivan/spotify-recommender-system-with-word2vec

This project is a recommender system for Spotify songs. The system uses the Word2Vec model to find similar songs based on the song's lyrics.

keras mongodb spark word2vec

Last synced: 27 Oct 2024

https://github.com/surajiyer/python-data-utils

🚀 Utility classes and functions for common data science libraries

clustering etc matplotlib multiview-clustering nlp pandas sklearn spark statsmodels utilities

Last synced: 04 Feb 2025

https://github.com/manuparra/instalacion-bigdata-upnavarra

Workshop on installing Hadoop, HDFS, Spark, Scala, and R for data mining / ML in multi-node mode

bigdata hadoop hdfs multinode scala setup spark workshop

Last synced: 07 Nov 2024

https://github.com/renardeinside/wikiflow

Wikipedia updates streaming, transformation and visualisation

akka-http apache-spark kafka spark spark-streaming visualization wikipedia

Last synced: 12 Dec 2024

https://github.com/myxof/sparknotes

Spark 2.0 study notes

distributed-computing spark spark-sql

Last synced: 15 Oct 2024

https://github.com/laravel/spark-next-docs

The Spark documentation.

laravel paddle php spark stripe

Last synced: 04 Feb 2025

https://github.com/wittline/moving-average-spark

How to Compute Moving Average with Spark

databricks hadoop moving-average spark

Last synced: 14 Oct 2024

https://github.com/mliarakos/spark-typed-ops

Lightweight type-safe operations for Spark

scala scala-macros shapeless spark spark-scala spark-sql

Last synced: 01 Feb 2025

https://github.com/sircamp/mushrooms-ml-classfier-scala-spark

Various machine learning algorithms used to classify mushrooms and predict whether they are poisonous

classification machine-learning machine-learning-algorithms scala spark

Last synced: 19 Nov 2024

https://github.com/mgarralda/hadoop-spark-cluster

Repository containing Docker images for creating a Spark cluster on Hadoop YARN.

hadoop-hdfs spark spark-cluster spark-hadoop spark-hadoop-docker spark-yarn-docker

Last synced: 11 Nov 2024

https://github.com/fpoli/view-spark-timeline

Visualize in an SVG the timeline of an Apache Spark execution.

cli spark visualization

Last synced: 15 Oct 2024

https://github.com/minhthong582000/my-data-stack

A simple Big data stack with Docker

docker docker-compose hadoop spark

Last synced: 13 Jan 2025

https://github.com/geotrellis/geotrellis-streaming-demo

A demo project that shows a GeoTrellis streaming application example

geotrellis gis kafka spark streaming

Last synced: 11 Nov 2024

https://github.com/flaviostutz/spark-scala-jupyter

Jupyter notebook server prepared for running Spark with Scala kernels on a remote Spark master

hdfs hdfs-cluster hdfs-docker jupyter jupyter-notebook scala scala-spark spark spark-sql

Last synced: 24 Oct 2024

https://github.com/duynguyenhoang/til

📝 Today I Learned

bigdata spark til

Last synced: 30 Nov 2024

https://github.com/fancellu/spark-streaming-examples

A few Spark Streaming examples

akka spark spark-streaming

Last synced: 10 Nov 2024

https://github.com/leozqin/etl-markup-toolkit

ETL Markup Toolkit is a spark-native tool for expressing ETL transformations as configuration

etl pyspark spark

Last synced: 29 Nov 2024

https://github.com/bitlap/kspark

Kotlin for Apache Spark

apache-spark kotlin spark

Last synced: 19 Jan 2025

https://github.com/aixhunter/spark-k8s-pod-template

Steps to deploy a Spark app to Kubernetes cluster using spark-submit or a pod template

k8s kubernetes pod spark spark-cluster spark-submit

Last synced: 25 Jan 2025

https://github.com/dineshkarthik/n-gram_processor

Uses n-grams to find the words, and their frequency of occurrence, that appear in a specific order at a specific distance from a given word within a directory, subdirectory, or text file.

ngrams pyspark spark

Last synced: 17 Nov 2024

https://github.com/americanexpress/bloom

BLooM is a configuration-driven big data framework for loading massive data into MemSQL

bulk-loader java memsql spark

Last synced: 10 Nov 2024

https://github.com/r-spark/sparklyr.confluent.avro

Confluent Schema Registry avro support for sparklyr

avro confluent rstats spark sparklyr

Last synced: 12 Jan 2025

https://github.com/abronte/pysparkproxy

Seamlessly execute pyspark code on remote clusters

bigdata pyspark python spark

Last synced: 28 Oct 2024

https://github.com/manuelgil/vscode-codeigniter4-spark

CodeIgniter 4 Spark is a Visual Studio Code extension that provides a set of useful commands and shortcuts for CodeIgniter 4 framework.

codeigniter commands spark vscode vscode-extension

Last synced: 19 Nov 2024

https://github.com/aamend/ml-registry

Enabling continuous delivery and improvement of Spark pipeline models through devops methodology and ML governance

datascience devops machinelearning maven ml nexus spark

Last synced: 08 Nov 2024

https://github.com/cclient/spark-java-mongo-demo

hadoop-on-mongo demo migrated to spark-on-hadoop-mongo, then migrated to mongo-spark-connector

hadoop mongodb spark

Last synced: 16 Nov 2024

https://github.com/cclient/spark-streaming-kafka-offset-mysql

Maintains Kafka offsets in MySQL; supports tracking and rolling back to a given 'abnormal' point in time to re-consume messages

mysql offset spark spark-streaming

Last synced: 16 Nov 2024

https://github.com/fpopic/wt-interview-challenge

(Interview) WT Data Engineer Interview Challenge

csv mysql scala spark spark-dataset sparksql

Last synced: 10 Jan 2025

https://github.com/zejnilovic/scala-spark-template.g8

Scala + Spark template using Giter8

giter8-template sbt scala spark template

Last synced: 07 Nov 2024

https://github.com/absaoss/spark-data-standardization

A library for Spark that helps to standardize any input data (DataFrame) to adhere to the provided schema.

data-quality data-structures scala schema spark

Last synced: 07 Nov 2024

https://github.com/sankamuk/spark-kubernetes

Production run of Apache Spark on Kubernetes

airflow apache-spark iac kubernetes spark

Last synced: 13 Nov 2024

https://github.com/salva/fastdbfs

fastdbfs - An interactive command line client for Databricks DBFS.

databricks dbfs spark

Last synced: 16 Nov 2024

https://github.com/comcast/pxscene-ui

A declarative JavaScript library for building React-ish UI components for pxScene (aka Spark) apps

component frontend javascript library px2react pxscene react spark ui

Last synced: 14 Nov 2024

https://github.com/tjc-lp/spark-instructor

A library for building structured LLM responses with Spark

databricks llm pydantic pydantic-v2 spark

Last synced: 12 Oct 2024

https://github.com/cumberlandgroup/node-red-contrib-spark

Node-RED Nodes to integrate with the Cisco Webex Teams API

cisco node-red spark

Last synced: 02 Feb 2025

https://github.com/erikerlandson/spark-tekton-demo

Demo of running Apache Spark jobs using Tekton and s2i workflows

apache-spark kubernetes openshift openshift-pipelines s2i spark tekton tekton-pipelines tektoncd

Last synced: 06 Jan 2025

https://github.com/9bow/komoranrestapiserver

Simple RESTful API Server for KOMORAN

java-8 komoran nlp restful-api spark sparkjava sparkjava-framework

Last synced: 18 Dec 2024

https://github.com/rootsongjc/spark-on-k8s

Spark on Kubernetes documentation in Chinese - https://jimmysong.io/spark-on-k8s

kubernetes spark

Last synced: 27 Oct 2024

https://github.com/rberenguel/identity-graphs

Presentation about Graphframes and how we handle graphs with more than 2 billion nodes at Hybrid Theory

graphframes spark

Last synced: 02 Feb 2025

https://github.com/pedro-manoel/iot-analytics-solution-tcc

🎓 Repository with the IoT Analytics solution developed as part of the final undergraduate project (Trabalho de Conclusão de Curso, TCC) for the Computer Science degree at the Universidade Federal de Campina Grande (UFCG)

analytics business-intelligence druid iot iot-analytics kafka nifi real-time spark spark-structured-streaming superset tcc ufcg

Last synced: 28 Jan 2025

https://github.com/kairen/spark-ceph-example

Learning how to integrate Ceph S3 with Spark.

ceph java librados rados spark

Last synced: 26 Dec 2024

https://github.com/jaehyeon-kim/iceberg-etl-demo

Data Warehousing ETL Demo with Apache Iceberg on EMR Local Environment

datawarehousing emr etl iceberg scd spark

Last synced: 17 Dec 2024

https://github.com/joristruong/youtube-setl

Youtube SETL is an exercise project for practicing the SETL Framework: https://github.com/JCDecaux/setl

etl exercise practice scala setl-framework spark

Last synced: 18 Dec 2024

https://github.com/fiqryq/native-ui-spark-ar

Script Native UI Spark AR.

ar facebook js spark

Last synced: 26 Jan 2025

https://github.com/canerturkseven/forecastflowml

🧙 Scalable machine learning forecasting framework with Pyspark

forecasting lightgbm machine-learning python spark time-series xgboost

Last synced: 27 Oct 2024

https://github.com/zero323/dlt

Mirror of https://gitlab.com/zero323/dlt

apache-spark delta delta-io delta-lake r rstats spark sparkr

Last synced: 27 Oct 2024

https://github.com/aikuyun/spark-all

Spark Core, SQL, Streaming, and MLlib

ml mllib scala spark

Last synced: 17 Dec 2024

https://github.com/wangshibiaoflytiger/springmvc

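Java Spring project scaffold, mainly for learning and technical research. Technologies covered (spring + springboot + gradle project build + mybatisplus + redis + HikariCP data source + scheduled tasks + AOP aspects + custom filter + custom interceptor + Alibaba Cloud Object Storage (OSS) + kafka message queue + shiro authentication and authorization + mixed scala and java programming + spark for big data + orm springdatajpa + orm jooq + jacoco test report generation + sonar project analysis report generation)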

aop cron filter hikaricp interceptor jacoco java jooq jpa kafka mybatis orm oss redis scala shiro sonar spark spring springboot

Last synced: 10 Nov 2024

https://github.com/atalii/adage

ada privilege escalation

ada security spark sudo

Last synced: 26 Oct 2024

https://github.com/shutterstock/spark-phrases

phrase detection using Google's Word2phrase

ml nlp python spark spark-ml

Last synced: 21 Jan 2025

https://github.com/walshydev/spark-ratelimiter

Easy rate-limit implementation for SparkJava.

java rate-limiter rate-limiting ratelimit spark sparkjava

Last synced: 08 Nov 2024

https://github.com/konradmalik/spark-kafka-cassandra

This is an example/demo of Kafka - Spark Streaming - Cassandra/Kafka interoperability, with Spark Streaming as a focal point.

cassandra kafka spark spark-streaming streaming

Last synced: 16 Nov 2024

https://github.com/abhirockzz/synapse-azure-data-explorer-101

Getting started with Azure Synapse and Azure Data Explorer

azure-data-explorer azure-synapse-analytics pyspark python spark

Last synced: 21 Dec 2024

https://github.com/harpin-ai/toolkit-examples

Examples for trying out the harpin AI identity resolution and data quality toolkit

data-engineering data-quality dedupe deduplication entity-resolution identity identity-resolution spark

Last synced: 01 Nov 2024

https://github.com/cbozan/graduation-project

Graduation project categorizes popular search phrases using Python and Spark and presents them on a website to inspire creators.

crisp-dm data-cleaning data-science machine-learning nlp nlp-machine-learning spark spark-mllib

Last synced: 23 Nov 2024

https://github.com/brooksian/censusecon

Data Mining Census ECON using Apache Spark

spark sparksql zeppelin-notebook

Last synced: 18 Nov 2024

https://github.com/sneaksanddata/hadoop-fs-wrapper

Python Wrappers for Hadoop FileSystem

distributed-computing hadoop spark

Last synced: 11 Nov 2024

https://github.com/utnaf/neo4j-connector-apache-spark-notebooks

Collection of notebooks to get started with Neo4j Connector for Apache Spark

neo4j spark

Last synced: 14 Dec 2024

https://github.com/amadeusitgroup/elastic-scaling

Elastic scaling is a library that controls the number of resources (executors or workers) instantiated by a Spark Structured Streaming job in order to optimize the effective microbatch duration.

spark spark-structured-streaming

Last synced: 10 Nov 2024

https://github.com/brooksian/sbir_tfidf_kmeans

Document clustering using KMeans on TF/IDF features on Small Business Innovation Research (SBIR) data

machine-learning spark sparksql zeppelin-notebook

Last synced: 18 Nov 2024

https://github.com/chezou/sparklytd

sparklyr plugin for td-spark to connect to TD from R

dplyr spark sparklyr treasuredata

Last synced: 15 Oct 2024

https://github.com/smola/spark-glusterfs-example

An example of Apache Spark integration with GlusterFS.

example-project glusterfs maven scala spark

Last synced: 09 Feb 2025

https://github.com/samueleresca/deequ.net

deequ.NET is a port of the awslabs/deequ library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

apache-spark bigdata deequ dotnet spark

Last synced: 11 Nov 2024

https://github.com/us8945/aws_emr_pysparkling

Set up a Python environment on an AWS EMR cluster with H2O Sparkling Water (PySparkling)

aws emr h2o jupyter-notebook pyspark pysparkling spark sparkling-water

Last synced: 21 Jan 2025

https://github.com/cevheri/spark-tutorial

Apache Spark Tutorial - Scala, Java, Python code samples

cassandra java kafka kafka-consumer kafka-producer mongodb python scala spark spark-streaming

Last synced: 09 Nov 2024

https://github.com/nmarus/node-red-contrib-spark

Node-RED Nodes to integrate with the Cisco Webex Teams API

cisco node-red spark

Last synced: 25 Oct 2024

https://github.com/hafen/strata2017

Repository for the "Exploration and visualization of large, complex datasets with R, Hadoop, and Spark" tutorial at Strata Hadoop World 2017

r spark tutorial

Last synced: 27 Oct 2024

https://github.com/nickjer/singularity-rstudio-spark

Apache Spark with RStudio and the sparklyr package in a Singularity container

rstudio-server singularity-image spark

Last synced: 14 Nov 2024

https://github.com/dllllb/ml-pipelines-tutorial

SciKit-Learn vs Apache Spark pipelines

machine-learning scikit-learn spark

Last synced: 19 Jan 2025

https://github.com/denuvosoftwaresolutions/fighting-bots-at-scale

Fighting Bots at Scale: Identifying Bottlenecks & Best Practice

anti-cheat botting spark

Last synced: 25 Dec 2024

https://github.com/timvisee/hhs-p7-movie-recommendation-engine

🎥 Big data project for college (HHS) period 7

algorithm hadoop recommendation-engine spark

Last synced: 15 Jan 2025

https://github.com/conema/spark-terraform

This project creates a Hadoop and Spark cluster on Amazon AWS with Terraform

aws cluster hadoop hadoop-cluster hcl spark spark-clusters terraform

Last synced: 20 Nov 2024