Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/jgperrin/net.jgp.books.spark.ch12

Spark in Action, 2nd edition - chapter 12 - Transforming your data

apache-spark java java8 manning spark sparkwithjava transformation

Last synced: 09 Nov 2024

https://github.com/jgperrin/net.jgp.books.spark.ch99

Spark in Action, 2nd edition - chapter 99

apache-spark java java8 manning spark sparkwithjava

Last synced: 09 Nov 2024

https://github.com/iaja/scalaLDAvis

Scala-Spark port of https://github.com/bmabey/pyLDAvis for Apache Spark LDA Topic Modelling Visualisation

apache lda machine-learning scala spark visulization

Last synced: 13 Nov 2024

https://github.com/leozqin/etl-markup-toolkit

ETL Markup Toolkit is a spark-native tool for expressing ETL transformations as configuration

etl pyspark spark

Last synced: 29 Nov 2024

https://github.com/keks51/spark-salesforce

Spark Salesforce connector

salesforce soap spark sparkstreaming

Last synced: 31 Dec 2024

https://github.com/geotrellis/geotrellis-streaming-demo

A demo project that shows a GeoTrellis streaming application example

geotrellis gis kafka spark streaming

Last synced: 11 Nov 2024

https://github.com/ryanchao2012/sparktw.codefight.web

This is a LeetCode-style website for the sparktw hackathon competition.

codefight django hackathon leetcode spark

Last synced: 13 Nov 2024

https://github.com/flaviostutz/spark-scala-jupyter

Jupyter notebook server prepared for running Spark with Scala kernels on a remote Spark master

hdfs hdfs-cluster hdfs-docker jupyter jupyter-notebook scala scala-spark spark spark-sql

Last synced: 24 Oct 2024

https://github.com/kimtth/pyspark-tika-text-extraction

🚴‍♂️⛷Data Lake, Performance tuning for text extraction from a huge amount of files.

apache-spark apache-tika data-pipeline datalake multithreading pyspark spark tika-python

Last synced: 25 Dec 2024

https://github.com/piotr-kalanski/big-data-dev-environment-docker

Big Data Development environment based on Docker

big-data docker elasticsearch hadoop kafka kibana spark

Last synced: 27 Oct 2024

https://github.com/ren294/covid-data-process

This project integrates real-time data processing and analytics using Apache NiFi, Kafka, Spark, Hive, and AWS services for comprehensive COVID-19 data insights.

airflow aws aws-ec2 aws-quicksight big-data big-data-analytics covid19-data docker docker-compose hadoop-hdfs hdfs hive kafka nifi pipeline redpanda spark spark-sql spark-streaming sparksql

Last synced: 11 Oct 2024

https://github.com/kanchishimono/scopt

Calculate optimized properties of Spark configuration

pyspark python python3 spark

Last synced: 28 Nov 2024

https://github.com/nhsdigital/rap_example_pipeline_python

An example pipeline made in a RAP friendly way, using Python

aggregation artificial hospital-episode-statistics pyspark python spark

Last synced: 23 Dec 2024

https://github.com/spratiher9/exelog

Enabling meticulous logging for your Spark Applications

analytics apache apache-spark aws azure bigdata databricks gcp logging pyspark python spark spark-utils

Last synced: 12 Oct 2024

https://github.com/vitalibo/spark-aws-orchestration

Deployment/Orchestration of Apache Spark applications on Amazon EMR.

aws cloudformation emr spark step-functions

Last synced: 07 Nov 2024

https://github.com/radanalyticsio/workshop-notebook

Basic Jupyter notebook for learning Spark and OpenShift

containers data-science jupyter openshift spark

Last synced: 05 Nov 2024

https://github.com/renardeinside/wikiflow

Wikipedia updates streaming, transformation and visualisation

akka-http apache-spark kafka spark spark-streaming visualization wikipedia

Last synced: 12 Dec 2024

https://github.com/mahmoud-nfz/football-big-data

This is a comprehensive solution for real-time football analytics, leveraging Apache Spark on YARN for both streaming and batch processing, Hadoop HDFS for distributed storage, Kafka for real-time data ingestion, RethinkDB for live data updates, a custom-built search engine, and Next.js for data visualization.

hadoop hadoop-hdfs kafka nextjs rethinkdb search-engine spark spark-streaming t3-stack

Last synced: 10 Oct 2024

https://github.com/sircamp/mushrooms-ml-classfier-scala-spark

Different machine learning algorithms used to classify mushrooms and predict whether they are poisonous

classification machine-learning machine-learning-algorithms scala spark

Last synced: 19 Nov 2024

https://github.com/myxof/sparknotes

Spark 2.0 study notes

distributed-computing spark spark-sql

Last synced: 15 Oct 2024

https://github.com/manuparra/instalacion-bigdata-upnavarra

Workshop on installing Hadoop, HDFS, Spark, Scala, and R for data mining / ML in multi-node mode

bigdata hadoop hdfs multinode scala setup spark workshop

Last synced: 07 Nov 2024

https://github.com/mgarralda/hadoop-spark-cluster

Repository containing Docker images for creating a Spark cluster on Hadoop YARN.

hadoop-hdfs spark spark-cluster spark-hadoop spark-hadoop-docker spark-yarn-docker

Last synced: 11 Nov 2024

https://github.com/garystafford/dataproc-workflow-templates

Demonstration of Google Cloud Dataproc Workflow Templates

dataproc gcp google-cloud-platform hadoop pyspark spark

Last synced: 06 Dec 2024

https://github.com/aureliusivan/spotify-recommender-system-with-word2vec

This project is a recommender system for Spotify songs. The system uses the Word2Vec model to find similar songs based on the song's lyrics.

keras mongodb spark word2vec

Last synced: 27 Oct 2024

https://github.com/zero323/dlt

Mirror of https://gitlab.com/zero323/dlt

apache-spark delta delta-io delta-lake r rstats spark sparkr

Last synced: 27 Oct 2024

https://github.com/atalii/adage

ada privilege escalation

ada security spark sudo

Last synced: 26 Oct 2024

https://github.com/harpin-ai/toolkit-examples

Examples for trying out the harpin AI identity resolution and data quality toolkit

data-engineering data-quality dedupe deduplication entity-resolution identity identity-resolution spark

Last synced: 01 Nov 2024

https://github.com/fiqryq/native-ui-spark-ar

Script Native UI Spark AR.

ar facebook js spark

Last synced: 27 Nov 2024

https://github.com/aikuyun/spark-all

Spark Core, SQL, Streaming, and MLlib

ml mllib scala spark

Last synced: 17 Dec 2024

https://github.com/sneaksanddata/hadoop-fs-wrapper

Python Wrappers for Hadoop FileSystem

distributed-computing hadoop spark

Last synced: 11 Nov 2024

https://github.com/abronte/pysparkproxy

Seamlessly execute pyspark code on remote clusters

bigdata pyspark python spark

Last synced: 28 Oct 2024

https://github.com/abhirockzz/synapse-azure-data-explorer-101

Getting started with Azure Synapse and Azure Data Explorer

azure-data-explorer azure-synapse-analytics pyspark python spark

Last synced: 21 Dec 2024

https://github.com/smola/spark-glusterfs-example

An example of Apache Spark integration with GlusterFS.

example-project glusterfs maven scala spark

Last synced: 16 Dec 2024

https://github.com/joristruong/youtube-setl

Youtube SETL is a project that aims at providing a project exercise to practice the SETL Framework: https://github.com/JCDecaux/setl

etl exercise practice scala setl-framework spark

Last synced: 18 Dec 2024

https://github.com/walshydev/spark-ratelimiter

Easy rate-limit implementation for SparkJava.

java rate-limiter rate-limiting ratelimit spark sparkjava

Last synced: 08 Nov 2024

https://github.com/rootsongjc/spark-on-k8s

Spark on Kubernetes documentation in Chinese - https://jimmysong.io/spark-on-k8s

kubernetes spark

Last synced: 27 Oct 2024

https://github.com/kairen/spark-ceph-example

Learning how to integrate Ceph S3 with Spark.

ceph java librados rados spark

Last synced: 26 Dec 2024

https://github.com/r-spark/sparklyr.confluent.avro

Confluent Schema Registry avro support for sparklyr

avro confluent rstats spark sparklyr

Last synced: 12 Jan 2025

https://github.com/amadeusitgroup/elastic-scaling

Elastic scaling is a library that controls the number of resources (executors or workers) instantiated by a Spark Structured Streaming job in order to optimize the effective micro-batch duration.

spark spark-structured-streaming

Last synced: 10 Nov 2024

https://github.com/mliarakos/spark-typed-ops

Lightweight type-safe operations for Spark

scala scala-macros shapeless spark spark-scala spark-sql

Last synced: 05 Dec 2024

https://github.com/manuelgil/vscode-codeigniter4-spark

CodeIgniter 4 Spark is a Visual Studio Code extension that provides a set of useful commands and shortcuts for CodeIgniter 4 framework.

codeigniter commands spark vscode vscode-extension

Last synced: 19 Nov 2024

https://github.com/tjc-lp/spark-instructor

A library for building structured LLM responses with Spark

databricks llm pydantic pydantic-v2 spark

Last synced: 12 Oct 2024

https://github.com/absaoss/spark-data-standardization

A library for Spark that helps standardize any input data (DataFrame) so that it adheres to the provided schema.

data-quality data-structures scala schema spark

Last synced: 07 Nov 2024

https://github.com/zejnilovic/scala-spark-template.g8

Scala + Spark template using Giter8

giter8-template sbt scala spark template

Last synced: 07 Nov 2024

https://github.com/nmarus/node-red-contrib-spark

Node-RED Nodes to integrate with the Cisco Webex Teams API

cisco node-red spark

Last synced: 25 Oct 2024

https://github.com/brooksian/sbir_tfidf_kmeans

Document clustering using KMeans on TF/IDF features on Small Business Innovation Research (SBIR) data

machine-learning spark sparksql zeppelin-notebook

Last synced: 18 Nov 2024

https://github.com/dineshkarthik/n-gram_processor

Uses n-grams to get the set of words, and their frequency of occurrence, that appear in a specific order at a specific distance from a given word in a directory, subdirectory, or text file.

ngrams pyspark spark

Last synced: 17 Nov 2024

https://github.com/cclient/spark-streaming-kafka-offset-mysql

Maintains Kafka offsets in MySQL; supports tracking and rolling back to a given 'abnormal' point in time to re-consume from there

mysql offset spark spark-streaming

Last synced: 16 Nov 2024

https://github.com/cclient/spark-java-mongo-demo

hadoop-on-mongo demo migrated to spark-on-hadoop-mongo, then migrated to mongo-spark-connector

hadoop mongodb spark

Last synced: 16 Nov 2024

https://github.com/aamend/ml-registry

Enabling continuous delivery and improvement of Spark pipeline models through devops methodology and ML governance

datascience devops machinelearning maven ml nexus spark

Last synced: 08 Nov 2024

https://github.com/konradmalik/spark-kafka-cassandra

This is an example/demo of Kafka - Spark Streaming - Cassandra/Kafka interoperability, with Spark Streaming as a focal point.

cassandra kafka spark spark-streaming streaming

Last synced: 16 Nov 2024

https://github.com/shutterstock/spark-phrases

phrase detection using Google's Word2phrase

ml nlp python spark spark-ml

Last synced: 21 Jan 2025

https://github.com/brooksian/censusecon

Data Mining Census ECON using Apache Spark

spark sparksql zeppelin-notebook

Last synced: 18 Nov 2024

https://github.com/canerturkseven/forecastflowml

🧙 Scalable machine learning forecasting framework with Pyspark

forecasting lightgbm machine-learning python spark time-series xgboost

Last synced: 27 Oct 2024

https://github.com/rberenguel/identity-graphs

Presentation about GraphFrames and how we handle graphs with more than 2 billion nodes at Hybrid Theory

graphframes spark

Last synced: 06 Dec 2024

https://github.com/utnaf/neo4j-connector-apache-spark-notebooks

Collection of notebooks to get started with Neo4j Connector for Apache Spark

neo4j spark

Last synced: 14 Dec 2024

https://github.com/fpopic/wt-interview-challenge

(Interview) WT Data Engineer Interview Challenge

csv mysql scala spark spark-dataset sparksql

Last synced: 10 Jan 2025

https://github.com/jaehyeon-kim/iceberg-etl-demo

Data Warehousing ETL Demo with Apache Iceberg on EMR Local Environment

datawarehousing emr etl iceberg scd spark

Last synced: 17 Dec 2024

https://github.com/chezou/sparklytd

sparklyr plugin for td-spark to connect to TD from R

dplyr spark sparklyr treasuredata

Last synced: 15 Oct 2024

https://github.com/9bow/komoranrestapiserver

Simple RESTful API Server for KOMORAN

java-8 komoran nlp restful-api spark sparkjava sparkjava-framework

Last synced: 18 Dec 2024

https://github.com/americanexpress/bloom

BLooM is a configuration driven bigdata framework to load massive data into MemSQL

bulk-loader java memsql spark

Last synced: 10 Nov 2024

https://github.com/samueleresca/deequ.net

deequ.NET is a port of the awslabs/deequ library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

apache-spark bigdata deequ dotnet spark

Last synced: 11 Nov 2024

https://github.com/sankamuk/spark-kubernetes

Production run of Apache Spark on Kubernetes

airflow apache-spark iac kubernetes spark

Last synced: 13 Nov 2024

https://github.com/erikerlandson/spark-tekton-demo

Demo of running Apache Spark jobs using Tekton and s2i workflows

apache-spark kubernetes openshift openshift-pipelines s2i spark tekton tekton-pipelines tektoncd

Last synced: 06 Jan 2025

https://github.com/us8945/aws_emr_pysparkling

Set up a Python environment on an AWS EMR cluster with H2O Sparkling Water (PySparkling)

aws emr h2o jupyter-notebook pyspark pysparkling spark sparkling-water

Last synced: 21 Jan 2025

https://github.com/hafen/strata2017

Repository for the "Exploration and visualization of large, complex datasets with R, Hadoop, and Spark" tutorial at Strata Hadoop World 2017

r spark tutorial

Last synced: 27 Oct 2024

https://github.com/salva/fastdbfs

fastdbfs - An interactive command line client for Databricks DBFS.

databricks dbfs spark

Last synced: 16 Nov 2024

https://github.com/bitlap/kspark

Kotlin for Apache Spark

apache-spark kotlin spark

Last synced: 19 Jan 2025

https://github.com/wangshibiaoflytiger/springmvc

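Java Spring project development scaffold, mainly for learning and technical research. Technologies involved (spring + springboot + Gradle project build + mybatisplus + redis + HikariCP data source + scheduled tasks + AOP aspects + custom filter + custom interceptor + Alibaba Cloud object storage OSS + Kafka message queue + Shiro authentication and authorization + mixed Scala and Java programming + big data with Spark + ORM springdatajpa + ORM jooq + JaCoCo test report generation + Sonar project analysis report generation)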

aop cron filter hikaricp interceptor jacoco java jooq jpa kafka mybatis orm oss redis scala shiro sonar spark spring springboot

Last synced: 10 Nov 2024

https://github.com/cbozan/graduation-project

Graduation project that categorizes popular search phrases using Python and Spark and presents them on a website to inspire creators.

crisp-dm data-cleaning data-science machine-learning nlp nlp-machine-learning spark spark-mllib

Last synced: 23 Nov 2024

https://github.com/comcast/pxscene-ui

A declarative JavaScript library for building React-ish UI components for pxScene (aka Spark) apps

component frontend javascript library px2react pxscene react spark ui

Last synced: 14 Nov 2024

https://github.com/cevheri/spark-tutorial

Apache Spark Tutorial - Scala, Java, Python code samples

cassandra java kafka kafka-consumer kafka-producer mongodb python scala spark spark-streaming

Last synced: 09 Nov 2024

https://github.com/nickjer/singularity-rstudio-spark

Apache Spark with RStudio and the sparklyr package in a Singularity container

rstudio-server singularity-image spark

Last synced: 14 Nov 2024

https://github.com/surajiyer/python-data-utils

🚀 Utility classes and functions for common data science libraries

clustering etc matplotlib multiview-clustering nlp pandas sklearn spark statsmodels utilities

Last synced: 10 Dec 2024

https://github.com/mrcolorr/supreme-pancake

Big Data Management project: The collection of data from a network of sensors was simulated (kafka), which then had to be processed (spark) and stored (cassandraDB) in a distributed and efficient way.

big-data bigdata cassandra cassandra-cluster cassandra-database cloud cloud-computing distributed-computing distributed-database distributed-storage distributed-systems hdfs kafka maven maven-pom spark zerotier zerotier-network zerotier-one

Last synced: 13 Nov 2024

https://github.com/s8sg/spark-py-submit

A Python library to submit Spark jobs to a YARN cluster on different distributions (currently CDH, HDP)

cdh hdfs hdp python-library spark spark-clusters spark-job

Last synced: 05 Dec 2024

https://github.com/mobiletelesystems/spark-dialect-extension

Package extending the default dialect capabilities for Spark.

etl etl-components plugin-system spark

Last synced: 11 Oct 2024

https://github.com/ging/fiware-cosmos

The Cosmos Generic Enabler enables easier Big Data analysis over context, integrated with some of the most popular Big Data platforms.

analysis big-data fiware fiware-cosmos flink processing real-time-analytics spark streaming-engine

Last synced: 01 Nov 2024

https://github.com/julienpeloton/mini_spark_broker

Design and proof-of-concept for a Broker for astronomy using Apache Spark

docker kafka python spark spark-structured-streaming

Last synced: 11 Oct 2024

https://github.com/conema/spark-terraform

This project creates a Hadoop and Spark cluster on Amazon AWS with Terraform

aws cluster hadoop hadoop-cluster hcl spark spark-clusters terraform

Last synced: 20 Nov 2024

https://github.com/gabfr/truck-data-wrangler

ELT (Extract, Load, Transform) process of accelerometer/gyroscope events with Apache Spark (w/ Structured Streaming) and TimescaleDB

data-classification spark stream timescaledb

Last synced: 07 Dec 2024

https://github.com/abronte/pysparkgateway

Connect to remote Spark clusters seamlessly.

apache-spark bigdata pyspark python spark

Last synced: 28 Oct 2024

https://github.com/joeyism/commonly-used-pyspark-commands

A list of commonly used pyspark commands

common frequent pyspark python spark

Last synced: 29 Dec 2024

https://github.com/tomwhite/disq-original

A library for manipulating bioinformatics sequencing formats in Apache Spark.

bioinformatics genomics ngs sequencing spark

Last synced: 18 Dec 2024

https://github.com/imlegend19/vidspark

VidSpark is a prototype video CMS backend system powered by Spark and Elasticsearch

celery elasticsearch python redis scala spark

Last synced: 14 Jan 2025

https://github.com/multivacplatform/multivac-kaggle-titanic

Simple example of the Kaggle Titanic competition using Spark 2.2

kaggle-competition machine-learning scala spark

Last synced: 12 Jan 2025

https://github.com/logikal-io/mindlab

Data science toolbox

data jupyterlab python spark

Last synced: 12 Oct 2024

https://github.com/michabirklbauer/mahout_docker

Running Apache Mahout in Docker.

apache docker dockerfile hadoop mahout maven spark

Last synced: 04 Jan 2025