Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


Apache Spark

Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

https://github.com/santiagortiiz/advanced-data-engineering-with-databricks

Databricks. Incremental data processing, task orchestration, and production job monitoring.

big-data databricks databricks-notebooks kafka spark spark-streaming streaming

Last synced: 08 Jan 2025

https://github.com/hexnn/stark

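An easy-to-use, ultra-high-performance big data governance engine built on Spark and Debezium, suited to unified batch and stream data integration and data analysis. It supports real-time CDC data capture, large-scale data synchronization, data modeling, and OLAP data analysis.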

cdc datax debezium flink hadoop seatunnel spark

Last synced: 11 Oct 2024

https://github.com/navicore/spark-on-kubernetes

Docker image of Spark for Kubernetes

apache-spark k8s kubernetes scala spark

Last synced: 06 Nov 2024

https://github.com/fscm/packer-aws-spark

Packer template to build an AWS Apache Spark AMI

ami aws packer spark

Last synced: 07 Nov 2024

https://github.com/flarco/dbnet-python

A Python/VueJS database client (Web GUI) to access Oracle, Spark (Hive), Postgres, etc.

apache-spark database jdbc oracle postgresql spark web-gui

Last synced: 22 Jan 2025

https://github.com/kavgan/spark-examples

Examples of code in Spark

pyspark spark

Last synced: 30 Oct 2024

https://github.com/stonezhong/DataManager

Better organize data in data lake and build ETL pipeline with Web UI tool.

datalake datawarehouse etl spark sparksql

Last synced: 27 Nov 2024

https://github.com/fancellu/graphx-citymap

CityMap coding test plus 3 solutions, 1 with Spark/GraphX

graphx scalatest spark

Last synced: 10 Nov 2024

https://github.com/lucasbotang/coursera_big_data_for_data_engineers

Assignments for Big Data for Data Engineers specialization on Coursera by Yandex.

hadoop hive spark spark-sql

Last synced: 25 Nov 2024

https://github.com/mdrakiburrahman/sgx-pyspark-sql-demo

Demonstrating Confidential Analytics on Azure SGX VMs with Apache Spark and SCONE.

azure azure-sql-database docker kubernetes sgx spark

Last synced: 09 Nov 2024

https://github.com/ibmstreams/streamsx.sparkmllib

Toolkit for real-time scoring using Apache Spark MLLib library

ibm-streams spark spark-mllib-library stream-processing toolkit

Last synced: 23 Nov 2024

https://github.com/jgperrin/net.jgp.books.spark.ch13

Spark in Action, 2nd edition - chapter 13 - Transforming documents

apache-spark java java8 manning spark sparkwithjava

Last synced: 09 Nov 2024

https://github.com/duhanmin/log-router

Spark / Flink distributed runtime log collection and forwarding (Spring)

java kafka spark

Last synced: 21 Nov 2024

https://github.com/jgperrin/net.jgp.books.spark.ch10

Spark in Action, 2e - chapter 10 - Ingestion through structured streaming

bigdata book java java8 manning spark sparkstreaming sparkwithjava

Last synced: 09 Nov 2024

https://github.com/jgperrin/net.jgp.books.spark.ch16

Spark in Action, 2nd edition - chapter 16 - performance, checkpointing, and caching

apache-spark cache checkpoint java java8 manning spark sparkwithjava

Last synced: 09 Nov 2024

https://github.com/hamza88-coder/real-time-recruitment-system-with-ai-and-data-analytics

Simulation of job offers and CVs with real-time processing, classification, and analytics using Kafka, Ray, Spark, and Databricks. Includes a Flask-based recommendation system and Tableau visualizations.

apache-nifi chatbot databricks dbt delta-lake docker faiss flask k-means kafka llama3 pinecone postgresql ray redis snowflake spark sparkml

Last synced: 13 Jan 2025

https://github.com/pyaesoneaungrgn/vitepress-pilgrim-starter

Documentation template styled like Forge, Envoyer, Vapor, Jetstream, and Spark

documentation envoyer forge jetstream laravel pilgrim spark tailwindcss vapor vite vitepress vitepress-doc vitepress-starter

Last synced: 02 Jan 2025

https://github.com/coxautomotivedatasolutions/vegalite4s

Vega-Lite4s is a small library over the comprehensive Vega-Lite Javascript visualisation library, allowing you to create beautiful Vega-Lite visualisations in Scala

apache-spark scala spark vega vega-lite visualization

Last synced: 30 Sep 2024

https://github.com/pierrenodet/aruku

A Random Walk Engine for Apache Spark

deepwalk graph node2vec random-walk spark

Last synced: 10 Oct 2024

https://github.com/chasel-shao/realtimevideoanalysis

Realtime video analysis using Spark, Kafka, Zookeeper, and OpenCV

kafka opencv python spark zookeeper

Last synced: 10 Nov 2024

https://github.com/absaoss/hermes

An E2E test tool for Enceladus. Also a general DataFrame comparison tool.

atum dataset-comparison e2e-tests enceladus spark

Last synced: 07 Nov 2024

https://github.com/msukmanowsky/drpyspark

Handy utilities for debugging and tuning pyspark programs. A work in progress.

pyspark python spark tuning-pyspark-programs

Last synced: 09 Nov 2024

https://github.com/edyoda/big-data-analytics-pipeline

Build your own Big Data Analytics Pipeline using Kafka-Spark-Cassandra. Videos ->

cassandra kafka spark

Last synced: 18 Nov 2024

https://github.com/setl-framework/setl-examples

Learn SETL with examples, lessons and exercises

etl framework scala setl spark

Last synced: 13 Nov 2024

https://github.com/suvayu/emr-scripts

Shell scripts for AWS EMR clusters

aws-cli aws-emr-clusters cluster spark

Last synced: 12 Oct 2024

https://github.com/hibayesian/spark-optim

A library of scalable optimization algorithms based on Spark

machine-learning optimization-algorithms spark

Last synced: 23 Nov 2024

https://github.com/spamegg1/scalacapstone

Scala Spec Capstone

scala spark

Last synced: 21 Jan 2025

https://github.com/adidas/lakehouse-engine-docs

The Goal of this project is to provide documentation for the Lakehouse Engine framework.

big-data data-engineering data-quality databricks delta-lake framework great-expectations lakehouse lakehouse-engine spark

Last synced: 12 Oct 2024

https://github.com/datumbrain/gossub

Trigger spark-submit in Golang. A Go implementation of the famous SparkLauncher.java.

apachespark go golang spark

Last synced: 17 Nov 2024

https://github.com/mvillafuertem/scala

🤓 Examples Advanced 🧐 Projects Akka 🚀 ZIO ⚡️ Algorithms 😼 Cats

akka akka-streams aws cats cdktf kafka slick spark sttp tapir terraform zio zio-streams

Last synced: 07 Nov 2024

https://github.com/maxgekk/jspark

Simple jdbc client for Apache Spark

jdbc jdbc-client spark

Last synced: 15 Oct 2024

https://github.com/Nosto/spartann

Hyper performant kNN using Annoy for Apache Spark.

ann annoy apache-spark k-nearest-neighbors k-nearest-neighbours knn ml spark

Last synced: 04 Nov 2024

https://github.com/sneaksanddata/spark-utils

Comfy Utilities for Spark Job Authoring

distributed-computing spark

Last synced: 11 Nov 2024

https://github.com/edgararuiz-zz/sparkvis

Integrates ggvis and sparklyr

ggvis spark sparklyr visualization

Last synced: 09 Nov 2024

https://github.com/brayanjuls/diane

Hive helper functions for Apache Spark users

delta-lake hive scala spark

Last synced: 28 Oct 2024

https://github.com/trainingbypackt/big-data-processing-with-apache-spark-elearning

Efficiently tackle large datasets and perform big data analysis with Spark and Python

dataset python rdds spark spark-mllib structured-streaming

Last synced: 14 Nov 2024

https://github.com/ansrivas/yelp_dataset

Sample analysis for the latest yelp dataset using spark

scala spark yelp-dataset

Last synced: 14 Oct 2024

https://github.com/dharmeshkakadia/tpcds-hdinsight

TPCDS benchmark for various engines

benchmarking hive llap presto spark tpcds

Last synced: 18 Nov 2024

https://github.com/nandtel/spark-streaming-kafka-cassandra-starter

Application built on Spark Streaming, Kafka and Cassandra.

cassandra docker docker-compose kafka scala spark spark-streaming

Last synced: 24 Nov 2024

https://github.com/napsternxg/pubmed_selfcitationanalysis

Repository of our paper on Self-citation analysis in PubMed data

citation-analysis medline pubmed-central regression-models spark

Last synced: 13 Oct 2024

https://github.com/gerashegalov/rapids-shell

Utility to run/debug Spark RAPIDS in REPL

rapids repl spark

Last synced: 12 Oct 2024

https://github.com/naupio/pical

(Work in progress) pita is a general distributed computation system written in Erlang and based on the DAG model. This project is inspired by Douban's DPark and Apache Spark.

big-data bigdata dag data distributed distributed-computing distributed-systems erlang erlang-otp flink spark

Last synced: 13 Nov 2024

https://github.com/russellspitzer/firstsparkcassandraapp

A quick workshop on building your first Spark Cassandra Stand Alone Application

spark tutorial workshop zeppelin

Last synced: 16 Oct 2024

https://github.com/keks51/spark_plan_as_uml

Visualizing a Spark plan as a UML diagram

graph plan spark spark-streaming uml visualization

Last synced: 12 Oct 2024

https://github.com/agile-lab-dev/literate-programming-articles

Collection of articles, using the Literate Programming style, about Data Engineering and Software Tooling in general

literate-programming ruby spark spark-connect

Last synced: 17 Jan 2025

https://github.com/chen0040/spark-ml-genetic-programming

Package provides java implementation of big-data genetic programming for Apache Spark

big-data genetic-programming linear-genetic-programming rdd spark tree-genetic-programming tree-gp

Last synced: 16 Dec 2024

https://github.com/codam-coding-college/spark-sessions

Spark sessions help beginning students dissect the first larger projects of the curriculum.

codam spark spark-session

Last synced: 10 Nov 2024

https://github.com/shivam5992/classification_pipeline

A complete document classification pipeline using Apache Spark in Scala

document-classification-pipeline scala spark text-classification

Last synced: 24 Dec 2024

https://github.com/san089/spark_packaged_project

This project contains pyspark jobs to create data pipelines and shows how to distribute the project package on Cluster.

data-pipeline etl etl-framework etl-pipeline job pyspark spark

Last synced: 16 Nov 2024

https://github.com/axsaucedo/hadoop-overview

Hands on Hadoop, services, installation

ambari hadoop hdfs hive mapreduce mesos notes pig spark yarn

Last synced: 06 Nov 2024

https://github.com/varunu28/aadhar-dataset-analysis

Data analysis of AADHAR dataset using Apache Spark

analysis scala spark spark-sql

Last synced: 08 Nov 2024

https://github.com/inbravo/scala-feature-set

-:- My random Scala experiments -:-

scala spark

Last synced: 24 Nov 2024

https://github.com/terrier-org/terrier-spark

A Spark API for the Terrier.org information retrieval platform

information-retrieval spark

Last synced: 12 Oct 2024

https://github.com/jerryshao/spark-atlas-connector

A Spark Atlas connector to track data lineage in Apache Atlas

atlas spark

Last synced: 17 Dec 2024

https://github.com/navdeep-g/sdss-2019

Interpretable Machine Learning with rsparkling

data-science h2o-3 machine-learning r rsparkling spark sparklyr xai

Last synced: 06 Nov 2024

https://github.com/akarce/e2e-structured-streaming

End-to-end data pipeline that ingests, processes, and stores data. It uses Apache Airflow to schedule scripts that fetch data from an API, sends the data to Kafka, and processes it with Spark before writing to Cassandra. The pipeline, built with Python and Apache Zookeeper, is containerized with Docker for easy deployment and scalability.

airflow apache-airflow apache-kafka apache-spark big-data cassandra docker docker-compose kafka postgresql python spark zookeeper

Last synced: 12 Oct 2024

https://github.com/mikma03/spark-databricks

🔥 Master Apache Spark & Databricks! Dive into a world of big data with exclusive insights from Udemy courses, personal notes, and practical guides. Whether you're starting out or scaling new heights in data engineering, this is your ultimate resource hub! 🌟🚀

apache-spark aws big-data data-engineering databricks delta-lake etl python spark streaming

Last synced: 11 Nov 2024

https://github.com/arbox/learning-scala-for-data-science

Data Science: Scala for brave and impatient

big-data bigdata data-science datascience scala spark

Last synced: 27 Nov 2024

https://github.com/newrelic-experimental/nri-spark

This New Relic standalone integration polls the Apache Spark REST API for metrics and pushes them into New Relic using the Metrics API. It uses the New Relic Telemetry SDK for Go.

apache-spark databricks databricks-notebooks metrics newrelic nrlabs nrlabs-data nrlabs-odp spark

Last synced: 14 Nov 2024

https://github.com/angeligareta/cheaper-travelling

Project developed with Apache Spark and Kafka that works with different public streaming data APIs such as SkyScanner, GeoDB Cities, and Flixbus to find cheaper ways of travelling.

apache-spark flixbus geodb-cities kafka scala skyscanner skyscanner-api skyscanner-flight-search spark

Last synced: 22 Nov 2024

https://github.com/nineinfra/kubectl-nine

kubectl-nine is a kubectl plugin to manage NineInfra and NineClusters on Kubernetes.

airflow doirs hdfs k8s kyuubi minio nifi olap operator seatunnel spark storage-and-computing-separation superset tpcds

Last synced: 25 Oct 2024

https://github.com/azavea/hiveless

Scala API for Hive UDFs with the GIS extension

geospatial gis scala spark typelevel

Last synced: 10 Nov 2024

https://github.com/jgperrin/net.jgp.books.spark.ch15

Spark in Action, 2nd edition - chapter 15 - Aggregating your data

aggregation apache-spark java java8 manning spark sparkwithjava sql-aggregation udaf

Last synced: 09 Nov 2024

https://github.com/jgperrin/net.jgp.books.spark.ch14

Spark in Action, 2nd edition - chapter 14 - extending data transformation with UDFs

apache-spark java java8 manning spark sparkwithjava udf

Last synced: 09 Nov 2024

https://github.com/brh55/generator-spark-bot

Yeoman generator that scaffolds out a Cisco Spark bot with usability and simplicity in mind

cisco cisco-spark flint nodejs scaffold spark yeoman

Last synced: 14 Oct 2024

https://github.com/jgperrin/net.jgp.books.spark.ch05

Spark in Action, 2nd edition - chapter 5 - Deployment

apache-spark java manning spark sparkjava sparkwithjava

Last synced: 09 Nov 2024

https://github.com/chezou/amazon-movie-review

Recommendation for Amazon movie review data

factorization-machines recommendations spark

Last synced: 15 Oct 2024

https://github.com/nikoshet/monitoring-spark-on-docker

Spark Monitoring With Prometheus And Grafana Using Docker

docker docker-compose grafana hadoop hdfs monitoring node-exporter prometheus spark

Last synced: 09 Nov 2024

https://github.com/nashtech-labs/spark-streaming-gnip

An Apache Spark utility for pulling Tweets from Gnip's PowerTrack in realtime

gnip gnip-powertrack knoldus pulling-tweets realtime scala spark spark-streaming spark-utility sparkconf sparkcontext tweets

Last synced: 21 Jan 2025

https://github.com/anant/data.engineers.lunch

Resources from weekly Zoom lunches revolving around Data Engineering. Hosted by Anant Corporation.

data-engineering etl kubernetes python spark

Last synced: 18 Nov 2024

https://github.com/xxjwxc/gosparkapi

Golang client for the iFlytek Spark (Xinghuo) large language model: Go SparkApi

api golang model spark xunfei

Last synced: 28 Nov 2024

https://github.com/dharmeshkakadia/tpch-hdinsight

TPCH benchmark for various engines

benchmarking hive llap presto spark tpch

Last synced: 18 Nov 2024

https://github.com/dvgodoy/dsr-spark-appliedml

DSR Class - Applied Machine Learning with Apache Spark

spark spark-ml

Last synced: 13 Oct 2024

https://github.com/setl-framework/setl-template

A simple template to start a project with SETL

etl framework scala setl spark template

Last synced: 13 Nov 2024

https://github.com/trk54ylmz/spark-bigquery

Google BigQuery support for Spark SQL

bigquery spark

Last synced: 18 Nov 2024

https://github.com/infowangxin/bizhub

Data analysis platform integrating Kafka, Spark, and HBase, with examples included

kafka spark

Last synced: 11 Oct 2024

https://github.com/chabane/mitosis-microservice-spark-cassandra

Microservice application that uses Apache Spark, Kafka and Cassandra

cassandra dockerfile hadoop jenkinsfile kafka sbt scala spark spark-streaming

Last synced: 15 Nov 2024

https://github.com/iaja/scalaLDAvis

Scala-Spark port of https://github.com/bmabey/pyLDAvis for Apache Spark LDA Topic Modelling Visualisation

apache lda machine-learning scala spark visulization

Last synced: 13 Nov 2024

https://github.com/kimtth/pyspark-tika-text-extraction

🚴‍♂️⛷ Data Lake: performance tuning for text extraction from a huge number of files.

apache-spark apache-tika data-pipeline datalake multithreading pyspark spark tika-python

Last synced: 25 Dec 2024

https://github.com/udao-moo/udao-spark-optimizer

A Spark Optimizer for Adaptive, Fine-Grained Parameter Tuning

knobs-tuning modeling multi-objective-optimization optimization spark sparksql

Last synced: 11 Oct 2024

https://github.com/radanalyticsio/workshop-notebook

Basic Jupyter notebook for learning Spark and OpenShift

containers data-science jupyter openshift spark

Last synced: 05 Nov 2024

https://github.com/ren294/covid-data-process

This project integrates real-time data processing and analytics using Apache NiFi, Kafka, Spark, Hive, and AWS services for comprehensive COVID-19 data insights.

airflow aws aws-ec2 aws-quicksight big-data big-data-analytics covid19-data docker docker-compose hadoop-hdfs hdfs hive kafka nifi pipeline redpanda spark spark-sql spark-streaming sparksql

Last synced: 11 Oct 2024

https://github.com/garystafford/dataproc-workflow-templates

Demonstration of Google Cloud Dataproc Workflow Templates

dataproc gcp google-cloud-platform hadoop pyspark spark

Last synced: 06 Dec 2024

https://github.com/nhsdigital/rap_example_pipeline_python

An example pipeline made in a RAP-friendly way, using Python

aggregation artificial hospital-episode-statistics pyspark python spark

Last synced: 23 Dec 2024

https://github.com/kanchishimono/scopt

Calculate optimized properties of Spark configuration

pyspark python python3 spark

Last synced: 28 Nov 2024

https://github.com/geotrellis/geotrellis-streaming-demo

A demo project that shows a GeoTrellis streaming application example

geotrellis gis kafka spark streaming

Last synced: 11 Nov 2024