An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with apache-spark

A curated list of projects in awesome lists tagged with apache-spark.

https://github.com/mlflow/mlflow

The open source developer platform to build AI/LLM applications and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.

agentops agents ai ai-governance apache-spark evaluation langchain llm-evaluation llmops machine-learning ml mlflow mlops model-management observability open-source openai prompt-engineering

Last synced: 27 Dec 2025

https://github.com/lw-lin/coolplayspark

Coolplay Spark: Spark source code analysis, Spark libraries, and more

apache-spark spark spark-streaming sparkcore structured-streaming

Last synced: 14 May 2025

https://github.com/lw-lin/CoolplaySpark

Coolplay Spark: Spark source code analysis, Spark libraries, and more

apache-spark spark spark-streaming sparkcore structured-streaming

Last synced: 04 Apr 2025

https://github.com/spark-notebook/spark-notebook

Interactive and Reactive Data Science using Scala and Spark.

apache-spark data-science notebook reactive scala spark

Last synced: 14 May 2025

https://github.com/kubeflow/spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

apache-spark google-cloud-dataproc kubernetes kubernetes-controller kubernetes-crd kubernetes-operator spark

Last synced: 25 Apr 2025

https://github.com/GoogleCloudPlatform/spark-on-k8s-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

apache-spark google-cloud-dataproc kubernetes kubernetes-controller kubernetes-crd kubernetes-operator spark

Last synced: 24 Apr 2025

https://github.com/intel/bigdl

BigDL: Distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray

analytics-zoo apache-spark bigdl deep-neural-network distributed-deep-learning keras-tensorflow python pytorch scala

Last synced: 14 May 2025

https://github.com/OryxProject/oryx

Oryx 2: Lambda architecture on Apache Spark and Apache Kafka for real-time, large-scale machine learning

apache-kafka apache-spark cloudera java kafka lambda-architecture machine-learning oryx

Last synced: 27 Mar 2025

https://github.com/oryxproject/oryx

Oryx 2: Lambda architecture on Apache Spark and Apache Kafka for real-time, large-scale machine learning

apache-kafka apache-spark cloudera java kafka lambda-architecture machine-learning oryx

Last synced: 03 Oct 2025

https://github.com/japila-books/apache-spark-internals

The Internals of Apache Spark

apache-spark book internals spark

Last synced: 15 May 2025

https://github.com/databricks/learningsparkv2

This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]

apache-spark delta-lake mlflow mllib spark spark-mllib spark-sql structured-streaming

Last synced: 14 May 2025

https://github.com/databricks/spark-sklearn

(Deprecated) Scikit-learn integration package for Apache Spark

apache-spark grid-search machine-learning parameter-tuning scikit-learn

Last synced: 30 Sep 2025

https://github.com/graphframes/graphframes

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs

apache-spark big-data connected-components dataframe dataframes graphs network-motif network-motifs networks spark

Last synced: 14 May 2025

https://github.com/lucacanali/sparkmeasure

This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.

apache-spark performance-metrics performance-troubleshooting python scala spark

Last synced: 14 May 2025

https://github.com/miguno/kafka-storm-starter

[PROJECT IS NO LONGER MAINTAINED] Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.

apache-avro apache-kafka apache-spark apache-storm avro integration kafka scala spark storm

Last synced: 17 Dec 2025

https://github.com/LucaCanali/sparkMeasure

This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.

apache-spark performance-metrics performance-troubleshooting python scala spark

Last synced: 18 Jul 2025

https://github.com/mrpowers-io/quinn

PySpark methods to enhance developer productivity 📣 👯 🎉

apache-spark pyspark

Last synced: 14 Apr 2025

https://github.com/nchammas/flintrock

A command-line tool for launching Apache Spark clusters.

apache-spark apache-spark-cluster ec2 orchestration spark-ec2

Last synced: 14 May 2025

https://github.com/cerndb/dist-keras

Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.

apache-spark data-parallelism data-science deep-learning distributed-optimizers hadoop keras machine-learning optimization-algorithms tensorflow

Last synced: 03 Oct 2025

https://github.com/apache-spark-on-k8s/spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/

apache-spark kubernetes kubernetes-cluster

Last synced: 03 Oct 2025

https://github.com/openscoring/openscoring

REST web service for the true real-time scoring (<1 ms) of Scikit-Learn, R and Apache Spark models

apache-spark api lightgbm pmml r real-time scikit-learn xgboost

Last synced: 15 May 2025

https://github.com/tweag/sparkle

Haskell on Apache Spark.

analytics apache-spark haskell spark

Last synced: 16 May 2025

https://github.com/lucacanali/miscellaneous

Includes notes on using Apache Spark, with a drill-down on Spark for Physics, how to run TPCDS on PySpark, and how to create histograms with Spark. Also includes tools for stress testing and measuring CPU performance, and Jupyter notebook examples for using various DB systems.

apache-spark database jupyter-notebooks performance-analysis performance-monitoring performance-testing

Last synced: 16 May 2025

https://github.com/cartershanklin/pyspark-cheatsheet

PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster

apache-spark big-data pyspark spark

Last synced: 29 Oct 2025

https://github.com/LucaCanali/Miscellaneous

Includes notes on using Apache Spark in general, notes on using Spark for Physics, how to run TPCDS on PySpark, how to create histograms with Spark, tools for performance testing CPUs, Jupyter notebook examples for Spark, and examples for Oracle and other DB systems.

apache-spark database jupyter-notebooks performance-analysis performance-monitoring performance-testing

Last synced: 13 Apr 2025

https://github.com/ekampf/pyspark-boilerplate

A boilerplate for writing PySpark Jobs

apache-spark boilerplate pyspark python

Last synced: 05 Apr 2025

https://github.com/datamechanics/delight

A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.

apache-spark cpu dashboard delight kubernetes memory monitoring netapp-public spark spark-history-server spark-ui

Last synced: 03 Oct 2025

https://github.com/opencypher/morpheus

Morpheus brings the leading graph query language, Cypher, onto the leading distributed processing platform, Spark.

apache-spark apache2 big-data cypher graph scala

Last synced: 05 Apr 2025

https://github.com/miguno/wirbelsturm

[PROJECT IS NO LONGER MAINTAINED] Wirbelsturm is a Vagrant and Puppet based tool to perform 1-click local and remote deployments, with a focus on big data tech like Kafka.

apache-kafka apache-spark apache-storm kafka puppet spark storm vagrant

Last synced: 03 Oct 2025

https://github.com/Hydrospheredata/mist

Serverless proxy for Spark cluster

apache-spark api big-data serverless

Last synced: 27 Mar 2025

https://github.com/hydrospheredata/mist

Serverless proxy for Spark cluster

apache-spark api big-data serverless

Last synced: 05 Apr 2025

https://github.com/microsoft/data-accelerator

Data Accelerator for Apache Spark simplifies onboarding to streaming of big data. It offers a rich, easy-to-use experience to help with the creation, editing, and management of Spark jobs on Azure HDInsight or Databricks while enabling the full power of the Spark engine.

apache-spark azure big-data cosmosdb docker eventhub hdinsight iot iothub kafka kafka-streams nodejs react servicefabric spark spark-sql spark-streaming sparksql streaming streaming-data

Last synced: 15 May 2025

https://github.com/lifeomic/sparkflow

Easy-to-use library to bring TensorFlow to Apache Spark

apache-spark dataframe deep-learning lifeomic pipeline spark-ml tensorflow

Last synced: 04 Apr 2025

https://github.com/svenkreiss/pysparkling

A pure Python implementation of Apache Spark's RDD and DStream interfaces.

apache-spark data-processing data-science python

Last synced: 07 Apr 2025

https://github.com/hortonworks-spark/spark-atlas-connector

A Spark Atlas connector to track data lineage in Apache Atlas

apache-atlas apache-spark

Last synced: 28 Oct 2025

https://github.com/PiercingDan/spark-Jupyter-AWS

A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support

apache-spark apache-spark-cluster aws aws-ec2 aws-s3 ebs-volumes ec2 ec2-instance jupyter jupyter-notebook spark spark-clusters

Last synced: 19 Jul 2025

https://github.com/piercingdan/spark-jupyter-aws

A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support

apache-spark apache-spark-cluster aws aws-ec2 aws-s3 ebs-volumes ec2 ec2-instance jupyter jupyter-notebook spark spark-clusters

Last synced: 12 May 2025

https://github.com/mellanox/sparkrdma

This is an archive of the SparkRDMA project. The new repository with RDMA shuffle acceleration for Apache Spark is here: https://github.com/Nvidia/sparkucx

apache-spark big-data bigdata disni hadoop infiniband java mellanox rdma roce scala shuffle spark

Last synced: 03 Oct 2025

https://github.com/airscholar/e2e-data-engineering

An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.

apache-airflow apache-kafka apache-spark apache-zookeeper big-data cassandra containerization data-engineering data-pipeline data-processing data-storage docker etl-pipeline postgresql real-time-analytics

Last synced: 16 May 2025

https://github.com/chabane/bigdata-playground

A complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL

angular apache-flink apache-spark avro big-data docker graphql hadoop hbase kafka kops machine-learning mongodb nodejs parquet python scala spark-sql spark-streaming twitter-api

Last synced: 13 Apr 2025

https://github.com/Chabane/bigdata-playground

A complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL

angular apache-flink apache-spark avro big-data docker graphql hadoop hbase kafka kops machine-learning mongodb nodejs parquet python scala spark-sql spark-streaming twitter-api

Last synced: 28 Apr 2025

https://github.com/lynnlangit/learning-hadoop-and-spark

Companion to the Learning Hadoop and Learning Spark courses on LinkedIn Learning

apache-spark dataproc emr hadoop learning-hadoop mapreduce spark wordcount

Last synced: 16 May 2025

https://github.com/databrickslabs/automl-toolkit

Toolkit for Apache Spark ML for feature clean-up, feature importance calculation suite, information gain selection, distributed SMOTE, model selection and training, hyperparameter optimization and selection, and model interpretability.

apache-spark feature-engineering machinelearning ml pyspark scala spark

Last synced: 03 Oct 2025

https://github.com/vinta/albedo

A recommender system for discovering GitHub repos, built with Apache Spark

apache-spark elasticsearch feature-engineering machine-learning python recommender-system scala

Last synced: 09 Aug 2025

https://github.com/lamastex/scalable-data-science

Scalable Data Science: course sets in big data using Apache Spark over Databricks, and their mathematical, statistical, and computational foundations using SageMath.

apache-spark data-science databricks scala

Last synced: 16 May 2025

https://github.com/radanalyticsio/spark-operator

Operator for managing Spark clusters on Kubernetes and OpenShift.

apache-spark kubernetes kubernetes-operator openshift spark

Last synced: 07 May 2025

https://github.com/BitwiseInc/Hydrograph

A visual ETL development and debugging tool for big data

apache-spark big-data cascading etl etl-framework

Last synced: 03 Apr 2025

https://github.com/bitwiseinc/hydrograph

A visual ETL development and debugging tool for big data

apache-spark big-data cascading etl etl-framework

Last synced: 24 Oct 2025

https://github.com/SANSA-Stack/SANSA-Stack

Big Data RDF Processing and Analytics Stack built on Apache Spark and Apache Jena http://sansa-stack.github.io/SANSA-Stack/

apache-jena apache-spark distributed-computing flink rdf semantic-web spark

Last synced: 09 Jul 2025

https://github.com/sansa-stack/sansa-stack

Big Data RDF Processing and Analytics Stack built on Apache Spark and Apache Jena http://sansa-stack.github.io/SANSA-Stack/

apache-jena apache-spark distributed-computing flink rdf semantic-web spark

Last synced: 04 Apr 2025

https://github.com/archivesunleashed/aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

analysis apache-spark big-data big-data-analytics dataframe digital-humanities hadoop network-graphing pyspark python3 scala spark text-extraction webarchives

Last synced: 13 Apr 2025

https://github.com/memverge/splash

Splash, a flexible Spark shuffle manager that supports user-defined storage backends for shuffle data storage and exchange

apache-spark bigdata disaggregation elasticity java scala shuffle spark storage

Last synced: 02 Sep 2025

https://github.com/jleetutorial/scala-spark-tutorial

Project for James' Apache Spark with Scala course

apache-spark big-data scala

Last synced: 09 Apr 2025

https://github.com/googlecloudplatform/dataproc-templates

Dataproc templates and pipelines for solving simple in-cloud data tasks

apache-spark bigquery gcp google-cloud google-cloud-platform jupyter-notebook pyspark

Last synced: 15 May 2025

https://github.com/zero323/pyspark-stubs

Apache (Py)Spark type annotations (stub files).

apache-spark mypy pep484 pyspark python python-3 stub-files type-annotations

Last synced: 03 Oct 2025

https://github.com/jgperrin/net.jgp.books.spark.ch01

Spark in Action, 2nd edition - chapter 1 - Introduction

apache-spark java java8 manning spark sparkwithjava

Last synced: 20 Aug 2025

https://github.com/dimajix/flowman

Flowman is an ETL framework powered by Apache Spark. With its declarative approach, Flowman simplifies the development of complex data pipelines.

apache-spark big-data bigdata data-engineering etl flowman hadoop scala spark sql

Last synced: 04 Apr 2025