Projects in Awesome Lists tagged with apache-spark

https://github.com/mlflow/mlflow

The open source developer platform to build AI/LLM applications and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.

agentops agents ai ai-governance apache-spark evaluation langchain llm-evaluation llmops machine-learning ml mlflow mlops model-management observability open-source openai prompt-engineering

Last synced: 06 May 2026

https://github.com/microsoft/synapseml

Simple and Distributed Machine Learning

ai apache-spark azure big-data cognitive-services data-science databricks deep-learning http lightgbm machine-learning microsoft ml model-deployment onnx opencv pyspark scala spark synapse

Last synced: 13 May 2025

https://microsoft.github.io/SynapseML/

Simple and Distributed Machine Learning

ai apache-spark azure big-data cognitive-services data-science databricks deep-learning http lightgbm machine-learning microsoft ml model-deployment onnx opencv pyspark scala spark synapse

Last synced: 29 Apr 2025

https://github.com/microsoft/SynapseML

Simple and Distributed Machine Learning

ai apache-spark azure big-data cognitive-services data-science databricks deep-learning http lightgbm machine-learning microsoft ml model-deployment onnx opencv pyspark scala spark synapse

Last synced: 14 Mar 2025

https://github.com/treeverse/lakefs

lakeFS - Data version control for your data lake | Git for data

apache-spark apache-sparksql aws-s3 azure-blob-storage azure-storage data-engineering data-lake data-quality data-version-control data-versioning datalake datalakes git-for-data go golang google-cloud-storage hadoop-filesystem lakefs object-storage

Last synced: 18 Feb 2026

https://github.com/treeverse/lakeFS

lakeFS - Data version control for your data lake | Git for data

apache-spark apache-sparksql aws-s3 azure-blob-storage azure-storage data-engineering data-lake data-quality data-version-control data-versioning datalake datalakes git-for-data go golang google-cloud-storage hadoop-filesystem lakefs object-storage

Last synced: 20 Mar 2025

https://github.com/lw-lin/coolplayspark

CoolplaySpark (酷玩 Spark): Spark source code analysis, Spark libraries, and more

apache-spark spark spark-streaming sparkcore structured-streaming

Last synced: 14 May 2025

https://github.com/lw-lin/CoolplaySpark

CoolplaySpark (酷玩 Spark): Spark source code analysis, Spark libraries, and more

apache-spark spark spark-streaming sparkcore structured-streaming

Last synced: 04 Apr 2025

https://github.com/spark-notebook/spark-notebook

Interactive and Reactive Data Science using Scala and Spark.

apache-spark data-science notebook reactive scala spark

Last synced: 14 May 2025

https://github.com/kubeflow/spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

apache-spark google-cloud-dataproc kubernetes kubernetes-controller kubernetes-crd kubernetes-operator spark

Last synced: 25 Apr 2025

https://github.com/GoogleCloudPlatform/spark-on-k8s-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

apache-spark google-cloud-dataproc kubernetes kubernetes-controller kubernetes-crd kubernetes-operator spark

Last synced: 24 Apr 2025

https://github.com/intel/bigdl

BigDL: Distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray

analytics-zoo apache-spark bigdl deep-neural-network distributed-deep-learning keras-tensorflow python pytorch scala

Last synced: 14 May 2025

https://github.com/dotnet/spark

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

analytics apache-spark azure bigdata csharp databricks dotnet dotnet-core dotnet-standard emr fsharp hdinsight machine-learning microsoft spark spark-sql spark-streaming streaming tpcds tpch

Last synced: 11 May 2025

https://github.com/big-data-europe/docker-spark

Apache Spark docker image

apache-spark docker k8s-spark kubernetes spark-kubernetes

Last synced: 15 May 2025

https://github.com/feathr-ai/feathr

Feathr – A scalable, unified data and AI engineering platform for enterprise

apache-spark artificial-intelligence azure data-engineering data-quality data-science feature-engineering feature-governance feature-management feature-marketplace feature-metadata feature-platform feature-store machine-learning mlops

Last synced: 09 Jan 2026

https://github.com/OryxProject/oryx

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large-scale machine learning

apache-kafka apache-spark cloudera java kafka lambda-architecture machine-learning oryx

Last synced: 27 Mar 2025

https://github.com/oryxproject/oryx

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large-scale machine learning

apache-kafka apache-spark cloudera java kafka lambda-architecture machine-learning oryx

Last synced: 03 Oct 2025

https://github.com/japila-books/apache-spark-internals

The Internals of Apache Spark

apache-spark book internals spark

Last synced: 15 May 2025

https://github.com/ptyadana/sql-data-analysis-and-visualization-projects

SQL data analysis & visualization projects using MySQL, PostgreSQL, SQLite, Tableau, Apache Spark and pySpark.

apache-spark challenges data-analysis digital-music-store exercises mysql mysql-database mysql-notes mysqlworkbench pgadmin postgres postgresql pyspark python sql sql-data-analysis sql-queries sqlite tableau

Last synced: 16 May 2025

https://github.com/san089/goodreads_etl_pipeline

An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.

airflow airflow-dag apache-airflow apache-spark data-engineering data-engineering-pipeline data-lake data-migration emr-cluster etl-framework etl-job etl-pipeline goodreads-data-pipeline livy python redshift s3 scheduler spark warehouse

Last synced: 16 May 2025

https://github.com/databricks/learningsparkv2

This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]

apache-spark delta-lake mlflow mllib spark spark-mllib spark-sql structured-streaming

Last synced: 14 May 2025

https://github.com/lensacom/sparkit-learn

PySpark + Scikit-learn = Sparkit-learn

apache-spark distributed-computing machine-learning python scikit-learn

Last synced: 15 May 2025

https://github.com/databricks/spark-sklearn

(Deprecated) Scikit-learn integration package for Apache Spark

apache-spark grid-search machine-learning parameter-tuning scikit-learn

Last synced: 30 Sep 2025

https://github.com/mahmoudparsian/data-algorithms-book

MapReduce, Spark, Java, and Scala for Data Algorithms Book

apache-hadoop apache-spark data-algorithms design-patterns distributed-algorithms distributed-computing hadoop-mapreduce java machine-learning mappers mapreduce partitioning pyspark python reducers scala

Last synced: 14 May 2025

https://github.com/graphframes/graphframes

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs

apache-spark big-data connected-components dataframe dataframes graphs network-motif network-motifs networks spark

Last synced: 02 Apr 2026

https://github.com/sparklyr/sparklyr

R interface for Apache Spark

apache-spark distributed dplyr ide livy machine-learning r remote-clusters rstats spark sparklyr

Last synced: 14 May 2025

https://github.com/Microsoft/Mobius

C# and F# language binding and extensions to Apache Spark

apache-spark bigdata csharp dataframe dataset dstream eventhubs fsharp kafka-streaming mapreduce mobius near-real-time rdd spark spark-streaming streaming

Last synced: 14 Mar 2025

https://github.com/microsoft/Mobius

C# and F# language binding and extensions to Apache Spark

apache-spark bigdata csharp dataframe dataset dstream eventhubs fsharp kafka-streaming mapreduce mobius near-real-time rdd spark spark-streaming streaming

Last synced: 08 Apr 2025

https://github.com/microsoft/mobius

C# and F# language binding and extensions to Apache Spark

apache-spark bigdata csharp dataframe dataset dstream eventhubs fsharp kafka-streaming mapreduce mobius near-real-time rdd spark spark-streaming streaming

Last synced: 14 May 2025

https://github.com/lucacanali/sparkmeasure

This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.

apache-spark performance-metrics performance-troubleshooting python scala spark

Last synced: 14 May 2025

https://github.com/lw-lin/streaming-readings

Papers and reading material on streaming systems

apache-spark dataflow drizzle flink heron millwheel s4 spark-streaming spe storm stream-processing stream-processing-engine streaming streaming-engine

Last synced: 04 Apr 2025

https://github.com/miguno/kafka-storm-starter

[PROJECT IS NO LONGER MAINTAINED] Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.

apache-avro apache-kafka apache-spark apache-storm avro integration kafka scala spark storm

Last synced: 17 Dec 2025

https://github.com/aloneguid/parquet-dotnet

Fully managed Apache Parquet implementation

apache-parquet apache-spark dotnet dotnet-core dotnet-standard ios linux windows xamarin xbox

Last synced: 13 May 2025

https://github.com/LucaCanali/sparkMeasure

This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.

apache-spark performance-metrics performance-troubleshooting python scala spark

Last synced: 18 Jul 2025

https://github.com/mrpowers-io/quinn

PySpark methods to enhance developer productivity 📣 👯 🎉

apache-spark pyspark

Last synced: 09 Jun 2026

https://github.com/nchammas/flintrock

A command-line tool for launching Apache Spark clusters.

apache-spark apache-spark-cluster ec2 orchestration spark-ec2

Last synced: 14 May 2025

https://github.com/cerndb/dist-keras

Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.

apache-spark data-parallelism data-science deep-learning distributed-optimizers hadoop keras machine-learning optimization-algorithms tensorflow

Last synced: 03 Oct 2025

https://github.com/apache-spark-on-k8s/spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/

apache-spark kubernetes kubernetes-cluster

Last synced: 03 Oct 2025

https://github.com/openscoring/openscoring

REST web service for the true real-time scoring (<1 ms) of Scikit-Learn, R and Apache Spark models

apache-spark api lightgbm pmml r real-time scikit-learn xgboost

Last synced: 11 Feb 2026

https://github.com/japila-books/spark-sql-internals

The Internals of Spark SQL

apache-spark book internals mkdocs-material spark spark-sql

Last synced: 23 Jan 2026

https://github.com/rjurney/agile_data_code_2

Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition

agile-data agile-data-science airflow amazon-ec2 amazon-web-services analytics apache-kafka apache-spark data data-science data-syndrome kafka machine-learning machine-learning-algorithms predictive-analytics python python-3 python3 spark vagrant

Last synced: 12 Apr 2025

https://github.com/rjurney/Agile_Data_Code_2

Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition

agile-data agile-data-science airflow amazon-ec2 amazon-web-services analytics apache-kafka apache-spark data data-science data-syndrome kafka machine-learning machine-learning-algorithms predictive-analytics python python-3 python3 spark vagrant

Last synced: 19 Jul 2025

https://github.com/tweag/sparkle

Haskell on Apache Spark.

analytics apache-spark haskell spark

Last synced: 16 May 2025

https://github.com/lucacanali/miscellaneous

Includes notes on using Apache Spark, with a drill-down on Spark for Physics, how to run TPCDS on PySpark, and how to create histograms with Spark. Also includes tools for stress testing and measuring CPU performance, and Jupyter notebook examples for using various DB systems.

apache-spark database jupyter-notebooks performance-analysis performance-monitoring performance-testing

Last synced: 16 May 2025

https://github.com/cartershanklin/pyspark-cheatsheet

PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster

apache-spark big-data pyspark spark

Last synced: 29 Oct 2025

https://github.com/LucaCanali/Miscellaneous

Includes notes on using Apache Spark in general, notes on using Spark for Physics, how to run TPCDS on PySpark, how to create histograms with Spark, tools for performance testing CPUs, Jupyter notebook examples for Spark, and examples for Oracle and other DB systems.

apache-spark database jupyter-notebooks performance-analysis performance-monitoring performance-testing

Last synced: 13 Apr 2025

https://github.com/japila-books/spark-structured-streaming-internals

The Internals of Spark Structured Streaming

apache-spark book internals mkdocs-material spark structured-streaming

Last synced: 05 Apr 2025

https://github.com/ekampf/pyspark-boilerplate

A boilerplate for writing PySpark Jobs

apache-spark boilerplate pyspark python

Last synced: 05 Apr 2025

https://github.com/opencypher/morpheus

Morpheus brings the leading graph query language, Cypher, onto the leading distributed processing platform, Spark.

apache-spark apache2 big-data cypher graph scala

Last synced: 01 Apr 2026

https://github.com/tirthajyoti/spark-with-python

Fundamentals of Spark with Python (using PySpark), code examples

analytics apache apache-spark big-data database dataframe distributed-computing hadoop hdfs machine-learning map-reduce mlib parallel-computing pyspark python spark sql

Last synced: 05 Apr 2025

https://github.com/datamechanics/delight

A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.

apache-spark cpu dashboard delight kubernetes memory monitoring netapp-public spark spark-history-server spark-ui

Last synced: 03 Oct 2025

https://github.com/dmmiller612/sparktorch

Train and run PyTorch models on Apache Spark.

apache-spark deep-learning distributed-computing inference pipelines pytorch sparktorch

Last synced: 05 Apr 2025

https://github.com/miguno/wirbelsturm

[PROJECT IS NO LONGER MAINTAINED] Wirbelsturm is a Vagrant- and Puppet-based tool to perform 1-click local and remote deployments, with a focus on big data tech like Kafka.

apache-kafka apache-spark apache-storm kafka puppet spark storm vagrant

Last synced: 03 Oct 2025

https://github.com/Hydrospheredata/mist

Serverless proxy for Spark cluster

apache-spark api big-data serverless

Last synced: 27 Mar 2025

https://github.com/hydrospheredata/mist

Serverless proxy for Spark cluster

apache-spark api big-data serverless

Last synced: 05 Apr 2025

https://github.com/microsoft/data-accelerator

Data Accelerator for Apache Spark simplifies onboarding to streaming of big data. It offers a rich, easy-to-use experience to help with the creation, editing, and management of Spark jobs on Azure HDInsight or Databricks while enabling the full power of the Spark engine.

apache-spark azure big-data cosmosdb docker eventhub hdinsight iot iothub kafka kafka-streams nodejs react servicefabric spark spark-sql spark-streaming sparksql streaming streaming-data

Last synced: 15 May 2025

https://github.com/mingchen0919/learning-apache-spark

Notes on Apache Spark (pyspark)

apache-spark machine-learning pyspark-tutorial

Last synced: 06 Apr 2025

https://github.com/MingChen0919/learning-apache-spark

Notes on Apache Spark (pyspark)

apache-spark machine-learning pyspark-tutorial

Last synced: 26 Mar 2025

https://github.com/lifeomic/sparkflow

Easy-to-use library to bring TensorFlow to Apache Spark

apache-spark dataframe deep-learning lifeomic pipeline spark-ml tensorflow

Last synced: 04 Apr 2025

https://github.com/josephmachado/efficient_data_processing_spark

Code for "Efficient Data Processing in Spark" Course

apache-spark data-engineering data-pipeline minio pyspark pyspark-notebook

Last synced: 15 Apr 2025

https://github.com/cuebook/cuelake

Use SQL to build ELT pipelines on a data lakehouse.

apache-iceberg apache-spark data-engineering data-ingestion data-integration data-lake data-pipeline data-transfer datalake delta elt etl incremental-updates lakehouse pipelines spark-sql sql upsert zeppelin-notebook

Last synced: 07 Apr 2025

https://github.com/svenkreiss/pysparkling

A pure Python implementation of Apache Spark's RDD and DStream interfaces.

apache-spark data-processing data-science python

Last synced: 07 Apr 2025

https://github.com/hortonworks-spark/spark-atlas-connector

A Spark Atlas connector to track data lineage in Apache Atlas

apache-atlas apache-spark

Last synced: 28 Oct 2025

https://github.com/jaceklaskowski/spark-workshop

Apache Spark™ and Scala Workshops

apache-spark spark spark-mllib spark-sql spark-structured-streaming spark-workshops workshop

Last synced: 05 Apr 2025

https://github.com/PiercingDan/spark-Jupyter-AWS

A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support

apache-spark apache-spark-cluster aws aws-ec2 aws-s3 ebs-volumes ec2 ec2-instance jupyter jupyter-notebook spark spark-clusters

Last synced: 19 Jul 2025

https://github.com/piercingdan/spark-jupyter-aws

A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support

apache-spark apache-spark-cluster aws aws-ec2 aws-s3 ebs-volumes ec2 ec2-instance jupyter jupyter-notebook spark spark-clusters

Last synced: 12 May 2025

https://github.com/dataflint/spark

Performance Observability for Apache Spark

apache-spark big-data data-pipeline data-pipelines databricks dataproc emr etl observability optimization spark-operator

Last synced: 10 May 2026

https://github.com/mellanox/sparkrdma

This is the archive of the SparkRDMA project. The new repository with RDMA shuffle acceleration for Apache Spark is here: https://github.com/Nvidia/sparkucx

apache-spark big-data bigdata disni hadoop infiniband java mellanox rdma roce scala shuffle spark

Last synced: 03 Oct 2025

https://github.com/airscholar/e2e-data-engineering

An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.

apache-airflow apache-kafka apache-spark apache-zookeeper big-data cassandra containerization data-engineering data-pipeline data-processing data-storage docker etl-pipeline postgresql real-time-analytics

Last synced: 16 May 2025

https://github.com/azure/azure-event-hubs-spark

Enabling Continuous Data Processing with Apache Spark and Azure Event Hubs

apache apache-spark azure bigdata connector continuous databricks event-hubs eventhubs ingestion kafka microsoft real-time scala spark spark-streaming stream streaming structured-streaming

Last synced: 13 Feb 2026

https://github.com/Chabane/bigdata-playground

A complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL

angular apache-flink apache-spark avro big-data docker graphql hadoop hbase kafka kops machine-learning mongodb nodejs parquet python scala spark-sql spark-streaming twitter-api

Last synced: 28 Apr 2025

https://github.com/chabane/bigdata-playground

A complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL

angular apache-flink apache-spark avro big-data docker graphql hadoop hbase kafka kops machine-learning mongodb nodejs parquet python scala spark-sql spark-streaming twitter-api

Last synced: 13 Apr 2025

https://github.com/Azure/azure-cosmosdb-spark

Apache Spark Connector for Azure Cosmos DB

apache-spark azure-cosmos-db azure-databricks changefeed connector cosmos-db databricks databricks-notebooks jupyter-notebook lambda-architecture pyspark spark

Last synced: 10 May 2025

https://github.com/azure/azure-cosmosdb-spark

Apache Spark Connector for Azure Cosmos DB

apache-spark azure-cosmos-db azure-databricks changefeed connector cosmos-db databricks databricks-notebooks jupyter-notebook lambda-architecture pyspark spark

Last synced: 02 Mar 2025

https://github.com/lynnlangit/learning-hadoop-and-spark

Companion to Learning Hadoop and Learning Spark courses on LinkedIn Learning

apache-spark dataproc emr hadoop learning-hadoop mapreduce spark wordcount

Last synced: 16 May 2025

https://github.com/databrickslabs/automl-toolkit

Toolkit for Apache Spark ML for feature clean-up, feature importance calculation suite, information gain selection, distributed SMOTE, model selection and training, hyperparameter optimization and selection, and model interpretability.

apache-spark feature-engineering machinelearning ml pyspark scala spark

Last synced: 03 Oct 2025

https://github.com/whylabs/whylogs-java

Profile and monitor your ML data pipeline end-to-end

ai-pipelines aiops apache-spark approximate-statistics calculate-statistics data-quality dataset java mlops spark statistical-properties statistics whylogs

Last synced: 03 Oct 2025

https://github.com/ibm/spark-tpc-ds-performance-test

Use the TPC-DS benchmark to test Spark SQL performance

apache-spark ibm-developer-technology-cognitive ibmcode jupyter-notebook tpc-ds-benchmark tpc-ds-queries

Last synced: 03 Oct 2025

https://github.com/vinta/albedo

A recommender system for discovering GitHub repos, built with Apache Spark

apache-spark elasticsearch feature-engineering machine-learning python recommender-system scala

Last synced: 09 Aug 2025

https://github.com/lamastex/scalable-data-science

Scalable Data Science: course sets in big data using Apache Spark on Databricks, and their mathematical, statistical and computational foundations using SageMath.

apache-spark data-science databricks scala

Last synced: 16 May 2025