Projects in Awesome Lists tagged with apache-spark
A curated list of projects in awesome lists tagged with apache-spark.
https://github.com/mlflow/mlflow
The open source developer platform to build AI/LLM applications and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.
agentops agents ai ai-governance apache-spark evaluation langchain llm-evaluation llmops machine-learning ml mlflow mlops model-management observability open-source openai prompt-engineering
Last synced: 27 Dec 2025
https://github.com/microsoft/synapseml
Simple and Distributed Machine Learning
ai apache-spark azure big-data cognitive-services data-science databricks deep-learning http lightgbm machine-learning microsoft ml model-deployment onnx opencv pyspark scala spark synapse
Last synced: 13 May 2025
https://microsoft.github.io/SynapseML/
Simple and Distributed Machine Learning
ai apache-spark azure big-data cognitive-services data-science databricks deep-learning http lightgbm machine-learning microsoft ml model-deployment onnx opencv pyspark scala spark synapse
Last synced: 29 Apr 2025
https://github.com/microsoft/SynapseML
Simple and Distributed Machine Learning
ai apache-spark azure big-data cognitive-services data-science databricks deep-learning http lightgbm machine-learning microsoft ml model-deployment onnx opencv pyspark scala spark synapse
Last synced: 14 Mar 2025
https://github.com/treeverse/lakefs
lakeFS - Data version control for your data lake | Git for data
apache-spark apache-sparksql aws-s3 azure-blob-storage azure-storage data-engineering data-lake data-quality data-version-control data-versioning datalake datalakes git-for-data go golang google-cloud-storage hadoop-filesystem lakefs object-storage
Last synced: 24 Dec 2025
https://github.com/treeverse/lakeFS
lakeFS - Data version control for your data lake | Git for data
apache-spark apache-sparksql aws-s3 azure-blob-storage azure-storage data-engineering data-lake data-quality data-version-control data-versioning datalake datalakes git-for-data go golang google-cloud-storage hadoop-filesystem lakefs object-storage
Last synced: 20 Mar 2025
https://github.com/lw-lin/coolplayspark
Coolplay Spark: Spark source code analysis, Spark libraries, and more
apache-spark spark spark-streaming sparkcore structured-streaming
Last synced: 14 May 2025
https://github.com/lw-lin/CoolplaySpark
Coolplay Spark: Spark source code analysis, Spark libraries, and more
apache-spark spark spark-streaming sparkcore structured-streaming
Last synced: 04 Apr 2025
https://github.com/spark-notebook/spark-notebook
Interactive and Reactive Data Science using Scala and Spark.
apache-spark data-science notebook reactive scala spark
Last synced: 14 May 2025
https://github.com/kubeflow/spark-operator
Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
apache-spark google-cloud-dataproc kubernetes kubernetes-controller kubernetes-crd kubernetes-operator spark
Last synced: 25 Apr 2025
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
apache-spark google-cloud-dataproc kubernetes kubernetes-controller kubernetes-crd kubernetes-operator spark
Last synced: 24 Apr 2025
https://github.com/intel/bigdl
BigDL: Distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray
analytics-zoo apache-spark bigdl deep-neural-network distributed-deep-learning keras-tensorflow python pytorch scala
Last synced: 14 May 2025
https://github.com/dotnet/spark
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
analytics apache-spark azure bigdata csharp databricks dotnet dotnet-core dotnet-standard emr fsharp hdinsight machine-learning microsoft spark spark-sql spark-streaming streaming tpcds tpch
Last synced: 11 May 2025
https://github.com/big-data-europe/docker-spark
Apache Spark docker image
apache-spark docker k8s-spark kubernetes spark-kubernetes
Last synced: 15 May 2025
https://github.com/feathr-ai/feathr
Feathr – A scalable, unified data and AI engineering platform for enterprise
apache-spark artificial-intelligence azure data-engineering data-quality data-science feature-engineering feature-governance feature-management feature-marketplace feature-metadata feature-platform feature-store machine-learning mlops
Last synced: 14 May 2025
https://github.com/OryxProject/oryx
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning
apache-kafka apache-spark cloudera java kafka lambda-architecture machine-learning oryx
Last synced: 27 Mar 2025
https://github.com/oryxproject/oryx
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning
apache-kafka apache-spark cloudera java kafka lambda-architecture machine-learning oryx
Last synced: 03 Oct 2025
https://github.com/japila-books/apache-spark-internals
The Internals of Apache Spark
apache-spark book internals spark
Last synced: 15 May 2025
https://github.com/ptyadana/sql-data-analysis-and-visualization-projects
SQL data analysis & visualization projects using MySQL, PostgreSQL, SQLite, Tableau, Apache Spark and PySpark.
apache-spark challenges data-analysis digital-music-store exercises mysql mysql-database mysql-notes mysqlworkbench pgadmin postgres postgresql pyspark python sql sql-data-analysis sql-queries sqlite tableau
Last synced: 16 May 2025
https://github.com/san089/goodreads_etl_pipeline
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
airflow airflow-dag apache-airflow apache-spark data-engineering data-engineering-pipeline data-lake data-migration emr-cluster etl-framework etl-job etl-pipeline goodreads-data-pipeline livy python redshift s3 scheduler spark warehouse
Last synced: 16 May 2025
https://github.com/databricks/learningsparkv2
This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]
apache-spark delta-lake mlflow mllib spark spark-mllib spark-sql structured-streaming
Last synced: 14 May 2025
https://github.com/lensacom/sparkit-learn
PySpark + Scikit-learn = Sparkit-learn
apache-spark distributed-computing machine-learning python scikit-learn
Last synced: 15 May 2025
https://github.com/databricks/spark-sklearn
(Deprecated) Scikit-learn integration package for Apache Spark
apache-spark grid-search machine-learning parameter-tuning scikit-learn
Last synced: 30 Sep 2025
https://github.com/mahmoudparsian/data-algorithms-book
MapReduce, Spark, Java, and Scala for Data Algorithms Book
apache-hadoop apache-spark data-algorithms design-patterns distributed-algorithms distributed-computing hadoop-mapreduce java machine-learning mappers mapreduce partitioning pyspark python reducers scala
Last synced: 14 May 2025
https://github.com/graphframes/graphframes
GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs
apache-spark big-data connected-components dataframe dataframes graphs network-motif network-motifs networks spark
Last synced: 14 May 2025
https://github.com/sparklyr/sparklyr
R interface for Apache Spark
apache-spark distributed dplyr ide livy machine-learning r remote-clusters rstats spark sparklyr
Last synced: 14 May 2025
https://github.com/microsoft/mobius
C# and F# language binding and extensions to Apache Spark
apache-spark bigdata csharp dataframe dataset dstream eventhubs fsharp kafka-streaming mapreduce mobius near-real-time rdd spark spark-streaming streaming
Last synced: 14 May 2025
https://github.com/Microsoft/Mobius
C# and F# language binding and extensions to Apache Spark
apache-spark bigdata csharp dataframe dataset dstream eventhubs fsharp kafka-streaming mapreduce mobius near-real-time rdd spark spark-streaming streaming
Last synced: 14 Mar 2025
https://github.com/microsoft/Mobius
C# and F# language binding and extensions to Apache Spark
apache-spark bigdata csharp dataframe dataset dstream eventhubs fsharp kafka-streaming mapreduce mobius near-real-time rdd spark spark-streaming streaming
Last synced: 08 Apr 2025
https://github.com/lucacanali/sparkmeasure
This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.
apache-spark performance-metrics performance-troubleshooting python scala spark
Last synced: 14 May 2025
https://github.com/lw-lin/streaming-readings
Papers and readings related to streaming systems
apache-spark dataflow drizzle flink heron millwheel s4 spark-streaming spe storm stream-processing stream-processing-engine streaming streaming-engine
Last synced: 04 Apr 2025
https://github.com/aloneguid/parquet-dotnet
Fully managed Apache Parquet implementation
apache-parquet apache-spark dotnet dotnet-core dotnet-standard ios linux windows xamarin xbox
Last synced: 13 May 2025
https://github.com/miguno/kafka-storm-starter
[PROJECT IS NO LONGER MAINTAINED] Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.
apache-avro apache-kafka apache-spark apache-storm avro integration kafka scala spark storm
Last synced: 17 Dec 2025
https://github.com/LucaCanali/sparkMeasure
This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.
apache-spark performance-metrics performance-troubleshooting python scala spark
Last synced: 18 Jul 2025
https://github.com/mrpowers-io/quinn
pyspark methods to enhance developer productivity 📣 👯 🎉
Last synced: 14 Apr 2025
https://github.com/nchammas/flintrock
A command-line tool for launching Apache Spark clusters.
apache-spark apache-spark-cluster ec2 orchestration spark-ec2
Last synced: 14 May 2025
https://github.com/cerndb/dist-keras
Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
apache-spark data-parallelism data-science deep-learning distributed-optimizers hadoop keras machine-learning optimization-algorithms tensorflow
Last synced: 03 Oct 2025
https://github.com/apache-spark-on-k8s/spark
Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
apache-spark kubernetes kubernetes-cluster
Last synced: 03 Oct 2025
https://github.com/openscoring/openscoring
REST web service for the true real-time scoring (<1 ms) of Scikit-Learn, R and Apache Spark models
apache-spark api lightgbm pmml r real-time scikit-learn xgboost
Last synced: 15 May 2025
https://github.com/japila-books/spark-sql-internals
The Internals of Spark SQL
apache-spark book internals mkdocs-material spark spark-sql
Last synced: 15 May 2025
https://github.com/rjurney/Agile_Data_Code_2
Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
agile-data agile-data-science airflow amazon-ec2 amazon-web-services analytics apache-kafka apache-spark data data-science data-syndrome kafka machine-learning machine-learning-algorithms predictive-analytics python python-3 python3 spark vagrant
Last synced: 19 Jul 2025
https://github.com/rjurney/agile_data_code_2
Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
agile-data agile-data-science airflow amazon-ec2 amazon-web-services analytics apache-kafka apache-spark data data-science data-syndrome kafka machine-learning machine-learning-algorithms predictive-analytics python python-3 python3 spark vagrant
Last synced: 12 Apr 2025
https://github.com/tweag/sparkle
Haskell on Apache Spark.
analytics apache-spark haskell spark
Last synced: 16 May 2025
https://github.com/lucacanali/miscellaneous
Includes notes on using Apache Spark, with a drill-down on Spark for Physics, how to run TPCDS on PySpark, and how to create histograms with Spark. Also includes tools for stress testing and measuring CPU performance, and Jupyter notebook examples for using various DB systems.
apache-spark database jupyter-notebooks performance-analysis performance-monitoring performance-testing
Last synced: 16 May 2025
https://github.com/cartershanklin/pyspark-cheatsheet
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
apache-spark big-data pyspark spark
Last synced: 29 Oct 2025
https://github.com/LucaCanali/Miscellaneous
Includes notes on using Apache Spark in general, notes on using Spark for Physics, how to run TPCDS on PySpark, how to create histograms with Spark, tools for performance testing CPUs, Jupyter notebook examples for Spark, and examples for Oracle and other DB systems.
apache-spark database jupyter-notebooks performance-analysis performance-monitoring performance-testing
Last synced: 13 Apr 2025
https://github.com/japila-books/spark-structured-streaming-internals
The Internals of Spark Structured Streaming
apache-spark book internals mkdocs-material spark structured-streaming
Last synced: 05 Apr 2025
https://github.com/ekampf/pyspark-boilerplate
A boilerplate for writing PySpark Jobs
apache-spark boilerplate pyspark python
Last synced: 05 Apr 2025
https://github.com/tirthajyoti/spark-with-python
Fundamentals of Spark with Python (using PySpark), code examples
analytics apache apache-spark big-data database dataframe distributed-computing hadoop hdfs machine-learning map-reduce mlib parallel-computing pyspark python spark sql
Last synced: 05 Apr 2025
https://github.com/datamechanics/delight
A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.
apache-spark cpu dashboard delight kubernetes memory monitoring netapp-public spark spark-history-server spark-ui
Last synced: 03 Oct 2025
https://github.com/dmmiller612/sparktorch
Train and run PyTorch models on Apache Spark.
apache-spark deep-learning distributed-computing inference pipelines pytorch sparktorch
Last synced: 05 Apr 2025
https://github.com/opencypher/morpheus
Morpheus brings the leading graph query language, Cypher, onto the leading distributed processing platform, Spark.
apache-spark apache2 big-data cypher graph scala
Last synced: 05 Apr 2025
https://github.com/miguno/wirbelsturm
[PROJECT IS NO LONGER MAINTAINED] Wirbelsturm is a Vagrant and Puppet based tool to perform 1-click local and remote deployments, with a focus on big data tech like Kafka.
apache-kafka apache-spark apache-storm kafka puppet spark storm vagrant
Last synced: 03 Oct 2025
https://github.com/Hydrospheredata/mist
Serverless proxy for Spark cluster
apache-spark api big-data serverless
Last synced: 27 Mar 2025
https://github.com/hydrospheredata/mist
Serverless proxy for Spark cluster
apache-spark api big-data serverless
Last synced: 05 Apr 2025
https://github.com/microsoft/data-accelerator
Data Accelerator for Apache Spark simplifies onboarding to streaming of big data. It offers a rich, easy-to-use experience to help with the creation, editing and management of Spark jobs on Azure HDInsight or Databricks while enabling the full power of the Spark engine.
apache-spark azure big-data cosmosdb docker eventhub hdinsight iot iothub kafka kafka-streams nodejs react servicefabric spark spark-sql spark-streaming sparksql streaming streaming-data
Last synced: 15 May 2025
https://github.com/mingchen0919/learning-apache-spark
Notes on Apache Spark (pyspark)
apache-spark machine-learning pyspark-tutorial
Last synced: 06 Apr 2025
https://github.com/MingChen0919/learning-apache-spark
Notes on Apache Spark (pyspark)
apache-spark machine-learning pyspark-tutorial
Last synced: 26 Mar 2025
https://github.com/lifeomic/sparkflow
Easy-to-use library for bringing TensorFlow to Apache Spark
apache-spark dataframe deep-learning lifeomic pipeline spark-ml tensorflow
Last synced: 04 Apr 2025
https://github.com/josephmachado/efficient_data_processing_spark
Code for "Efficient Data Processing in Spark" Course
apache-spark data-engineering data-pipeline minio pyspark pyspark-notebook
Last synced: 15 Apr 2025
https://github.com/cuebook/cuelake
Use SQL to build ELT pipelines on a data lakehouse.
apache-iceberg apache-spark data-engineering data-ingestion data-integration data-lake data-pipeline data-transfer datalake delta elt etl incremental-updates lakehouse pipelines spark-sql sql upsert zeppelin-notebook
Last synced: 07 Apr 2025
https://github.com/svenkreiss/pysparkling
A pure Python implementation of Apache Spark's RDD and DStream interfaces.
apache-spark data-processing data-science python
Last synced: 07 Apr 2025
https://github.com/hortonworks-spark/spark-atlas-connector
A Spark Atlas connector to track data lineage in Apache Atlas
Last synced: 28 Oct 2025
https://github.com/jaceklaskowski/spark-workshop
Apache Spark™ and Scala Workshops
apache-spark spark spark-mllib spark-sql spark-structured-streaming spark-workshops workshop
Last synced: 05 Apr 2025
https://github.com/PiercingDan/spark-Jupyter-AWS
A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support
apache-spark apache-spark-cluster aws aws-ec2 aws-s3 ebs-volumes ec2 ec2-instance jupyter jupyter-notebook spark spark-clusters
Last synced: 19 Jul 2025
https://github.com/piercingdan/spark-jupyter-aws
A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support
apache-spark apache-spark-cluster aws aws-ec2 aws-s3 ebs-volumes ec2 ec2-instance jupyter jupyter-notebook spark spark-clusters
Last synced: 12 May 2025
https://github.com/dataflint/spark
Performance Observability for Apache Spark
apache-spark big-data data-pipeline data-pipelines databricks dataproc emr etl observability optimization spark-operator
Last synced: 12 Apr 2025
https://github.com/mellanox/sparkrdma
This is the archive of the SparkRDMA project. The new repository with RDMA shuffle acceleration for Apache Spark is here: https://github.com/Nvidia/sparkucx
apache-spark big-data bigdata disni hadoop infiniband java mellanox rdma roce scala shuffle spark
Last synced: 03 Oct 2025
https://github.com/airscholar/e2e-data-engineering
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.
apache-airflow apache-kafka apache-spark apache-zookeeper big-data cassandra containerization data-engineering data-pipeline data-processing data-storage docker etl-pipeline postgresql real-time-analytics
Last synced: 16 May 2025
https://github.com/azure/azure-event-hubs-spark
Enabling Continuous Data Processing with Apache Spark and Azure Event Hubs
apache apache-spark azure bigdata connector continuous databricks event-hubs eventhubs ingestion kafka microsoft real-time scala spark spark-streaming stream streaming structured-streaming
Last synced: 15 May 2025
https://github.com/chabane/bigdata-playground
A complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL
angular apache-flink apache-spark avro big-data docker graphql hadoop hbase kafka kops machine-learning mongodb nodejs parquet python scala spark-sql spark-streaming twitter-api
Last synced: 13 Apr 2025
https://github.com/Chabane/bigdata-playground
A complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL
angular apache-flink apache-spark avro big-data docker graphql hadoop hbase kafka kops machine-learning mongodb nodejs parquet python scala spark-sql spark-streaming twitter-api
Last synced: 28 Apr 2025
https://github.com/Azure/azure-cosmosdb-spark
Apache Spark Connector for Azure Cosmos DB
apache-spark azure-cosmos-db azure-databricks changefeed connector cosmos-db databricks databricks-notebooks jupyter-notebook lambda-architecture pyspark spark
Last synced: 10 May 2025
https://github.com/azure/azure-cosmosdb-spark
Apache Spark Connector for Azure Cosmos DB
apache-spark azure-cosmos-db azure-databricks changefeed connector cosmos-db databricks databricks-notebooks jupyter-notebook lambda-architecture pyspark spark
Last synced: 02 Mar 2025
https://github.com/lynnlangit/learning-hadoop-and-spark
Companion to Learning Hadoop and Learning Spark courses on LinkedIn Learning
apache-spark dataproc emr hadoop learning-hadoop mapreduce spark wordcount
Last synced: 16 May 2025
https://github.com/databrickslabs/automl-toolkit
Toolkit for Apache Spark ML for feature clean-up, feature importance calculation suite, information gain selection, distributed SMOTE, model selection and training, hyperparameter optimization and selection, and model interpretability.
apache-spark feature-engineering machinelearning ml pyspark scala spark
Last synced: 03 Oct 2025
https://github.com/whylabs/whylogs-java
Profile and monitor your ML data pipeline end-to-end
ai-pipelines aiops apache-spark approximate-statistics calculate-statistics data-quality dataset java mlops spark statistical-properties statistics whylogs
Last synced: 03 Oct 2025
https://github.com/ibm/spark-tpc-ds-performance-test
Use the TPC-DS benchmark to test Spark SQL performance
apache-spark ibm-developer-technology-cognitive ibmcode jupyter-notebook tpc-ds-benchmark tpc-ds-queries
Last synced: 03 Oct 2025
https://github.com/vinta/albedo
A recommender system for discovering GitHub repos, built with Apache Spark
apache-spark elasticsearch feature-engineering machine-learning python recommender-system scala
Last synced: 09 Aug 2025
https://github.com/lamastex/scalable-data-science
Scalable Data Science: course sets in big data using Apache Spark over Databricks, and their mathematical, statistical and computational foundations using SageMath.
apache-spark data-science databricks scala
Last synced: 16 May 2025
https://github.com/radanalyticsio/spark-operator
Operator for managing the Spark clusters on Kubernetes and OpenShift.
apache-spark kubernetes kubernetes-operator openshift spark
Last synced: 07 May 2025
https://github.com/mahmoudparsian/big-data-mapreduce-course
Big Data Modeling, MapReduce, Spark, PySpark @ Santa Clara University
algorithms apache-hadoop apache-spark big-data data-algorithms data-analysis data-engineering data-partition data-transformation glossary mapreduce mapreduce-algorithm mapreduce-python monoid partitioning-algorithms pyspark pyspark-algorithms-book santa-clara-university spark-dataframes spark-rdd
Last synced: 12 Apr 2025
https://github.com/BitwiseInc/Hydrograph
A visual ETL development and debugging tool for big data
apache-spark big-data cascading etl etl-framework
Last synced: 03 Apr 2025
https://github.com/bitwiseinc/hydrograph
A visual ETL development and debugging tool for big data
apache-spark big-data cascading etl etl-framework
Last synced: 24 Oct 2025
https://github.com/qubole/spark-on-lambda
Apache Spark on AWS Lambda
apache-spark aws aws-cloud aws-lambda big-data lambda serverless spark
Last synced: 07 Apr 2025
https://github.com/SANSA-Stack/SANSA-Stack
Big Data RDF Processing and Analytics Stack built on Apache Spark and Apache Jena http://sansa-stack.github.io/SANSA-Stack/
apache-jena apache-spark distributed-computing flink rdf semantic-web spark
Last synced: 09 Jul 2025
https://github.com/sansa-stack/sansa-stack
Big Data RDF Processing and Analytics Stack built on Apache Spark and Apache Jena http://sansa-stack.github.io/SANSA-Stack/
apache-jena apache-spark distributed-computing flink rdf semantic-web spark
Last synced: 04 Apr 2025
https://github.com/archivesunleashed/aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
analysis apache-spark big-data big-data-analytics dataframe digital-humanities hadoop network-graphing pyspark python3 scala spark text-extraction webarchives
Last synced: 13 Apr 2025
https://github.com/gtkcyber/griffon-vm
Griffon Data Science Virtual Machine
apache-drill apache-spark big-data data-science database elasticsearch hadoop jupyter-notebook mysql node-js python r ruby scala virtual-machine
Last synced: 29 Oct 2025
https://github.com/memverge/splash
Splash, a flexible Spark shuffle manager that supports user-defined storage backends for shuffle data storage and exchange
apache-spark bigdata disaggregation elasticity java scala shuffle spark storage
Last synced: 02 Sep 2025
https://github.com/jleetutorial/scala-spark-tutorial
Project for James' Apache Spark with Scala course
Last synced: 09 Apr 2025
https://github.com/googlecloudplatform/dataproc-templates
Dataproc templates and pipelines for solving simple in-cloud data tasks
apache-spark bigquery gcp google-cloud google-cloud-platform jupyter-notebook pyspark
Last synced: 15 May 2025
https://github.com/learningjournal/spark-streaming-in-python
Apache Spark 3 - Structured Streaming Course Material
apache-spark big-data bigdata data-lake pyspark python spark-sql spark-streaming
Last synced: 04 Sep 2025
https://github.com/zero323/pyspark-stubs
Apache (Py)Spark type annotations (stub files).
apache-spark mypy pep484 pyspark python python-3 stub-files type-annotations
Last synced: 03 Oct 2025
https://github.com/streamnative/pulsar-spark
Spark Connector to read and write with Pulsar
apache-pulsar apache-spark batch-processing data-processing data-science flink spark spark-sql stream-processing structured-streaming
Last synced: 16 May 2025
https://github.com/vivek-bombatkar/spark-with-python---my-learning-notes-
ETL pipeline using pyspark (Spark - Python)
apache-spark catalyst-optimizer python spark tungsten
Last synced: 29 Oct 2025
https://github.com/jgperrin/net.jgp.books.spark.ch01
Spark in Action, 2nd edition - chapter 1 - Introduction
apache-spark java java8 manning spark sparkwithjava
Last synced: 20 Aug 2025
https://github.com/g-research/fasttrackml
Experiment tracking server focused on speed and scalability
ai apache-spark data-science data-visualization experiment-tracking machine-learning metadata metadata-tracking metrics ml mlflow mlflow-tracking-server mlops pytorch tensorboard tensorflow visualization
Last synced: 16 May 2025
https://github.com/G-Research/fasttrackml
Experiment tracking server focused on speed and scalability
ai apache-spark data-science data-visualization experiment-tracking machine-learning metadata metadata-tracking metrics ml mlflow mlflow-tracking-server mlops pytorch tensorboard tensorflow visualization
Last synced: 26 Jul 2025
https://github.com/dimajix/flowman
Flowman is an ETL framework powered by Apache Spark. With its declarative approach, Flowman simplifies the development of complex data pipelines.
apache-spark big-data bigdata data-engineering etl flowman hadoop scala spark sql
Last synced: 04 Apr 2025