awesome-spark

A curated list of awesome Apache Spark packages and resources.
https://github.com/awesome-spark/awesome-spark

Packages
- Language Bindings
  - Flambo - Clojure DSL.
  - sparkle - Haskell on Apache Spark.
  - Kotlin for Apache Spark - Kotlin API bindings and extensions.
  - Mobius - C# bindings (deprecated in favor of .NET for Apache Spark).
  - .NET for Apache Spark - .NET bindings.
  - spark-connect-rs - Rust bindings.
  - spark-connect-go - Golang bindings.
  - spark-connect-csharp - C# bindings.
- Notebooks and IDEs
  - almond - A Scala kernel for [Jupyter](https://jupyter.org/).
  - Apache Zeppelin - Web-based notebook that enables interactive data analytics with pluggable backends, integrated plotting, and extensive Spark support out of the box.
  - Polynote - An IDE-inspired polyglot notebook. It supports mixing multiple languages in one notebook and sharing data between them seamlessly. It encourages reproducible notebooks with its immutable data model. Originating from [Netflix](https://medium.com/netflix-techblog/open-sourcing-polynote-an-ide-inspired-polyglot-notebook-7f929d3f447).
  - Spark Notebook - Scalable and stable notebook focused on Scala and Spark, bridging the gap between the JVM and data scientists (incl. extendable, typesafe and reactive charts).
  - sparkmagic - [Jupyter](https://jupyter.org/) magics and kernels for working interactively with remote Spark clusters through [Livy](https://github.com/cloudera/livy).
- General Purpose Libraries
  - Succinct - Support for efficient queries on compressed data.
  - Apache DataFu - A library of general purpose functions and UDFs.
  - itachi - A library that brings useful functions from modern database management systems to Apache Spark.
  - spark-daria - A Scala library with essential Spark functions and extensions to make you more productive.
  - Joblib Apache Spark Backend - [`joblib`](https://github.com/joblib/joblib) backend for running tasks on Spark clusters.
  - quinn - A native PySpark implementation of spark-daria.
- SQL Data Sources
  - Spark SQL has several built-in data sources.
  - Spark XML - XML parser and writer.
  - Spark Cassandra Connector - Cassandra support including data source and API and support for arbitrary queries.
  - Mongo-Spark - Official MongoDB connector.
- Storage
  - lakeFS - Integration with the lakeFS atomic versioned storage layer.
  - Delta Lake - Storage layer with ACID transactions.
  - Apache Hudi - Upserts, deletes and incremental processing on big data.
  - Apache Iceberg - Open table format for huge analytic datasets.
- Graph Processing
  - SparklingGraph - Library extending GraphX features with multiple functionalities useful in graph analytics (measures, generators, link prediction etc.).
  - neo4j-spark-connector - Bolt protocol-based Neo4j connector with RDD, DataFrame and GraphX / GraphFrames support.
  - GraphFrames - DataFrame-based graph API.
- Machine Learning Extension
  - Apache SystemML - Declarative machine learning framework on top of Spark.
  - Mahout Spark Bindings - Linear algebra DSL and optimizer with R-like syntax.
  - KeystoneML - Type-safe machine learning pipelines with RDDs.
  - Microsoft ML for Apache Spark - A distributed ML library with support for LightGBM, Vowpal Wabbit, OpenCV, Deep Learning, Cognitive Services, and Model Deployment.
  - MLflow - Machine learning orchestration platform.
- Utilities
  - Optimus - Data cleansing and exploration utilities with the goal of simplifying data cleaning.
- Streaming
  - Apache Bahir - Collection of the streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter, ZeroMQ).
- Interfaces
  - Apache Beam - Unified data processing engine supporting both batch and streaming applications. Apache Spark is one of the supported execution environments.
- Bioinformatics
  - ADAM - Set of tools designed to analyse genomics data.
  - Hail - Genetic analysis framework.
- GIS
  - Apache Sedona - Cluster computing system for processing large-scale spatial data.
Resources
- Books
  - Learning Spark, 2nd Edition - Introduction to the Spark API, covering Spark 3.0. A good source of knowledge about basic concepts.
  - Advanced Analytics with Spark - Useful collection of Spark processing patterns. Accompanying GitHub repository: [sryza/aas](https://github.com/sryza/aas).
  - Mastering Apache Spark - Interesting compilation of notes by [Jacek Laskowski](https://github.com/jaceklaskowski). Focused on different aspects of Spark internals.
  - Spark in Action - New book in Manning's "in Action" family with 400+ pages. Starts gently, step by step, and covers a large number of topics. A free excerpt shows how to [set up Eclipse for Spark application development](http://freecontent.manning.com/how-to-start-developing-spark-applications-in-eclipse/) and how to bootstrap a new application using the provided Maven Archetype. You can find the accompanying GitHub repo [here](https://github.com/spark-in-action/first-edition).
- Papers
  - Large-Scale Intelligent Microservices - Microsoft paper that presents an Apache Spark-based micro-service orchestration framework that extends database operations to include web service primitives.
  - Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing - Paper introducing a core distributed memory abstraction.
  - Spark SQL: Relational Data Processing in Spark - Paper introducing relational underpinnings, code generation and Catalyst optimizer.
  - Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark - Paper introducing Structured Streaming, a high-level declarative streaming API based on automatically incrementalizing a static relational query.
- MOOCs
  - Data Science and Engineering with Apache Spark (edX XSeries) - Series of five courses ([Introduction to Apache Spark](https://www.edx.org/course/introduction-apache-spark-uc-berkeleyx-cs105x), [Distributed Machine Learning with Apache Spark](https://www.edx.org/course/distributed-machine-learning-apache-uc-berkeleyx-cs120x), [Big Data Analysis with Apache Spark](https://www.edx.org/course/big-data-analysis-apache-spark-uc-berkeleyx-cs110x), [Advanced Apache Spark for Data Science and Data Engineering](https://www.edx.org/course/advanced-apache-spark-data-science-data-uc-berkeleyx-cs115x), [Advanced Distributed Machine Learning with Apache Spark](https://www.edx.org/course/advanced-distributed-machine-learning-uc-berkeleyx-cs125x)) covering different aspects of software engineering and data science. Python oriented.
  - Big Data Analysis with Scala and Spark (Coursera) - Scala-oriented introductory course. Part of the [Functional Programming in Scala Specialization](https://www.coursera.org/specializations/scala).
- Workshops
  - AMP Camp - Periodic training event organized by the [UC Berkeley AMPLab](https://amplab.cs.berkeley.edu/). A source of useful exercises and recorded workshops covering different tools from the [Berkeley Data Analytics Stack](https://amplab.cs.berkeley.edu/software/).
- Projects Using Spark
  - PredictionIO - Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.
- Docker Images
  - apache/spark - Official Apache Spark Docker images.
  - jupyter/docker-stacks/pyspark-notebook - PySpark with Jupyter Notebook and Mesos client.
  - datamechanics/spark - An easy-to-set-up Docker image for Apache Spark from [Data Mechanics](https://www.datamechanics.co/).
- Miscellaneous
  - Spark with Scala Gitter channel - "_A place to discuss and ask questions about using Scala for Spark programming_" started by [@deanwampler](https://github.com/deanwampler).
  - Apache Spark User List and Apache Spark Developers List - Mailing lists dedicated to usage questions and development topics respectively.
