awesome-spark
A curated list of awesome Apache Spark packages and resources.
https://github.com/awesome-spark/awesome-spark
-
Packages
-
Interfaces
- Apache Beam - Unified data processing engine supporting both batch and streaming applications. Apache Spark is one of the supported execution environments.
- Koalas - Pandas DataFrame API on top of Apache Spark.
-
Notebooks and IDEs
- Polynote - An IDE-inspired polyglot notebook. It supports mixing multiple languages in one notebook and sharing data between them seamlessly. It encourages reproducible notebooks with its immutable data model. Originating from [Netflix](https://medium.com/netflix-techblog/open-sourcing-polynote-an-ide-inspired-polyglot-notebook-7f929d3f447).
- Apache Zeppelin - Web-based notebook that enables interactive data analytics with pluggable backends, integrated plotting, and extensive Spark support out of the box.
- almond - A Scala kernel for [Jupyter](https://jupyter.org/).
- sparkmagic - [Jupyter](https://jupyter.org/) magics and kernels for interactively working with remote Spark clusters through [Livy](https://github.com/cloudera/livy).
-
Language Bindings
- sparkle - Haskell on Apache Spark.
- Mobius - C# bindings (deprecated in favor of .NET for Apache Spark).
- Kotlin for Apache Spark - Kotlin API bindings and extensions.
- .NET for Apache Spark - .NET bindings.
- spark-connect-rs - Rust bindings.
- spark-connect-go - Golang bindings.
- spark-connect-csharp - C# bindings.
-
General Purpose Libraries
- Succinct - Support for efficient queries on compressed data.
- Apache DataFu - A library of general purpose functions and UDFs.
- Joblib Apache Spark Backend - [`joblib`](https://github.com/joblib/joblib) backend for running tasks on Spark clusters.
- itachi - A library that brings useful functions from modern database management systems to Apache Spark.
- spark-daria - A Scala library with essential Spark functions and extensions to make you more productive.
- quinn - A native PySpark implementation of spark-daria.
-
SQL Data Sources
- Several built-in data sources
- Spark XML - XML parser and writer.
- Mongo-Spark - Official MongoDB connector.
- Spark Cassandra Connector - Cassandra support, including a data source, an API, and support for arbitrary queries.
-
Storage
- lakeFS - Integration with the lakeFS atomic versioned storage layer.
- Apache Hudi - Upserts, deletes, and incremental processing on big data.
- Delta Lake - Storage layer with ACID transactions.
- Apache Iceberg - Open table format for large analytic datasets.
-
Graph Processing
- SparklingGraph - Library extending GraphX features with multiple functionalities useful in graph analytics (measures, generators, link prediction, etc.).
- GraphFrames - DataFrame-based graph API.
-
Machine Learning Extension
- Apache SystemML - Declarative machine learning framework on top of Spark.
- Mahout Spark Bindings - Linear algebra DSL and optimizer with R-like syntax.
- KeystoneML - Type-safe machine learning pipelines with RDDs.
- MLflow - Machine learning orchestration platform.
- MLeap - Execution engine and serialization format which supports deployment of `o.a.s.ml` models without dependency on `SparkSession`.
- Sparkling Water - [H2O](http://www.h2o.ai/) interoperability layer.
- ModelDB - A system to manage machine learning models for `spark.ml` and [`scikit-learn`](https://github.com/scikit-learn/scikit-learn).
- BigDL - Distributed deep learning library.
- Microsoft ML for Apache Spark - A distributed ML library with support for LightGBM, Vowpal Wabbit, OpenCV, Deep Learning, Cognitive Services, and Model Deployment.
- JPMML-Spark - PMML transformer library for Spark ML.
-
Streaming
- Apache Bahir - Collection of the streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter, ZeroMQ).
-
Workflow Management
- Cromwell - Workflow management system with [Spark backend](https://github.com/broadinstitute/cromwell#spark-backend).
-
Middleware
- spark-jobserver - Simple Spark-as-a-Service that supports object sharing through so-called named objects. JVM only.
- Apache Kyuubi - A distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark.
- Apache Toree - IPython protocol based middleware for interactive applications.
- Livy - REST server with extensive language support (Python, R, Scala), ability to maintain interactive sessions and object sharing.
-
Data quality
- deequ - Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
- python-deequ - Python API for Deequ.
-
Natural Language Processing
- spark-nlp - Natural language processing library built on top of Apache Spark ML.
-
Web Archives
- Archives Unleashed Toolkit - Open-source toolkit for analyzing web archives.
-
Testing
- chispa - PySpark test helpers with beautiful error messages.
- spark-testing-base - Collection of base test classes.
- spark-fast-tests - A lightweight and fast testing framework.
-
Bioinformatics
-
Utilities
- sparkly - Helpers & syntactic sugar for PySpark.
- Flintrock - A command-line tool for launching Spark clusters on EC2.
- Optimus - Data cleansing and exploration utilities that aim to simplify data cleaning.
-
Monitoring
- Data Mechanics Delight - Cross-platform monitoring tool (Spark UI / Spark History Server replacement).
-
GIS
- Apache Sedona - Cluster computing system for processing large-scale spatial data.
-
-
Resources
-
Books
- Spark in Action - Book in Manning's "in action" family with 400+ pages. Starts gently, proceeds step by step, and covers a large number of topics. Free excerpt on how to [set up Eclipse for Spark application development](http://freecontent.manning.com/how-to-start-developing-spark-applications-in-eclipse/) and how to bootstrap a new application using the provided Maven Archetype. The accompanying GitHub repo is [here](https://github.com/spark-in-action/first-edition).
- Learning Spark, 2nd Edition - Introduction to the Spark API, covering Spark 3.0. A good source of knowledge about basic concepts.
- Advanced Analytics with Spark - Useful collection of Spark processing patterns. Accompanying GitHub repository: [sryza/aas](https://github.com/sryza/aas).
- Mastering Apache Spark - Interesting compilation of notes by [Jacek Laskowski](https://github.com/jaceklaskowski). Focused on different aspects of Spark internals.
-
Papers
- Large-Scale Intelligent Microservices - Microsoft paper that presents an Apache Spark-based micro-service orchestration framework that extends database operations to include web service primitives.
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing - Paper introducing a core distributed memory abstraction.
- Spark SQL: Relational Data Processing in Spark - Paper introducing relational underpinnings, code generation and Catalyst optimizer.
- Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark - Paper introducing Structured Streaming, a high-level declarative streaming API based on automatically incrementalizing a static relational query.
-
MOOCS
- Data Science and Engineering with Apache Spark (edX XSeries) - Series of five courses ([Introduction to Apache Spark](https://www.edx.org/course/introduction-apache-spark-uc-berkeleyx-cs105x), [Distributed Machine Learning with Apache Spark](https://www.edx.org/course/distributed-machine-learning-apache-uc-berkeleyx-cs120x), [Big Data Analysis with Apache Spark](https://www.edx.org/course/big-data-analysis-apache-spark-uc-berkeleyx-cs110x), [Advanced Apache Spark for Data Science and Data Engineering](https://www.edx.org/course/advanced-apache-spark-data-science-data-uc-berkeleyx-cs115x), [Advanced Distributed Machine Learning with Apache Spark](https://www.edx.org/course/advanced-distributed-machine-learning-uc-berkeleyx-cs125x)) covering different aspects of software engineering and data science. Python oriented.
- Big Data Analysis with Scala and Spark (Coursera) - Scala oriented introductory course. Part of [Functional Programming in Scala Specialization](https://www.coursera.org/specializations/scala).
-
Workshops
- AMP Camp - Periodic training event organized by the [UC Berkeley AMPLab](https://amplab.cs.berkeley.edu/). A source of useful exercises and recorded workshops covering different tools from the [Berkeley Data Analytics Stack](https://amplab.cs.berkeley.edu/software/).
-
Projects Using Spark
- PredictionIO - Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.
- Oryx 2 - [Lambda architecture](http://lambda-architecture.net/) platform built on Apache Spark and [Apache Kafka](http://kafka.apache.org/) with specialization for real-time large scale machine learning.
- Photon ML - A machine learning library supporting classical Generalized Mixed Model and Generalized Additive Mixed Effect Model.
- Crossdata - Data integration platform with extended DataSource API and multi-user environment.
-
Docker Images
- apache/spark - Apache Spark Official Docker images.
- jupyter/docker-stacks/pyspark-notebook - PySpark with Jupyter Notebook and Mesos client.
- datamechanics/spark - An easy-to-set-up Docker image for Apache Spark from [Data Mechanics](https://www.datamechanics.co/).
- sequenceiq/docker-spark - YARN images from [SequenceIQ](http://www.sequenceiq.com/).
-
Miscellaneous
- Apache Spark User List and Apache Spark Developers List - Mailing lists dedicated to usage questions and development topics respectively.
- Spark with Scala Gitter channel - "_A place to discuss and ask questions about using Scala for Spark programming_" started by [@deanwampler](https://github.com/deanwampler).