awesome-spark
A curated list of awesome Apache Spark packages and resources.
https://github.com/eliasah/awesome-spark
- Flambo (yieldbot/flambo) - Clojure DSL.
- Mobius (Microsoft/Mobius) - C# bindings.
- sparklyr (rstudio/sparklyr) - An alternative R backend, using [`dplyr`](https://github.com/hadley/dplyr).
- sparkle (tweag/sparkle) - Haskell on Apache Spark.
- Apache Zeppelin (apache/zeppelin) - Web-based notebook that enables interactive data analytics with pluggable backends, integrated plotting, and extensive Spark support out of the box.
- Spark Notebook (spark-notebook/spark-notebook) - Scalable and stable Scala and Spark focused notebook bridging the gap between JVM and Data Scientists (incl. extendable, typesafe and reactive charts).
- sparkmagic (jupyter-incubator/sparkmagic) - [Jupyter](https://jupyter.org/) magics and kernels for working interactively with remote Spark clusters through [Livy](https://github.com/cloudera/livy) in Jupyter notebooks.
- Succinct (amplab/succinct) - Support for efficient queries on compressed data.
- Spark CSV (databricks/spark-csv) - CSV reader and writer (obsolete since Spark 2.0 [[SPARK-12833]](https://issues.apache.org/jira/browse/SPARK-12833)).
- Spark Avro (databricks/spark-avro) - [Apache Avro](https://avro.apache.org/) reader and writer.
- Spark XML (databricks/spark-xml) - XML parser and writer.
- Spark-Mongodb (Stratio/Spark-MongoDB) - MongoDB reader and writer.
- Spark Cassandra Connector (datastax/spark-cassandra-connector) - Cassandra support including data source and API and support for arbitrary queries.
- Spark Riak Connector (basho/spark-riak-connector) - Riak TS & Riak KV connector.
- Mongo-Spark (mongodb/mongo-spark) - Official MongoDB connector.
- OrientDB-Spark (orientechnologies/spark-orientdb) - Official OrientDB connector.
- ADAM (bigdatagenomics/adam) - Set of tools designed to analyse genomics data.
- Hail (hail-is/hail) - Genetic analysis framework.
- Magellan (harsha2010/magellan) - Geospatial analytics using Spark.
- GeoSpark (Sarwat/GeoSpark) - Cluster computing system for processing large-scale spatial data.
- Spark-Timeseries (cloudera/spark-timeseries) - Scala / Java / Python library for interacting with time series data on Apache Spark.
- flint (twosigma/flint) - A time series library for Apache Spark.
- Mazerunner (neo4j-contrib/neo4j-mazerunner) - Graph analytics platform on top of Neo4j and GraphX.
- GraphFrames (graphframes/graphframes) - Data frame based graph API.
- neo4j-spark-connector (neo4j-contrib/neo4j-spark-connector) - Bolt protocol based Neo4j connector with RDD, DataFrame and GraphX / GraphFrames support.
- SparklingGraph (sparkling-graph/sparkling-graph) - Library extending GraphX features with multiple functionalities useful in graph analytics (measures, generators, link prediction etc.).
- dbscan-on-spark (irvingc/dbscan-on-spark) - An implementation of the DBSCAN clustering algorithm on top of Apache Spark by [irvingc](https://github.com/irvingc), based on the paper by He, Yaobin, et al. [MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data](https://www.researchgate.net/profile/Yaobin_He/publication/260523383_MR-DBSCAN_a_scalable_MapReduce-based_DBSCAN_algorithm_for_heavily_skewed_data/links/0046353a1763ee2bdf000000.pdf).
- Apache SystemML (apache/systemml) - Declarative machine learning framework on top of Spark.
- Mahout Spark Bindings - Linear algebra DSL and optimizer with R-like syntax.
- spark-sklearn (databricks/spark-sklearn) - Scikit-learn integration with distributed model training.
- KeystoneML - Type safe machine learning pipelines with RDDs.
- JPMML-Spark (jpmml/jpmml-spark) - PMML transformer library for Spark ML.
- Distributed Keras (cerndb/dist-keras) - Distributed deep learning framework with PySpark and Keras.
- ModelDB (mitdbg/modeldb) - A system to manage machine learning models for `spark.ml` and [`scikit-learn`](https://github.com/scikit-learn/scikit-learn).
- Sparkling Water (h2oai/sparkling-water) - [H2O](http://www.h2o.ai/) interoperability layer.
- BigDL (intel-analytics/BigDL) - Distributed Deep Learning library.
- MLeap (combust/mleap) - Execution engine and serialization format which supports deployment of `o.a.s.ml` models without dependency on `SparkSession`.
- Livy (cloudera/livy) - REST server with extensive language support (Python, R, Scala), ability to maintain interactive sessions and object sharing.
- spark-jobserver (spark-jobserver/spark-jobserver) - Simple Spark as a Service which supports object sharing using so-called named objects. JVM only.
- Mist (Hydrospheredata/mist) - Service for exposing Spark analytical jobs and machine learning models as realtime, batch or reactive web services.
- Apache Toree (apache/incubator-toree) - IPython protocol based middleware for interactive applications.
- silex (willb/silex) - Collection of tools varying from ML extensions to additional RDD methods.
- sparkly (Tubular/sparkly) - Helpers & syntactic sugar for PySpark.
- pyspark-stubs (zero323/pyspark-stubs) - Static type annotations for PySpark.
- Flintrock (nchammas/flintrock) - A command-line tool for launching Spark clusters on EC2.
- spark-corenlp (databricks/spark-corenlp) - DataFrame wrapper for [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/).
- spark-nlp (JohnSnowLabs/spark-nlp) - Natural language processing library built on top of Apache Spark ML.
- Apache Bahir (apache/bahir) - Collection of the streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter, ZeroMQ).
- Apache Beam (apache/beam) - Unified data processing engine supporting both batch and streaming applications. Apache Spark is one of the supported execution environments.
- Blaze (blaze/blaze) - Interface for querying larger-than-memory datasets using Pandas-like syntax. It supports both Spark `DataFrames` and `RDDs`.
- spark-testing-base (holdenk/spark-testing-base) - Collection of base test classes.
- spark-fast-tests (MrPowers/spark-fast-tests) - A lightweight and fast testing framework.
- Cromwell (broadinstitute/cromwell) - Workflow management system with [Spark backend](https://github.com/broadinstitute/cromwell#spark-backend).
- Learning Spark, Lightning-Fast Big Data Analysis - Slightly outdated (Spark 1.3) introduction to the Spark API. Good source of knowledge about basic concepts.
- Advanced Analytics with Spark - Useful collection of Spark processing patterns. Accompanying GitHub repository: [sryza/aas](https://github.com/sryza/aas).
- Mastering Apache Spark - Interesting compilation of notes by [Jacek Laskowski](https://github.com/jaceklaskowski). Focused on different aspects of Spark internals.
- Spark Gotchas - Subjective compilation of tips, tricks and common programming mistakes.
- Spark in Action - New book in Manning's "in Action" family with 400+ pages. Starts gently, proceeds step by step, and covers a large number of topics. Free excerpt on how to [set up Eclipse for Spark application development](http://freecontent.manning.com/how-to-start-developing-spark-applications-in-eclipse/) and how to bootstrap a new application using the provided Maven Archetype. The accompanying GitHub repository is [spark-in-action/first-edition](https://github.com/spark-in-action/first-edition).
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing - Paper introducing a core distributed memory abstraction.
- Spark SQL: Relational Data Processing in Spark - Paper introducing relational underpinnings, code generation and the Catalyst optimizer.
- Data Science and Engineering with Apache Spark (edX XSeries) - Series of five courses ([Introduction to Apache Spark](https://www.edx.org/course/introduction-apache-spark-uc-berkeleyx-cs105x), [Distributed Machine Learning with Apache Spark](https://www.edx.org/course/distributed-machine-learning-apache-uc-berkeleyx-cs120x), [Big Data Analysis with Apache Spark](https://www.edx.org/course/big-data-analysis-apache-spark-uc-berkeleyx-cs110x), [Advanced Apache Spark for Data Science and Data Engineering](https://www.edx.org/course/advanced-apache-spark-data-science-data-uc-berkeleyx-cs115x), [Advanced Distributed Machine Learning with Apache Spark](https://www.edx.org/course/advanced-distributed-machine-learning-uc-berkeleyx-cs125x)) covering different aspects of software engineering and data science. Python oriented.
- Big Data Analysis with Scala and Spark (Coursera) - Scala oriented introductory course. Part of the [Functional Programming in Scala Specialization](https://www.coursera.org/specializations/scala).
- AMP Camp - Periodic training event organized by the [UC Berkeley AMPLab](https://amplab.cs.berkeley.edu/). A source of useful exercises and recorded workshops covering different tools from the [Berkeley Data Analytics Stack](https://amplab.cs.berkeley.edu/software/).
- Oryx 2 - [Lambda architecture](http://lambda-architecture.net/) platform built on Apache Spark and [Apache Kafka](http://kafka.apache.org/) with specialization for real-time large scale machine learning.
- Photon ML - A machine learning library supporting classical Generalized Mixed Model and Generalized Additive Mixed Effect Model.
- PredictionIO - Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.
- Crossdata - Data integration platform with extended DataSource API and multi-user environment.
- Spark Technology Center - Great source of highly diverse posts related to the Spark ecosystem, from practical advice to Spark committer profiles.
- jupyter/docker-stacks/pyspark-notebook - PySpark with Jupyter Notebook and Mesos client.
- sequenceiq/docker-spark - YARN images from [SequenceIQ](http://www.sequenceiq.com/).
- Spark with Scala Gitter channel - "_A place to discuss and ask questions about using Scala for Spark programming_" started by [@deanwampler](https://github.com/deanwampler).
- Apache Spark User List and Apache Spark Developers List - Mailing lists dedicated to usage questions and development topics respectively.
- sindresorhus/awesome
Keywords
- spark (14)
- apache-spark (8)
- scala (6)
- pyspark (6)
- machine-learning (5)
- python (5)
- big-data (4)
- tensorflow (3)
- bioinformatics (3)
- java (2)
- scikit-learn (2)
- genomics (2)
- transformers (2)
- geospatial-processing (1)
- geospatial-analytics (1)
- geospatial-analysis (1)
- geospatial (1)
- geometric-algorithms (1)
- geojson (1)
- vcf (1)
- software (1)
- hail (1)
- gwas (1)
- genetics (1)
- r (1)
- magellan (1)
- shapefile (1)
- sparksql (1)
- timeseries (1)
- bolt (1)
- cypher (1)
- neo4j-connector (1)
- neo4j-driver (1)
- grid-search (1)
- bigdata (1)
- parameter-tuning (1)
- data-parallelism (1)
- data-science (1)
- deep-learning (1)
- fsharp (1)
- kafka-streaming (1)
- mapreduce (1)
- mobius (1)
- near-real-time (1)
- rdd (1)
- eventhubs (1)
- spark-streaming (1)
- streaming (1)
- cluster (1)
- jupyter (1)