Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

awesome-spark

A curated list of awesome Apache Spark packages and resources.
https://github.com/eliasah/awesome-spark


Packages
- Language Bindings
  - Flambo - Clojure DSL.
  - Mobius - C# bindings.
  - sparklyr - An alternative R backend, using [`dplyr`](https://github.com/hadley/dplyr).
  - sparkle - Haskell on Apache Spark.
- Notebooks and IDEs
  - sparkmagic - [Jupyter](https://jupyter.org/) magics and kernels for working interactively with remote Spark clusters through [Livy](https://github.com/cloudera/livy).
  - Apache Zeppelin - Web-based notebook that enables interactive data analytics with pluggable backends, integrated plotting, and extensive Spark support out of the box.
- General Purpose Libraries
  - Succinct - Support for efficient queries on compressed data.
- SQL Data Sources
  - Spark CSV - CSV reader and writer (obsolete since Spark 2.0 [[SPARK-12833]](https://issues.apache.org/jira/browse/SPARK-12833)).
  - Spark Avro - [Apache Avro](https://avro.apache.org/) reader and writer.
  - Spark-Mongodb - MongoDB reader and writer.
  - Spark Cassandra Connector - Cassandra support, including a data source, an API, and support for arbitrary queries.
  - Spark Riak Connector - Riak TS & Riak KV connector.
  - Mongo-Spark - Official MongoDB connector.
  - OrientDB-Spark - Official OrientDB connector.
  - Spark XML - XML parser and writer.
- Bioinformatics
  - ADAM - Set of tools designed to analyse genomics data.
  - Hail - Genetic analysis framework.
- GIS
  - Magellan - Geospatial analytics using Spark.
- Time Series Analytics
  - flint - A time series library for Apache Spark.
- Graph Processing
  - Mazerunner - Graph analytics platform on top of Neo4j and GraphX.
  - GraphFrames - Data frame based graph API.
  - neo4j-spark-connector - Bolt protocol based Neo4j connector with RDD, DataFrame and GraphX / GraphFrames support.
  - SparklingGraph - Library extending GraphX features with multiple functionalities useful in graph analytics (measures, generators, link prediction etc.).
- Machine Learning Extension
  - dbscan-on-spark - An implementation of the DBSCAN clustering algorithm on top of Apache Spark by [irvingc](https://github.com/irvingc), based on the paper by He, Yaobin, et al. [MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data](https://www.researchgate.net/profile/Yaobin_He/publication/260523383_MR-DBSCAN_a_scalable_MapReduce-based_DBSCAN_algorithm_for_heavily_skewed_data/links/0046353a1763ee2bdf000000.pdf).
  - spark-sklearn - Scikit-learn integration with distributed model training.
  - JPMML-Spark - PMML transformer library for Spark ML.
  - Distributed Keras - Distributed deep learning framework with PySpark and Keras.
  - ModelDB - A system to manage machine learning models for `spark.ml` and [`scikit-learn`](https://github.com/scikit-learn/scikit-learn).
  - Sparkling Water - [H2O](http://www.h2o.ai/) interoperability layer.
  - MLeap - Execution engine and serialization format which supports deployment of `o.a.s.ml` models without dependency on `SparkSession`.
  - Mahout Spark Bindings - Linear algebra DSL and optimizer with R-like syntax.
  - KeystoneML - Type safe machine learning pipelines with RDDs.
- Middleware
  - Livy - REST server with extensive language support (Python, R, Scala), ability to maintain interactive sessions and object sharing.
  - spark-jobserver - Simple Spark as a Service which supports object sharing using so-called named objects. JVM only.
  - Mist - Service for exposing Spark analytical jobs and machine learning models as realtime, batch or reactive web services.
  - Apache Toree - IPython protocol based middleware for interactive applications.
- Utilities
  - silex - Collection of tools varying from ML extensions to additional RDD methods.
  - sparkly - Helpers & syntactic sugar for PySpark.
  - pyspark-stubs - Static type annotations for PySpark.
  - Flintrock - A command-line tool for launching Spark clusters on EC2.
- Natural Language Processing
  - spark-corenlp - DataFrame wrapper for [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/).
  - spark-nlp - Natural language processing library built on top of Apache Spark ML.
- Interfaces
  - Blaze - Interface for querying larger-than-memory datasets using Pandas-like syntax. It supports both Spark `DataFrames` and `RDDs`.
  - Apache Beam - Unified data processing engine supporting both batch and streaming applications. Apache Spark is one of the supported execution environments.
- Testing
  - spark-testing-base - Collection of base test classes.
  - spark-fast-tests - A lightweight and fast testing framework.
- Workflow Management
  - Cromwell - Workflow management system with [Spark backend](https://github.com/broadinstitute/cromwell#spark-backend).
- Streaming
  - Apache Bahir - Collection of the streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter, ZeroMQ).
Resources
- Books
  - Spark Gotchas - Subjective compilation of tips, tricks and common programming mistakes.
  - Learning Spark, Lightning-Fast Big Data Analysis - Slightly outdated (Spark 1.3) introduction to the Spark API. Good source of knowledge about basic concepts.
  - Spark in Action - New book in Manning's "in action" family with 400+ pages. Starts gently, proceeds step by step, and covers a large number of topics. Free excerpt on how to [set up Eclipse for Spark application development](http://freecontent.manning.com/how-to-start-developing-spark-applications-in-eclipse/) and how to bootstrap a new application using the provided Maven Archetype. You can find the accompanying GitHub repo [here](https://github.com/spark-in-action/first-edition).
- Projects Using Spark
  - Oryx 2 - [Lambda architecture](http://lambda-architecture.net/) platform built on Apache Spark and [Apache Kafka](http://kafka.apache.org/) with specialization for real-time large scale machine learning.
  - Photon ML - A machine learning library supporting classical Generalized Mixed Model and Generalized Additive Mixed Effect Model.
  - Crossdata - Data integration platform with extended DataSource API and multi-user environment.
  - PredictionIO - Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.
- Docker Images
  - sequenceiq/docker-spark - Yarn images from [SequenceIQ](http://www.sequenceiq.com/).
  - jupyter/docker-stacks/pyspark-notebook - PySpark with Jupyter Notebook and Mesos client.
- Papers
  - Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing - Paper introducing a core distributed memory abstraction.
  - Spark SQL: Relational Data Processing in Spark - Paper introducing relational underpinnings, code generation and Catalyst optimizer.
- MOOCS
  - Data Science and Engineering with Apache Spark (edX XSeries) - Series of five courses ([Introduction to Apache Spark](https://www.edx.org/course/introduction-apache-spark-uc-berkeleyx-cs105x), [Distributed Machine Learning with Apache Spark](https://www.edx.org/course/distributed-machine-learning-apache-uc-berkeleyx-cs120x), [Big Data Analysis with Apache Spark](https://www.edx.org/course/big-data-analysis-apache-spark-uc-berkeleyx-cs110x), [Advanced Apache Spark for Data Science and Data Engineering](https://www.edx.org/course/advanced-apache-spark-data-science-data-uc-berkeleyx-cs115x), [Advanced Distributed Machine Learning with Apache Spark](https://www.edx.org/course/advanced-distributed-machine-learning-uc-berkeleyx-cs125x)) covering different aspects of software engineering and data science. Python oriented.
- Miscellaneous
  - Apache Spark User List and Apache Spark Developers List - Mailing lists dedicated to usage questions and development topics respectively.
- Blogs
  - Spark Technology Center - Great source of highly diverse posts related to the Spark ecosystem, from practical advice to Spark committer profiles.

Programming Languages

Scala 25 Python 8 Java 5 Haskell 1 Shell 1 Terra 1 R 1 C# 1 Clojure 1
