Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

awesome-spark

A curated list of awesome Apache Spark packages and resources.
https://github.com/awesome-spark/awesome-spark

  • Packages

    • Language Bindings

      • Flambo - Clojure DSL.
      • sparkle - Haskell on Apache Spark.
      • Kotlin for Apache Spark - Kotlin API bindings and extensions.
      • Mobius - C# bindings (Deprecated in favor of .NET for Apache Spark).
      • .NET for Apache Spark - .NET bindings.
      • spark-connect-rs - Rust bindings.
      • spark-connect-go - Golang bindings.
      • spark-connect-csharp - C# bindings.
      • sparklyr - An alternative R backend, using [`dplyr`](https://github.com/hadley/dplyr).
    • Notebooks and IDEs

      • almond - A Scala kernel for [Jupyter](https://jupyter.org/).
      • Apache Zeppelin - Web-based notebook that enables interactive data analytics with pluggable backends, integrated plotting, and extensive Spark support out-of-the-box.
      • Polynote - An IDE-inspired polyglot notebook. It supports mixing multiple languages in one notebook, and sharing data between them seamlessly. It encourages reproducible notebooks with its immutable data model. Originating from [Netflix](https://medium.com/netflix-techblog/open-sourcing-polynote-an-ide-inspired-polyglot-notebook-7f929d3f447).
      • Spark Notebook - Scalable and stable Scala and Spark focused notebook bridging the gap between JVM and Data Scientists (incl. extendable, typesafe and reactive charts).
      • sparkmagic - [Jupyter](https://jupyter.org/) magics and kernels for working interactively with remote Spark clusters through [Livy](https://github.com/cloudera/livy), in Jupyter notebooks.
    • General Purpose Libraries

      • Succinct - Support for efficient queries on compressed data.
      • Apache DataFu - A library of general purpose functions and UDFs.
      • itachi - A library that brings useful functions from modern database management systems to Apache Spark.
      • spark-daria - A Scala library with essential Spark functions and extensions to make you more productive.
      • Joblib Apache Spark Backend - [`joblib`](https://github.com/joblib/joblib) backend for running tasks on Spark clusters.
      • quinn - A native PySpark implementation of spark-daria.
    • SQL Data Sources

    • Storage

      • lakeFS - Integration with the lakeFS atomic versioned storage layer.
      • Delta Lake - Storage layer with ACID transactions.
      • Apache Hudi - Upserts, Deletes And Incremental Processing on Big Data.
      • Apache Iceberg - Open table format for large analytic datasets.
    • Graph Processing

      • SparklingGraph - Library extending GraphX features with multiple functionalities useful in graph analytics (measures, generators, link prediction etc.).
      • neo4j-spark-connector - Bolt protocol based, Neo4j Connector with RDD, DataFrame and GraphX / GraphFrames support.
      • GraphFrames - Data frame based graph API.
    • Machine Learning Extension

      • Apache SystemML - Declarative machine learning framework on top of Spark.
      • Mahout Spark Bindings - Linear algebra DSL and optimizer with R-like syntax.
      • KeystoneML - Type safe machine learning pipelines with RDDs.
      • Microsoft ML for Apache Spark - A distributed ML library with support for LightGBM, Vowpal Wabbit, OpenCV, Deep Learning, Cognitive Services, and Model Deployment.
      • MLflow - Machine learning orchestration platform.
    • Utilities

      • Optimus - Data Cleansing and Exploration utilities with the goal of simplifying data cleaning.
    • Streaming

      • Apache Bahir - Collection of the streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter, ZeroMQ).
    • Interfaces

      • Apache Beam - Unified data processing engine supporting both batch and streaming applications. Apache Spark is one of the supported execution environments.
    • Bioinformatics

      • ADAM - Set of tools designed to analyse genomics data.
      • Hail - Genetic analysis framework.
  • Resources

    • Books

      • Learning Spark, 2nd Edition - Introduction to Spark API with Spark 3.0 covered. Good source of knowledge about basic concepts.
      • Advanced Analytics with Spark - Useful collection of Spark processing patterns. Accompanying GitHub repository: [sryza/aas](https://github.com/sryza/aas).
      • Mastering Apache Spark - Interesting compilation of notes by [Jacek Laskowski](https://github.com/jaceklaskowski). Focused on different aspects of Spark internals.
      • Spark in Action - A book in Manning's "in Action" family with 400+ pages. Starts gently, step by step, and covers a large number of topics. Free excerpt on how to [set up Eclipse for Spark application development](http://freecontent.manning.com/how-to-start-developing-spark-applications-in-eclipse/) and how to bootstrap a new application using the provided Maven Archetype. You can find the accompanying GitHub repo [here](https://github.com/spark-in-action/first-edition).
    • Papers

    • MOOCS

      • Data Science and Engineering with Apache Spark (edX XSeries) - Series of five courses ([Introduction to Apache Spark](https://www.edx.org/course/introduction-apache-spark-uc-berkeleyx-cs105x), [Distributed Machine Learning with Apache Spark](https://www.edx.org/course/distributed-machine-learning-apache-uc-berkeleyx-cs120x), [Big Data Analysis with Apache Spark](https://www.edx.org/course/big-data-analysis-apache-spark-uc-berkeleyx-cs110x), [Advanced Apache Spark for Data Science and Data Engineering](https://www.edx.org/course/advanced-apache-spark-data-science-data-uc-berkeleyx-cs115x), [Advanced Distributed Machine Learning with Apache Spark](https://www.edx.org/course/advanced-distributed-machine-learning-uc-berkeleyx-cs125x)) covering different aspects of software engineering and data science. Python oriented.
      • Big Data Analysis with Scala and Spark (Coursera) - Scala oriented introductory course. Part of [Functional Programming in Scala Specialization](https://www.coursera.org/specializations/scala).
    • Workshops

      • AMP Camp - Periodic training event organized by the [UC Berkeley AMPLab](https://amplab.cs.berkeley.edu/). A source of useful exercises and recorded workshops covering different tools from the [Berkeley Data Analytics Stack](https://amplab.cs.berkeley.edu/software/).
    • Projects Using Spark

      • PredictionIO - Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.
    • Docker Images

    • Miscellaneous

      • Spark with Scala Gitter channel - "_A place to discuss and ask questions about using Scala for Spark programming_" started by [@deanwampler](https://github.com/deanwampler).
      • Apache Spark User List and Apache Spark Developers List - Mailing lists dedicated to usage questions and development topics respectively.