Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/ognis1205/spark-tda

SparkTDA is a package for Apache Spark providing Topological Data Analysis Functionalities.
https://github.com/ognis1205/spark-tda

apache-spark machine-learning ml mllib spark tda topological-data-analysis

Last synced: 2 months ago
JSON representation

SparkTDA is a package for Apache Spark providing Topological Data Analysis Functionalities.

Host: GitHub
URL: https://github.com/ognis1205/spark-tda
Owner: ognis1205
License: apache-2.0
Created: 2018-03-19T13:18:23.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2018-07-08T05:41:48.000Z (almost 6 years ago)
Last Synced: 2024-04-17T00:01:34.552Z (2 months ago)
Topics: apache-spark, machine-learning, ml, mllib, spark, tda, topological-data-analysis
Language: Scala
Size: 31.4 MB
Stars: 47
Watchers: 5
Forks: 5
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE

Lists

awesome-TDA - Spark TDA - Scalable topological data analysis package. (Frameworks and Libs / Spark)

README

        # SparkTDA

[![Build Status](https://travis-ci.org/ognis1205/spark-tda.svg?branch=master)](https://travis-ci.org/ognis1205/spark-tda)

[![codecov.io](https://codecov.io/gh/ognis1205/spark-tda/coverage.svg?branch=master)](https://codecov.io/gh/ognis1205/spark-tda?branch=master)

[![Join the chat at https://gitter.im/ognis1205/spark-tda](https://badges.gitter.im/ognis1205/spark-tda.svg)](https://gitter.im/ognis1205/spark-tda?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

The scalable topological data analysis package for [Apache Spark](http://spark.apache.org/). This project aims to

implement the following features:

- [x] [Scalable Mapper Implemented as Reeb Diagrams, i.e., Reeb Cosheaves](https://github.com/ognis1205/spark-tda/wiki/Mapper)

- [x] [Scalable Mapper Implementation](https://github.com/ognis1205/spark-tda/wiki/Mapper)

- [ ] Scalable Multiscale Mapper Implementation

- [ ] Scalable Tower Computation for Multiscale Mapper

- [ ] Scalable Persistent Homology Computation on Top of Apache Spark

If you would like to know how to use and/or learn more the implementation details of the above mentioned features, please follow the links.

# Status

**WIP** and **EXPERIMENTAL**. This package is still a proof-of-concept of scalable topological data analysis support for

Apache Spark, hence you cannot expect that this package is ready for production use.

# Examples

### Mapper

2-skeltons of Reeb Diagram of MNIST (40 intervals on the 1st primcipal component with 50% overlap) | 2-skeltons of Reeb Diagram of MNIST (20 intervals on the 1st primcipal component with 50% overlap)

:--------------------------------------------------------------------:|:-------------------------------------------------------------------:

60k images clustered in 784 dimensions without any projection loss | 60k images clustered in 784 dimensions witout any projection loss

![](./data/mnist/mnist-k20-s40-l0.5-c0.5-i0.05.png)           | ![](./data/mnist/mnist-k20-s20-l0.5-c0.5-i0.05.png)

# Requirements

This library requires Spark 2.0+

# Building and Running Unit Tests

To compile this project, run `sbt package` from the project home directory. This will also run the Scala unit tests.

To run the unit tests, run `sbt test` from the project home directory. This project uses the

[sbt-spark-package](https://github.com/databricks/sbt-spark-package) plugin, which provides the 'spPublish' and

'spPublishLocal' task. We recommend users to use this library with Apache Spark including the dependencies by

supplying a comma-delimited list of Maven coordinates with `--packages` and download the package from the locally

repository or official [Spark Packages](https://spark-packages.org/package/ognis1205/spark-tda) repository.

### The package can be published locally with:

```bash

$ sbt spPublishLocal

```

### The package can be published to [Spark Packages](https://spark-packages.org/package/ognis1205/spark-tda) with (requires authentication and authorization):

```bash

$ sbt spPublish

```

# Using with Spark Shell

This package can be added to Spark using the `--packages` command line option. For example, to include it when starting

the spark shell:

```bash

$ spark-shell --packages ognis1205:spark-tda:0.0.1-SNAPSHOT-spark2.2-s_2.11

```

# Future Works

### Mapper

- [ ] Write Wiki

- [ ] Implement Python APIs

- [ ] Publish to [Spark Packages](https://spark-packages.org/package/ognis1205/spark-tda)

- [ ] Benchmark

- [ ] Consider using [GraphFrames](https://github.com/graphframes/graphframes) instead of plain GraphX

- [ ] Implement some useful filter functions, e.g., Gaussian Density, Graph Laplacian, etc as transformers

# Related Softwares & Projects

1. [Python Mapper](http://danifold.net/mapper/index.html)

2. [TDAMapper (R)](https://github.com/paultpearson/TDAmapper/)

3. [Spark Mapper (Spark)](https://github.com/log0ymxm/spark-mapper)

4. [KeplerMapper (Python with GUI)](https://github.com/MLWave/kepler-mapper)

# References

### Mapper

1. [G. Singh, F. Memoli, G. Carlsson (2007). Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition, Point Based Graphics 2007, Prague, September 2007.](https://research.math.osu.edu/tgda/mapperPBG.pdf)

2. [J. Curry (2013). Sheaves, Cosheaves and Applications, arXiv 2013](https://arxiv.org/abs/1303.3255)

3. [T. K. Dey, F. Memoli, Y. Wang (2015), Mutiscale Mapper: A Framework for Topological Summarization of Data and Maps, arXiv 2015](https://arxiv.org/abs/1504.03763)

4. [E. Munch, B. Wang (2015). Convergence between Categorical Representations of Reeb Space and Mapper, arXiv 2015](https://arxiv.org/abs/1512.04108)

5. [E. Munch, B. Wang (2015). Reeb Space Approximation with Guarantees, The 25th Fall Workshop on Computational Geometry 2015.](https://www.cse.buffalo.edu/fwcg2015/assets/pdf/FWCG_2015_paper_2.pdf)

6. [H. E. Kim (2015). Evaluating Ayasdi's Topological Data Analysis for Big Data, Master Thesis, Goethe University Frankfurt 2015.](http://www.bigdata.uni-frankfurt.de/wp-content/uploads/2015/10/Evaluating-Ayasdi’s-Topological-Data-Analysis-For-Big-Data_HKim2015.pdf)

### KNN/ANN/SNN

1. [L. Ting, et al (2004). An investigation of practical approximate nearest neighbor algorithms, Advances in neural information processing systems. 2004.](http://www.cs.cmu.edu/~agray/approxnn.pdf)

2. [L. Ting, C. Rosenberg, H. Rowley (2007). Clustering billions of images with large scale nearest neighbor search. Applications of Computer Vision, 2007. WACV'07. IEEE Workshop on. IEEE, 2007.](https://ieeexplore.ieee.org/document/4118757/)

3. [D. Ravichandran, P. Pantel, E. Hovy (2005). Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering, ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics pp 622-629 ](https://dl.acm.org/citation.cfm?id=1219917)

4. [M. Steinbach, L. Ertoez, V. Kumar (2004). The Challenges of Clustering High Dimensional Data, New Directions in Statistical Physics, pp 273-309](https://www-users.cs.umn.edu/~kumar001/papers/high_dim_clustering_19.pdf)

5. [L. Ertoez, M. Steinbach, Vipin Kumar (2003). Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data, Proceedings of the Third SIAM International Conference on Data Mining, 2003.](https://www-users.cs.umn.edu/~kumar001/papers/SIAM_snn.pdf)

6. [M. E. Houle, H. P. Kriegel, P. Kroeger, E. S. A. Zimek (2010). Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?, Proceedings of the 22nd International Conference on Scientific and Statistical Database Management, 2010.](https://imada.sdu.dk/~zimek/publications/SSDBM2010/SNN-SSDBM2010-preprint.pdf)

### LSH

1. [M. S. Charikar (2002). Similarity Estimation Techniques from Rounding Algorithms, 34th STOC, 2002.](http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf)