Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/paul-english/spark-mapper
Spark based implementation of the Topological Mapper algorithm
spark topological-data-analysis topology
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/paul-english/spark-mapper
- Owner: paul-english
- License: apache-2.0
- Created: 2017-05-01T00:16:34.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2017-05-16T16:56:25.000Z (about 7 years ago)
- Last Synced: 2024-04-15T02:52:49.784Z (2 months ago)
- Topics: spark, topological-data-analysis, topology
- Language: Scala
- Homepage: https://log0ymxm.github.io/spark-mapper/scaladoc-latest
- Size: 708 KB
- Stars: 14
- Watchers: 4
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Lists
- awesome-TDA - Spark Mapper - Estimating a lower dimensional simplicial complex from a dataset. (Frameworks and Libs / Spark)
README
# Spark Mapper
[![Build status](https://api.travis-ci.org/log0ymxm/spark-mapper.svg?branch=master)](https://travis-ci.org/log0ymxm/spark-mapper)
[![codecov](https://codecov.io/gh/log0ymxm/spark-mapper/branch/master/graph/badge.svg)](https://codecov.io/gh/log0ymxm/spark-mapper)
[![Maven Central](https://img.shields.io/maven-central/v/com.github.log0ymxm/spark-mapper_2.11.svg)](http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22com.github.log0ymxm%22%20AND%20a%3A%22spark-mapper_2.11%22)

Mapper is a topological data analysis technique for estimating a lower-dimensional simplicial complex from a dataset. It was initially described in the paper "Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition" [1].
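To make the idea concrete, here is a minimal sketch of the Mapper pipeline (filter, overlapping cover, per-interval clustering, nerve graph) written with plain Scala collections. It is not the spark-mapper API and does not use Spark: the names `MapperSketch`, `cover`, `cluster`, `mapper`, and `loop`, as well as the parameters `n`, `overlap`, and `eps`, are hypothetical and chosen only for illustration. As a further simplifying assumption, clustering is done by linking points closer than a fixed distance `eps`, in place of a single-linkage clustering with a data-driven cutoff.

```scala
object MapperSketch {
  type Point = Vector[Double]

  // Euclidean distance between two points of equal dimension.
  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Split [lo, hi] into n equal intervals, then widen each one on both
  // sides by `overlap` times the base interval length so neighbours overlap.
  def cover(lo: Double, hi: Double, n: Int, overlap: Double): Seq[(Double, Double)] = {
    val len = (hi - lo) / n
    (0 until n).map { i =>
      (lo + i * len - overlap * len, lo + (i + 1) * len + overlap * len)
    }
  }

  // Assumption: clusters are the connected components of the graph that
  // links points at distance <= eps (single linkage with a fixed cutoff).
  def cluster(ids: Seq[Int], pts: IndexedSeq[Point], eps: Double): Seq[Set[Int]] =
    ids.foldLeft(List.empty[Set[Int]]) { (clusters, i) =>
      // Merge every existing cluster that has a point within eps of point i.
      val (near, far) = clusters.partition(_.exists(j => dist(pts(i), pts(j)) <= eps))
      (near.flatten.toSet + i) :: far
    }

  // Returns the nodes (clusters of point indices) and the edges between
  // nodes that share at least one point, i.e. the 1-skeleton of the nerve.
  def mapper(pts: IndexedSeq[Point], filter: Point => Double,
             n: Int, overlap: Double, eps: Double): (Vector[Set[Int]], Set[(Int, Int)]) = {
    val values = pts.map(filter)
    val nodes = cover(values.min, values.max, n, overlap).flatMap { case (a, b) =>
      // Clustering is local to the points whose filter value lies in [a, b].
      val ids = pts.indices.filter(i => values(i) >= a && values(i) <= b)
      cluster(ids, pts, eps)
    }.toVector
    val edges = for {
      i <- nodes.indices
      j <- nodes.indices
      if i < j && (nodes(i) & nodes(j)).nonEmpty
    } yield (i, j)
    (nodes, edges.toSet)
  }

  // Toy data: six points arranged in a closed loop in the plane.
  val loop: Vector[Point] = Vector(
    Vector(0.0, 0.0), Vector(1.0, 1.0), Vector(2.0, 1.0),
    Vector(3.0, 0.0), Vector(2.0, -1.0), Vector(1.0, -1.0)
  )

  def main(args: Array[String]): Unit = {
    // Filter = x coordinate, 3 intervals, 25% overlap, linkage cutoff 1.5.
    val (nodes, edges) = mapper(loop, p => p(0), n = 3, overlap = 0.25, eps = 1.5)
    nodes.zipWithIndex.foreach { case (c, i) => println(s"node $i: ${c.toList.sorted}") }
    edges.toList.sorted.foreach { case (i, j) => println(s"edge $i - $j") }
  }
}
```

With these toy inputs the sketch should produce four nodes joined in a single cycle, which reflects the loop in the data. A distributed version would replace the in-memory collections with Spark datasets; that part is deliberately left out here.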
Concentric Circles | MNIST Twos
:-------------------------:|:-------------------------:
![Concentric circles](https://github.com/log0ymxm/spark-mapper/raw/master/examples/concentric_circles.png) | ![MNIST](https://github.com/log0ymxm/spark-mapper/raw/master/examples/mnist_twos.png)

# Things to do
- [ ] Improve the handling of pairwise distances. This is likely the largest bottleneck for large datasets.
- [ ] Implement some useful filter functions: Gaussian density, graph Laplacian, etc.
- [ ] Implement different methods for choosing the cluster cutoff. There are a few simple ones we can try, as well as the scale graph idea.
- [ ] Explore using a distributed clustering algorithm. Currently clustering is local to each cover segment, which means that as the data grows you need to increase the number of cover intervals proportionally to keep each partition within memory. A distributed clustering algorithm would remove this requirement.

# Related Software
- [Python Mapper](http://danifold.net/mapper/index.html)
- [TDA Mapper (R)](https://github.com/paultpearson/TDAmapper/)

# References
1. G. Singh, F. Mémoli, G. Carlsson (2007). Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. Point Based Graphics 2007, Prague, September 2007.
2. D. Müllner, A. Babu (2013). Python Mapper: An open-source toolchain for data exploration, analysis and visualization. http://danifold.net/mapper