{"id":33167169,"url":"https://github.com/ognis1205/spark-tda","last_synced_at":"2025-11-20T18:01:39.121Z","repository":{"id":127457490,"uuid":"125860501","full_name":"ognis1205/spark-tda","owner":"ognis1205","description":"SparkTDA is a package for Apache Spark providing Topological Data Analysis Functionalities.","archived":false,"fork":false,"pushed_at":"2018-07-08T05:41:48.000Z","size":32972,"stargazers_count":47,"open_issues_count":2,"forks_count":5,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-04-20T07:35:03.487Z","etag":null,"topics":["apache-spark","machine-learning","ml","mllib","spark","tda","topological-data-analysis"],"latest_commit_sha":null,"homepage":null,"language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ognis1205.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2018-03-19T13:18:23.000Z","updated_at":"2024-02-12T18:12:08.000Z","dependencies_parsed_at":null,"dependency_job_id":"d3d30dce-2955-4492-9d93-cf08fddc4865","html_url":"https://github.com/ognis1205/spark-tda","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ognis1205/spark-tda","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ognis1205%2Fspark-tda","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ognis1205%2Fspark-tda/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ognis1205%2Fspark-tda/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ognis1205%2Fspark-tda/manifests","owner_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ognis1205","download_url":"https://codeload.github.com/ognis1205/spark-tda/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ognis1205%2Fspark-tda/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":285484486,"owners_count":27179744,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-20T02:00:05.334Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","machine-learning","ml","mllib","spark","tda","topological-data-analysis"],"created_at":"2025-11-16T00:00:44.022Z","updated_at":"2025-11-20T18:01:39.116Z","avatar_url":"https://github.com/ognis1205.png","language":"Scala","funding_links":[],"categories":["Frameworks and Libs"],"sub_categories":["Spark"],"readme":"# SparkTDA\n\n[![Build Status](https://travis-ci.org/ognis1205/spark-tda.svg?branch=master)](https://travis-ci.org/ognis1205/spark-tda)\n[![codecov.io](https://codecov.io/gh/ognis1205/spark-tda/coverage.svg?branch=master)](https://codecov.io/gh/ognis1205/spark-tda?branch=master)\n[![Join the chat at https://gitter.im/ognis1205/spark-tda](https://badges.gitter.im/ognis1205/spark-tda.svg)](https://gitter.im/ognis1205/spark-tda?utm_source=badge\u0026utm_medium=badge\u0026utm_campaign=pr-badge\u0026utm_content=badge)\n\nThe scalable topological data 
analysis package for [Apache Spark](http://spark.apache.org/). This project aims to\nimplement the following features:\n\n- [x] [Scalable Mapper Implemented as Reeb Diagrams, i.e., Reeb Cosheaves](https://github.com/ognis1205/spark-tda/wiki/Mapper)\n- [x] [Scalable Mapper Implementation](https://github.com/ognis1205/spark-tda/wiki/Mapper)\n- [ ] Scalable Multiscale Mapper Implementation\n- [ ] Scalable Tower Computation for Multiscale Mapper\n- [ ] Scalable Persistent Homology Computation on Top of Apache Spark\n\nTo learn how to use the features above, or to read more about their implementation details, please follow the links.\n\n# Status\n\n**WIP** and **EXPERIMENTAL**. This package is still a proof-of-concept of scalable topological data analysis support for\nApache Spark, so it should not be considered ready for production use.\n\n# Examples\n\n### Mapper\n\n2-skeletons of Reeb Diagram of MNIST (40 intervals on the 1st principal component with 50% overlap) | 2-skeletons of Reeb Diagram of MNIST (20 intervals on the 1st principal component with 50% overlap)\n:--------------------------------------------------------------------:|:-------------------------------------------------------------------:\n60k images clustered in 784 dimensions without any projection loss | 60k images clustered in 784 dimensions without any projection loss\n![](./data/mnist/mnist-k20-s40-l0.5-c0.5-i0.05.png)           | ![](./data/mnist/mnist-k20-s20-l0.5-c0.5-i0.05.png)\n\n# Requirements\n\nThis library requires Spark 2.0+.\n\n# Building and Running Unit Tests\n\nTo compile this project, run `sbt package` from the project home directory. This will also run the Scala unit tests.\nTo run the unit tests, run `sbt test` from the project home directory. This project uses the\n[sbt-spark-package](https://github.com/databricks/sbt-spark-package) plugin, which provides the 'spPublish' and\n'spPublishLocal' tasks. 
We recommend using this library with Apache Spark by\nsupplying a comma-delimited list of Maven coordinates with `--packages`, which pulls in the dependencies and downloads the package from the local\nrepository or the official [Spark Packages](https://spark-packages.org/package/ognis1205/spark-tda) repository.\n\n### The package can be published locally with:\n\n```bash\n$ sbt spPublishLocal\n```\n\n### The package can be published to [Spark Packages](https://spark-packages.org/package/ognis1205/spark-tda) with (requires authentication and authorization):\n\n```bash\n$ sbt spPublish\n```\n\n# Using with Spark Shell\n\nThis package can be added to Spark using the `--packages` command line option. For example, to include it when starting\nthe Spark shell:\n\n```bash\n$ spark-shell --packages ognis1205:spark-tda:0.0.1-SNAPSHOT-spark2.2-s_2.11\n```\n\n# Future Work\n\n### Mapper\n\n- [ ] Write Wiki\n- [ ] Implement Python APIs\n- [ ] Publish to [Spark Packages](https://spark-packages.org/package/ognis1205/spark-tda)\n- [ ] Benchmark\n- [ ] Consider using [GraphFrames](https://github.com/graphframes/graphframes) instead of plain GraphX\n- [ ] Implement some useful filter functions, e.g., Gaussian Density, Graph Laplacian, etc., as transformers\n\n# Related Software \u0026 Projects\n\n1. [Python Mapper](http://danifold.net/mapper/index.html)\n2. [TDAmapper (R)](https://github.com/paultpearson/TDAmapper/)\n3. [Spark Mapper (Spark)](https://github.com/log0ymxm/spark-mapper)\n4. [KeplerMapper (Python with GUI)](https://github.com/MLWave/kepler-mapper)\n\n# References\n\n### Mapper\n\n1. [G. Singh, F. Memoli, G. Carlsson (2007). Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition, Point Based Graphics 2007, Prague, September 2007.](https://research.math.osu.edu/tgda/mapperPBG.pdf)\n2. [J. Curry (2013). Sheaves, Cosheaves and Applications, arXiv 2013](https://arxiv.org/abs/1303.3255)\n3. [T. K. Dey, F. Memoli, Y. 
Wang (2015). Multiscale Mapper: A Framework for Topological Summarization of Data and Maps, arXiv 2015](https://arxiv.org/abs/1504.03763)\n4. [E. Munch, B. Wang (2015). Convergence between Categorical Representations of Reeb Space and Mapper, arXiv 2015](https://arxiv.org/abs/1512.04108)\n5. [E. Munch, B. Wang (2015). Reeb Space Approximation with Guarantees, The 25th Fall Workshop on Computational Geometry 2015.](https://www.cse.buffalo.edu/fwcg2015/assets/pdf/FWCG_2015_paper_2.pdf)\n6. [H. E. Kim (2015). Evaluating Ayasdi's Topological Data Analysis for Big Data, Master's Thesis, Goethe University Frankfurt 2015.](http://www.bigdata.uni-frankfurt.de/wp-content/uploads/2015/10/Evaluating-Ayasdi’s-Topological-Data-Analysis-For-Big-Data_HKim2015.pdf)\n\n### KNN/ANN/SNN\n\n1. [L. Ting, et al. (2004). An investigation of practical approximate nearest neighbor algorithms, Advances in neural information processing systems. 2004.](http://www.cs.cmu.edu/~agray/approxnn.pdf)\n2. [L. Ting, C. Rosenberg, H. Rowley (2007). Clustering billions of images with large scale nearest neighbor search. Applications of Computer Vision, 2007. WACV'07. IEEE Workshop on. IEEE, 2007.](https://ieeexplore.ieee.org/document/4118757/)\n3. [D. Ravichandran, P. Pantel, E. Hovy (2005). Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering, ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 622-629](https://dl.acm.org/citation.cfm?id=1219917)\n4. [M. Steinbach, L. Ertoez, V. Kumar (2004). The Challenges of Clustering High Dimensional Data, New Directions in Statistical Physics, pp. 273-309](https://www-users.cs.umn.edu/~kumar001/papers/high_dim_clustering_19.pdf)\n5. [L. Ertoez, M. Steinbach, V. Kumar (2003). 
Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data, Proceedings of the Third SIAM International Conference on Data Mining, 2003.](https://www-users.cs.umn.edu/~kumar001/papers/SIAM_snn.pdf)\n6. [M. E. Houle, H. P. Kriegel, P. Kroeger, E. S. A. Zimek (2010). Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?, Proceedings of the 22nd International Conference on Scientific and Statistical Database Management, 2010.](https://imada.sdu.dk/~zimek/publications/SSDBM2010/SNN-SSDBM2010-preprint.pdf)\n\n### LSH\n\n1. [M. S. Charikar (2002). Similarity Estimation Techniques from Rounding Algorithms, 34th STOC, 2002.](http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fognis1205%2Fspark-tda","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fognis1205%2Fspark-tda","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fognis1205%2Fspark-tda/lists"}