https://github.com/pair-code/understanding-umap

Understanding the theory behind UMAP

Host: GitHub
URL: https://github.com/pair-code/understanding-umap
Owner: PAIR-code
License: apache-2.0
Created: 2019-11-04T16:14:30.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2024-11-20T12:18:11.000Z (about 1 year ago)
Last Synced: 2025-06-09T03:05:07.179Z (8 months ago)
Language: JavaScript
Homepage: https://pair-code.github.io/understanding-umap
Size: 33.1 MB
Stars: 173
Watchers: 3
Forks: 25
Open Issues: 7
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

# Understanding UMAP

Dimensionality reduction is a powerful tool for machine learning practitioners to visualize and understand large, high-dimensional datasets. One of the most widely used techniques for visualization is [t-SNE](https://lvdmaaten.github.io/tsne/), but its performance suffers with large datasets and using it correctly can be [challenging](https://distill.pub/2016/misread-tsne/).

[UMAP](https://github.com/lmcinnes/umap) is a new technique by McInnes et al. that offers a number of advantages over t-SNE, most notably increased speed and better preservation of the data's global structure. In this article, we'll take a look at the theory behind UMAP in order to better understand how the algorithm works, how to use it effectively, and how its performance compares with t-SNE.

#### Local development

```bash
# Install dependencies
yarn
# Run the `dev` script
yarn dev
```

#### Publishing to GitHub Pages

```bash
yarn pub
```

#### To develop figures individually

```bash
yarn dev:cech
yarn dev:hyperparameters
yarn dev:mammoth-umap
yarn dev:mammoth-tsne
yarn dev:supplement
yarn dev:toy
yarn dev:toy_comparison
```

#### Data preprocessing

For the mammoth figures, the [raw 3D data](https://github.com/MNoichl/UMAP-examples-mammoth-/blob/master/mammoth_a.csv) was downsampled to 50,000 points before being projected with UMAP and t-SNE. These 50,000 points were then randomly subsampled to 10,000 points to minimize the payload size.
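The random subsampling step can be illustrated with a short sketch. This is not the repository's preprocessing code: `randomSubsample` is a hypothetical helper, the input is synthetic stand-in data, and a partial Fisher-Yates shuffle is just one reasonable way to draw points uniformly without replacement.

```javascript
// Hypothetical helper (not from the repository): pick k items uniformly at
// random, without replacement, using a partial Fisher-Yates shuffle on a copy.
function randomSubsample(points, k, random = Math.random) {
  if (k > points.length) {
    throw new RangeError("k must not exceed the number of points");
  }
  const pool = points.slice();
  for (let i = 0; i < k; i++) {
    // Choose an index in [i, pool.length) and swap that item into position i.
    const j = i + Math.floor(random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, k);
}

// Stand-in data: 50,000 synthetic 3D points in place of the real mammoth data.
const points = Array.from({ length: 50000 }, (_, i) => [i, i * 2, i * 3]);
const subsample = randomSubsample(points, 10000);
```

Copying the input keeps the original 50,000 points intact, and stopping the shuffle after `k` swaps avoids shuffling the whole array.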

_Understanding UMAP_ uses a few tricks to make the data payloads for some of the interactive figures small enough to download in a reasonable time. The `mammoth` figures use a 10-bit encoding scheme to compress the 10,000 data points into a significantly smaller payload. The `hyperparameters` and `toy_comparison` figures precompute UMAP embeddings for all of their different combinations, then use the same 10-bit encoding scheme to compress the data.

```bash
yarn preprocess:hyperparameters
yarn preprocess:mammoth
yarn preprocess:toy_comparison
```
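To give a feel for how a 10-bit encoding can shrink a coordinate payload, here is a minimal sketch. It is an assumption about the general idea, not the scheme used in this repository: `encode10Bit` and `decode10Bit` are hypothetical names, and the min/max normalization and bit-packing layout are illustrative choices.

```javascript
const BITS = 10;
const MAX_LEVEL = (1 << BITS) - 1; // 1023, the largest 10-bit value

// Hypothetical encoder: quantize each value to an integer in 0..1023 relative
// to the array's [min, max] range, then pack the 10-bit integers back to back.
function encode10Bit(values) {
  let min = Infinity;
  let max = -Infinity;
  for (const v of values) {
    if (v < min) min = v;
    if (v > max) max = v;
  }
  const range = max - min || 1; // avoid dividing by zero for constant input
  const bytes = new Uint8Array(Math.ceil((values.length * BITS) / 8));
  let bitPos = 0;
  for (const v of values) {
    const level = Math.round(((v - min) / range) * MAX_LEVEL);
    // Write the 10 bits of `level`, most significant bit first.
    for (let b = BITS - 1; b >= 0; b--) {
      if ((level >> b) & 1) {
        bytes[bitPos >> 3] |= 0x80 >> (bitPos & 7);
      }
      bitPos++;
    }
  }
  return { bytes, min, max, count: values.length };
}

// Hypothetical decoder: read the 10-bit integers back and rescale to [min, max].
function decode10Bit({ bytes, min, max, count }) {
  const range = max - min || 1;
  const out = new Float64Array(count);
  let bitPos = 0;
  for (let i = 0; i < count; i++) {
    let level = 0;
    for (let b = 0; b < BITS; b++) {
      const bit = (bytes[bitPos >> 3] >> (7 - (bitPos & 7))) & 1;
      level = (level << 1) | bit;
      bitPos++;
    }
    out[i] = min + (level / MAX_LEVEL) * range;
  }
  return out;
}

// Round trip on synthetic values standing in for embedding coordinates.
const values = Array.from({ length: 1000 }, (_, i) => Math.sin(i));
const encoded = encode10Bit(values);
const decoded = decode10Bit(encoded);
```

Under this sketch, 10,000 2D points need 10,000 × 2 × 10 bits = 25,000 bytes, compared with 80,000 bytes as 32-bit floats. The cost is a quantization error of at most half a step, that is, `(max - min) / 2046` per coordinate.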
