# bhtsne

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Gethseman](https://circleci.com/gh/frjnn/bhtsne.svg?style=shield)](https://app.circleci.com/pipelines/github/frjnn/bhtsne)
[![codecov](https://codecov.io/gh/frjnn/bhtsne/branch/master/graph/badge.svg)](https://codecov.io/gh/frjnn/bhtsne)

Parallel Barnes-Hut and exact implementations of the t-SNE algorithm written in Rust. The tree-accelerated version of the algorithm is described in detail in [this paper](http://lvdmaaten.github.io/publications/papers/JMLR_2014.pdf) by [Laurens van der Maaten](https://github.com/lvdmaaten). The exact, original version of the algorithm is described in [this other paper](https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf) by [G. Hinton](https://www.cs.toronto.edu/~hinton/) and Laurens van der Maaten.
Additional implementations of the algorithm, including this one, are listed on [this page](http://lvdmaaten.github.io/tsne/).
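
For reference, both variants fit the low-dimensional points by minimizing the Kullback-Leibler divergence between the input affinities `P` and the embedding affinities `Q`; the Barnes-Hut variant additionally summarizes distant groups of points with single cells of a space-partitioning tree, and the `THETA` parameter used in the example below controls how aggressively that summarization is applied. A brief sketch in the notation of the linked papers:

```latex
% Objective minimized by both variants of t-SNE:
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},
\qquad
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Barnes-Hut approximation: a tree cell with radius r_cell and centre of mass
% y_cell stands in for all of its points in the repulsive forces whenever
\frac{r_{\mathrm{cell}}}{\lVert y_i - y_{\mathrm{cell}} \rVert} < \theta
```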

## Installation

Add this line to your `Cargo.toml`:
```toml
[dependencies]
bhtsne = "0.5.2"
```
### Documentation

The API documentation is available [here](https://docs.rs/bhtsne).

### Example

The implementation supports custom data types and user-defined metrics. For instance, general vector data can be handled in the following way.

```rust
use bhtsne;

const N: usize = 150;         // Number of vectors to embed.
const D: usize = 4;           // Dimensionality of the original space.
const THETA: f32 = 0.5;       // Parameter used by the Barnes-Hut algorithm.
                              // Small values improve accuracy but increase complexity.
const PERPLEXITY: f32 = 10.0; // Perplexity of the conditional distribution.
const EPOCHS: usize = 2000;   // Number of fitting iterations.
const NO_DIMS: u8 = 2;        // Dimensionality of the embedded space.

// Loads the data from a csv file, skipping the first row,
// treating it as headers, and skipping the 5th column,
// treating it as a class label.
// Do note that you can also switch to f64s for higher precision.
let data: Vec<f32> = bhtsne::load_csv("iris.csv", true, Some(&[4]), |float| {
    float.parse().unwrap()
})?;
let samples: Vec<&[f32]> = data.chunks(D).collect();

// Executes the Barnes-Hut approximation of the algorithm and writes
// the embedding to the specified csv file.
bhtsne::tSNE::new(&samples)
    .embedding_dim(NO_DIMS)
    .perplexity(PERPLEXITY)
    .epochs(EPOCHS)
    .barnes_hut(THETA, |sample_a, sample_b| {
        sample_a
            .iter()
            .zip(sample_b.iter())
            .map(|(a, b)| (a - b).powi(2))
            .sum::<f32>()
            .sqrt()
    })
    .write_csv("iris_embedding.csv")?;
```

In the example the Euclidean distance is used, but any other distance metric on a data type of choice, such as strings, can be defined and plugged in, as sketched below.
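
The following is only an illustration, not an excerpt from the crate's documentation: the word list, the mismatch-counting metric, and the parameter values are hypothetical, and the sketch assumes the builder accepts any sample type together with a user-supplied distance, as described above.

```rust
use bhtsne;

// Hypothetical string samples (tokens, identifiers, short sequences, ...).
// Illustrative values throughout; real data needs far more samples.
let words: Vec<&str> = vec!["kitten", "sitting", "mitten", "fitting", "bitten"];

bhtsne::tSNE::new(&words)
    .embedding_dim(2)
    .perplexity(1.0)
    .epochs(1000)
    .barnes_hut(0.5, |a, b| {
        // Hypothetical metric: positions at which the two strings differ,
        // plus the difference in length. Any symmetric, non-negative
        // distance would do here.
        let mismatches = a.chars().zip(b.chars()).filter(|(x, y)| x != y).count();
        (mismatches + a.len().abs_diff(b.len())) as f32
    })
    .write_csv("words_embedding.csv")?;
```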

## Parallelism
Being built on [rayon](https://github.com/rayon-rs/rayon), the algorithm uses as many threads as there are available CPUs. Do note that on systems with hyperthreading enabled this is the number of logical cores, not the number of physical ones. See [rayon's FAQs](https://github.com/rayon-rs/rayon/blob/master/FAQ.md) for additional information.
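
If you want to cap the number of worker threads, for example on a shared machine, rayon's global thread pool can be configured before calling into `bhtsne`. This is a minimal sketch; it assumes `rayon` is added as a direct dependency of your crate, and the thread count of 4 is arbitrary:

```rust
fn main() {
    // Configure the global rayon pool once, before any bhtsne call; subsequent
    // parallel work (including bhtsne's) will use at most this many threads.
    rayon::ThreadPoolBuilder::new()
        .num_threads(4) // Arbitrary cap; defaults to the number of logical cores.
        .build_global()
        .expect("the global thread pool can only be initialised once");

    // ... run the t-SNE fitting as shown in the example above ...
}
```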

## MNIST embedding
The following embedding was obtained by preprocessing the [MNIST](https://git-disl.github.io/GTDLBench/datasets/mnist_datasets/) training set with PCA, reducing its
dimensionality to 50. It took approximately **3 minutes and 6 seconds** on a 2.0GHz quad-core 10th-generation i5 MacBook Pro.
![mnist](imgs/mnist_embedding.png)