https://github.com/eugenehp/fast-umap

UMAP (Uniform Manifold Approximation and Projection) in Rust
Host: GitHub
URL: https://github.com/eugenehp/fast-umap
Owner: eugenehp
License: mit
Created: 2024-12-15T04:02:29.000Z (19 days ago)
Default Branch: master
Last Pushed: 2024-12-15T04:23:21.000Z (19 days ago)
Last Synced: 2024-12-15T04:35:52.323Z (19 days ago)
Topics: dimensionality-reduction, gpu, rust, umap
Language: Rust
Homepage: https://crates.io/crates/fast-umap
Size: 0 Bytes
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# fast-umap

UMAP (Uniform Manifold Approximation and Projection) in Rust

This repository contains a Rust implementation of **UMAP** (Uniform Manifold Approximation and Projection), a dimensionality reduction algorithm that preserves both the local and global structure of data. UMAP is widely used for visualizing high-dimensional data in 2D or 3D space.

This implementation leverages the [burn](https://github.com/tracel-ai/burn) machine learning framework, which provides automatic differentiation and GPU support, allowing you to train and apply UMAP models on high-dimensional datasets efficiently.

## Features

- **Dimensionality Reduction**: Reduces high-dimensional data to a lower-dimensional space (e.g., 2D or 3D) for visualization or further analysis.

- **Customizable UMAP Model**: The model architecture can be configured with different numbers of input features, hidden layer sizes, and output dimensions.

- **GPU Support**: Powered by the `burn` framework with support for training on CPU and GPU using the `wgpu` backend.

- **Flexible Data Handling**: Functions for converting between vectors and tensors, generating synthetic data, and more.

## Roadmap

- [ ] Add a `k-neighbors` parameter and a precomputed KNN tree to speed up calculations on high-dimensional data.

- [ ] Add MNIST dataset example

- [ ] Add [different distance functions](https://github.com/eugenehp/fast-distances), not just Euclidean

- [ ] Create a test bench to compare different hyperparameters (`patience` vs `n_features` vs `epochs` etc.)

## Installation

```shell

cargo add fast-umap

```

## Example Usage

### 1. Fitting a UMAP Model

You can fit a UMAP model to your data using the `umap` function exported by the prelude. Here's how to do it:

```rust
use burn::backend::Autodiff;
use burn::backend::wgpu::{Wgpu, WgpuDevice};
use fast_umap::prelude::*;

fn main() {
    // Example data (list of samples, each with a list of features)
    let data: Vec<Vec<f64>> = vec![
        vec![1.0, 2.0, 3.0],
        vec![4.0, 5.0, 6.0],
        vec![7.0, 8.0, 9.0],
        // Add more samples...
    ];

    // Fit the UMAP model (clone so that `data` can be reused below)
    let model = umap(data.clone());

    // You can now use the model to transform data
    let transformed = model.transform(data);

    // Print the transformed data (low-dimensional representation)
    for sample in transformed {
        println!("{:?}", sample);
    }
}
```

### 2. Transforming Data

Once the UMAP model is trained, you can transform new high-dimensional data into its low-dimensional representation:

```rust

let transformed_data = model.transform(new_data);

```

The `transform` method takes `new_data` in the form of `Vec<Vec<f64>>` and returns its 2D or 3D representation, suitable for visualization.
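The advanced example further down passes `num_samples` and `num_features` alongside the data to `convert_vector_to_tensor`, which suggests a flat, row-major buffer at the tensor boundary. As a rough sketch of that flattening step using only the standard library (the `flatten_rows` helper is hypothetical and not part of `fast-umap`):

```rust
/// Hypothetical helper (not part of fast-umap): flattens nested rows into a
/// row-major buffer and reports the shape as (num_samples, num_features).
/// Returns `None` if the input is empty or the rows have different lengths.
fn flatten_rows(rows: &[Vec<f64>]) -> Option<(Vec<f64>, usize, usize)> {
    // The first row fixes the expected number of features.
    let num_features = rows.first()?.len();
    if rows.iter().any(|row| row.len() != num_features) {
        return None;
    }
    // Concatenate the rows one after another.
    let flat: Vec<f64> = rows.iter().flatten().copied().collect();
    Some((flat, rows.len(), num_features))
}
```

Checking the row lengths up front makes a ragged dataset fail early, rather than producing a tensor with a misleading shape.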

### 3. Generating Test Data

You can generate synthetic test data to experiment with the UMAP model using the `generate_test_data` function:

```rust

let data = generate_test_data(100, 50); // 100 samples, each with 50 features

```
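The internals of `generate_test_data` are not shown here. If you want reproducible synthetic data without extra dependencies, something along these lines could serve as a stand-in. This is only a sketch: `make_test_data`, its `seed` argument, and the flat row-major layout are assumptions for illustration, not the crate's API.

```rust
/// Hypothetical stand-in for a synthetic-data helper (not the crate's code).
/// Produces `num_samples * num_features` pseudo-random values in [0, 1),
/// laid out row by row, using a small xorshift generator so that the same
/// seed always yields the same data.
fn make_test_data(num_samples: usize, num_features: usize, seed: u64) -> Vec<f64> {
    // xorshift64 must not start from zero, so force a non-zero state.
    let mut state = seed | 1;
    (0..num_samples * num_features)
        .map(|_| {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            // Keep the top 53 bits and scale them into [0, 1).
            (state >> 11) as f64 / (1u64 << 53) as f64
        })
        .collect()
}
```

Calling `make_test_data(100, 50, 9999)` returns 5000 values, and repeating the call with the same seed returns the same values. Uniform noise like this has no cluster structure, so it is mainly useful for smoke-testing the pipeline.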

### 4. Visualizing Data

After transforming the data to a 2D or 3D space, you can use external charting libraries (e.g., `plotters` in Rust or `matplotlib` in Python) to visualize the results.
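One simple way to hand the embedding to an external tool is to export it as CSV. The sketch below assumes the transformed output is a list of 2D points; `embedding_to_csv` is a hypothetical helper, not something the crate provides.

```rust
/// Hypothetical helper (not part of fast-umap): renders 2D points as CSV text
/// with an `x,y` header, one point per line. Rows that do not have exactly
/// two coordinates are skipped.
fn embedding_to_csv(points: &[Vec<f64>]) -> String {
    let mut out = String::from("x,y\n");
    for point in points {
        // Only accept rows that are exactly [x, y].
        if let [x, y] = point.as_slice() {
            out.push_str(&format!("{x},{y}\n"));
        }
    }
    out
}
```

The resulting string can be written to disk with `std::fs::write` and then loaded by `matplotlib`, a spreadsheet, or any other plotting tool.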

## Model Configuration

The UMAP model configuration is customizable through the `UMAPModelConfigBuilder`. You can set the following parameters:

- **input_size**: Number of input features (i.e., the dimensionality of the data).

- **hidden_size**: The number of neurons in the hidden layers.

- **output_size**: The target number of dimensions (typically 2 or 3 for visualization).

Here's how to configure and build the model:

```rust

let model_config = UMAPModelConfigBuilder::default()

    .input_size(50)      // Input features: 50 dimensions

    .hidden_size(100)    // Hidden layer size: 100 neurons

    .output_size(2)      // Output size: 2 (for 2D visualization)

    .build()

    .unwrap();

```
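To see how the three sizes relate, here is a plain-Rust sketch of a forward pass through a network with one hidden layer. The architecture (a single ReLU hidden layer) and the `dense` and `forward` functions are assumptions for illustration only; the crate's actual model may be structured differently.

```rust
/// One fully connected layer: `weights` is row-major with shape
/// (outputs x inputs) and `bias` has one entry per output.
fn dense(input: &[f64], weights: &[f64], bias: &[f64]) -> Vec<f64> {
    let inputs = input.len();
    bias.iter()
        .enumerate()
        .map(|(o, b)| {
            // Row `o` of the weight matrix holds the weights for output `o`.
            let row = &weights[o * inputs..(o + 1) * inputs];
            b + row.iter().zip(input).map(|(w, x)| w * x).sum::<f64>()
        })
        .collect()
}

/// Assumed architecture for illustration only: input -> hidden (ReLU) -> output.
/// `w1` has shape (hidden_size x input_size), `w2` has (output_size x hidden_size).
fn forward(x: &[f64], w1: &[f64], b1: &[f64], w2: &[f64], b2: &[f64]) -> Vec<f64> {
    let hidden: Vec<f64> = dense(x, w1, b1).into_iter().map(|h| h.max(0.0)).collect();
    dense(&hidden, w2, b2)
}
```

Under this assumption, `input_size` must match the number of features per sample, `hidden_size` controls the capacity of the intermediate representation, and the returned vector has `output_size` entries.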

## Training the UMAP Model

You can train the UMAP model on your dataset by calling the `fit` method on the `UMAP` struct. The training process will optimize the model's weights to reduce the data's dimensionality.

Training configuration parameters include:

- **epochs**: The number of epochs to train the model.

- **batch_size**: The number of samples per training batch.

- **learning_rate**: The learning rate for gradient descent.

- **beta1**, **beta2**: Optimizer hyperparameters for Adam optimization.

For example:

```rust
let model = UMAP::<Autodiff<Wgpu>>::fit(data, WgpuDevice::default());
```
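For readers unfamiliar with `beta1` and `beta2`, the sketch below shows the textbook Adam update for a single scalar parameter. It is not code from `burn` or `fast-umap`; `AdamState` and `adam_step` are illustrative names, and the epsilon value is an assumed common default.

```rust
/// Running state for one scalar parameter under Adam.
struct AdamState {
    m: f64, // first-moment estimate (running mean of gradients)
    v: f64, // second-moment estimate (running mean of squared gradients)
    t: i32, // number of steps taken so far
}

/// Applies one textbook Adam update to `param` for gradient `grad`
/// and returns the new parameter value.
fn adam_step(param: f64, grad: f64, state: &mut AdamState, lr: f64, beta1: f64, beta2: f64) -> f64 {
    const EPS: f64 = 1e-8; // assumed default; avoids division by zero
    state.t += 1;
    state.m = beta1 * state.m + (1.0 - beta1) * grad;
    state.v = beta2 * state.v + (1.0 - beta2) * grad * grad;
    // Bias correction compensates for m and v starting at zero.
    let m_hat = state.m / (1.0 - beta1.powi(state.t));
    let v_hat = state.v / (1.0 - beta2.powi(state.t));
    param - lr * m_hat / (v_hat.sqrt() + EPS)
}
```

`beta1` controls how quickly the gradient average forgets old gradients, and `beta2` does the same for the squared-gradient average that scales the step size. On the very first step the bias-corrected update has magnitude close to `learning_rate`, whatever the size of the gradient.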

## Examples

### Simple

```shell

cargo run --example simple

```

Sample code:

```rust
use fast_umap::prelude::*;
use rand::Rng;

fn main() {
    // Number of samples in the dataset
    let num_samples = 100;

    // Number of features (dimensions) for each sample
    let num_features = 3;

    // Create a random number generator for generating random values
    let mut rng = rand::thread_rng();

    // Generate a dataset of random values with `num_samples` rows and `num_features` columns
    let data: Vec<Vec<f64>> = (0..num_samples * num_features)
        .map(|_| rng.gen::<f64>()) // Random number generation for each feature
        .collect::<Vec<f64>>() // Collect all random values into a vector
        .chunks_exact(num_features) // Chunk the vector into rows of length `num_features`
        .map(|chunk| chunk.to_vec()) // Convert each chunk into a Vec<f64>
        .collect(); // Collect the rows into a vector of vectors

    // Fit the UMAP model to the data and reduce the data to a lower-dimensional space (default: 2D)
    let umap = umap(data.clone());

    // Transform the data using the trained UMAP model to reduce its dimensions
    let reduced_dimensions_vector = umap.transform(data.clone());

    // Visualize the reduced dimensions as a vector
    chart_vector(reduced_dimensions_vector, None);

    // Optionally, you can also visualize the reduced dimensions as a tensor
    // let reduced_dimensions_tensor = umap.transform_to_tensor(data.clone());
    // print_tensor_with_title("reduced_dimensions", &reduced_dimensions_tensor);
    // chart_tensor(reduced_dimensions_tensor, None);
}
```

Generates this plot:

![plot](./docs/plot.png)

### Advanced

```shell

cargo run --example advanced

```

Sample code:

```rust
use burn::module::AutodiffModule;
#[allow(unused_imports)]
use burn::{
    backend::{Autodiff, Wgpu},
    prelude::Backend,
    prelude::*,
    tensor::{Device, Tensor},
};
use fast_umap::{
    chart,
    model::{UMAPModel, UMAPModelConfigBuilder},
    train::{train, TrainingConfig},
    utils::*,
};

fn main() {
    // Define the custom backend type using Wgpu with specific precision (f32) and integer type (i32)
    type MyBackend = Wgpu<f32, i32>;

    // Define the AutodiffBackend based on the custom MyBackend type
    type MyAutodiffBackend = Autodiff<MyBackend>;

    // Initialize the device (GPU) for computation
    let device = burn::backend::wgpu::WgpuDevice::default();

    // Set the training parameters
    let batch_size = 1; // Batch size for training
    let num_samples = 1000; // Number of samples in the dataset
    let num_features = 100; // Number of features (dimensions) for each sample
    let output_size = 2; // Number of output dimensions (e.g., 2 for 2D embeddings)
    let hidden_size = 100; // Size of the hidden layer in the neural network
    let learning_rate = 0.001; // Learning rate for optimization
    let beta1 = 0.9; // Beta1 parameter for Adam optimizer
    let beta2 = 0.999; // Beta2 parameter for Adam optimizer
    let epochs = 400; // Number of training epochs
    let seed = 9999; // Random seed for reproducibility
    let verbose = true; // Enables the progress bar for training
    let patience = 30; // Number of epochs with no improvement before stopping early

    // Seed the random number generator for reproducibility
    MyBackend::seed(seed);

    // Generate random test data for training
    let train_data = generate_test_data(num_samples, num_features);

    // Configure the model by setting input size, hidden size, and output size
    let model_config = UMAPModelConfigBuilder::default()
        .input_size(num_features)
        .hidden_size(hidden_size)
        .output_size(output_size)
        .build()
        .unwrap();

    // Initialize the UMAP model with the specified configuration and device
    let model: UMAPModel<MyAutodiffBackend> = UMAPModel::new(&model_config, &device);

    // Set up the training configuration with the specified parameters
    let config = TrainingConfig::<MyAutodiffBackend>::builder()
        .with_epochs(epochs) // Set the number of training epochs
        .with_batch_size(batch_size) // Set the batch size
        .with_learning_rate(learning_rate) // Set the learning rate
        .with_device(device) // Set the computation device (GPU)
        .with_beta1(beta1) // Set the beta1 parameter for Adam optimizer
        .with_beta2(beta2) // Set the beta2 parameter for Adam optimizer
        .with_verbose(verbose) // Enable or disable the progress bar
        .with_patience(patience) // Set the patience for early stopping
        .build()
        .expect("Failed to build TrainingConfig");

    // Start training the model with the training data and configuration
    let model = train::<MyAutodiffBackend>(
        model,              // The model to train
        num_samples,        // Number of samples in the dataset
        num_features,       // Number of features per sample
        train_data.clone(), // The training data
        &config,            // The training configuration
    );

    // Convert the trained model to its non-autodiff form for inference
    let model = model.valid();

    // Convert the training data into a tensor for input to the model
    let global = convert_vector_to_tensor(train_data, num_samples, num_features, &config.device);

    // Perform the forward pass to get the low-dimensional (local) representation
    let local = model.forward(global.clone());

    // Optionally, print the global and local tensors for inspection (commented-out for now)
    // if verbose {
    //     print_tensor_with_title("global", &global);
    //     print_tensor_with_title("local", &local);
    // }

    // Visualize the reduced dimensions (2D embedding) using a chart
    chart::chart_tensor(local, None);
}
```

It also generates a 2D plot and a loss chart:

![loss](./docs/losses.png)
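The `patience` setting in the advanced example is described as the number of epochs with no improvement before stopping early. A minimal sketch of such a rule is shown below. The function `early_stop_epoch` is hypothetical, and the crate's exact criterion (for example, whether it requires a minimum improvement) may differ.

```rust
/// Returns the (0-based) epoch at which training would stop under a simple
/// patience rule, or `None` if the loss history ends before the rule triggers.
fn early_stop_epoch(losses: &[f64], patience: usize) -> Option<usize> {
    let mut best = f64::INFINITY;
    let mut stale = 0; // epochs since the best loss last improved
    for (epoch, &loss) in losses.iter().enumerate() {
        if loss < best {
            // New best loss: reset the counter.
            best = loss;
            stale = 0;
        } else {
            stale += 1;
            if stale >= patience {
                return Some(epoch);
            }
        }
    }
    None
}
```

With a small `patience`, training may stop on a temporary plateau; a larger value tolerates longer stretches without improvement at the cost of extra epochs.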

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Copyright

2024, Eugene Hauptmann

## Thank you

Inspired by the original UMAP [paper](https://arxiv.org/abs/1802.03426).