Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/viodotcom/ppca_rs

Python+Rust implementation of the Probabilistic Principal Component Analysis model
https://github.com/viodotcom/ppca_rs

data-science dimensionality-reduction em-algorithm linear-algebra machine-learning machine-learning-algorithms maximum-likelihood maximum-likelihood-estimation missing-data missing-values pca pca-analysis python rust

Last synced: 3 months ago
JSON representation

Python+Rust implementation of the Probabilistic Principal Component Analysis model

Awesome Lists containing this project

README

        

# _Probabilistic_ Principal Component Analysis (PPCA) model

[![PyPI version](https://badge.fury.io/py/ppca-rs.svg)](https://badge.fury.io/py/ppca-rs)
[![Crates.io version](https://img.shields.io/crates/v/ppca)](https://crates.io/crates/ppca)
[![Docs.rs version](https://img.shields.io/docsrs/ppca)](https://docs.rs/ppca)

This project implements a PPCA model implemented in Rust for Python using `pyO3` and `maturin`.

## Installing

This package is available in PyPI!
```bash
pip install ppca-rs
```

And you can also use it natively in Rust:
```bash
cargo add ppca
```

## Why use PPCA?

Glad you asked!

* The PPCA is a simples extension of the PCA (principal component analysis), but can be overall more robust to train.
* The PPCA is a _proper statistical model_. It doesn't spit out only the mean. You get standard deviations, covariances, and all the goodies that come from thre realm of probability and statistics.
* The PPCA model can handle _missing values_. If there is data missing from your dataset, it can extrapolate it with reasonable values and even give you a confidence interval.
* The training converges quickly and will always tend to a global maxima. No metaparameters to dabble with and no local maxima.

## Why use `ppca-rs`?

That's an easy one!

* It's written in Rust, with only a bit of Python glue on top. You can expect a performance in the same leage as of C code.
* It uses `rayon` to paralellize computations evenly across as many CPUs as you have.
* It also uses fancy Linear Algebra Trickery Technology to reduce computational complexity in key bottlenecks.
* Battle-tested at Vio.com with some ridiculously huge datasets.

## Quick example

```python
import numpy as np
from ppca_rs import Dataset, PPCATrainer, PPCA

samples: np.ndarray

# Create your dataset from a rank 2 np.ndarray, where each line is a sample.
# Use non-finite values (`inf`s and `nan`) to signal masked values
dataset = Dataset(samples)

# Train the model (convenient edition!):
model: PPCAModel = PPCATrainer(dataset).train(state_size=10, n_iters=10)

# And now, here is a free sample of what you can do:

# Extrapolates the missing values with the most probable values:
extrapolated: Dataset = model.extrapolate(dataset)

# Smooths (removes noise from) samples and fills in missing values:
extrapolated: Dataset = model.filter_extrapolate(dataset)

# ... go back to numpy:
eextrapolated_np = extrapolated.numpy()

```

## Juicy extras!

* Tired of the linear? We have support for PPCA mixture models. Make the most of your data with clustering and dimensionality reduction in a single tool!
* Support for adaptation of DataFrames using either `pandas` or `polars`. Never juggle those `df`s in your code again.

## Building from soure

### Prerequisites

You will need [Rust](https://rust-lang.org/), which can be installed locally (i.e., without `sudo`) and you will also need `maturin`, which can be installed by
```bash
pip install maturin
```
`pipenv` is also a good idea if you are going to mess around with it locally. At least, you need a `venv` set, otherwise, `maturin` will complain with you.

### Installing it locally

Check the `Makefile` for the available commands (or just type `make`). To install it locally, do
```bash
make install # optional: i=python.version (e.g, `i=3.9`)
```

### Messing around and testing

To mess around, _inside a virtual environment_ (a `Pipfile` is provided for the `pipenv` lovers), do
```bash
maturin develop # use the flag --release to unlock superspeed!
```
This will install the package locally _as is_ from source.

## How do I use this stuff?

See the examples in the `examples` folder. Also, all functions are type hinted and commented. If you are using `pylance` or `mypy`, it should be easy to navigate.

## Is it faster than the pure Python implemetation you made?

You bet!