https://github.com/zachgrayio/rank1_covariance

incremental (rank-1) covariance matrix calculation on the GPU

Host: GitHub
URL: https://github.com/zachgrayio/rank1_covariance
Owner: zachgrayio
Created: 2024-08-09T07:46:27.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-08-20T16:13:31.000Z (almost 2 years ago)
Last Synced: 2025-03-30T01:36:59.690Z (over 1 year ago)
Language: Cuda
Homepage:
Size: 79.1 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# rank-1 covariance matrix calculation

This repo contains 2 simple programs that calculate covariance matrices for a set of inputs on the GPU.

Each program can compute the matrix either in the simple, traditional manner, or in an "incremental" ("rank-1") mode where _only_ the newest row is factored in as it arrives, leading to large performance gains over a traditional matmul-based approach.

- A Python reference implementation: [main.py](main.py)
  - PyTorch + CUDA if available
  - also uses `numpy`'s `.cov()` for a simple sanity check
- A CUDA C++ binary: [main.cu](main.cu)
  - includes a custom rank1 kernel
  - also a "1shot" covariance function that does roughly the same as `.cov` AFAIK
    - makes use of cuBLAS's `cublasSgemm_v2` and a few small kernels to speed up taking col means and centering
  - test coverage to ensure the results are correct
  - a comparison function that compares timing of the 2 approaches
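
For orientation, the "1shot" path described above boils down to column means, centering, and a single matrix product. The following is a minimal NumPy sketch of that idea, not the repo's cuBLAS code; the function name `oneshot_cov` and the random test data are assumptions made up for illustration.

```python
import numpy as np


def oneshot_cov(data):
    """Sample covariance of the columns of `data` (rows are observations)."""
    n = data.shape[0]
    col_means = data.mean(axis=0)            # per-column mean
    centered = data - col_means              # subtract each column's mean
    return centered.T @ centered / (n - 1)   # one matmul, unbiased normalisation


rng = np.random.default_rng(0)
data = rng.standard_normal((100, 4))         # hypothetical input: 100 rows, 4 columns
cov = oneshot_cov(data)

# np.cov treats rows as variables by default, so pass rowvar=False
matches_numpy = np.allclose(cov, np.cov(data, rowvar=False))
print(matches_numpy)
```

The matrix product touches every row each time it runs, which is the cost the rank-1 mode avoids when rows arrive one at a time.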

## Results on my RTX 3080 16Gi

With the 1shot work factor set to 1, rank1 comes out roughly 3.9x faster (117.7 vs. 30.2 iterations/second):

```text
...
total time for rank1 update: 21.4057 seconds
total time for 1shot update: 83.4697 seconds
iterations per second for rank1 update: 117.7259 iterations/second
iterations per second for 1shot update: 30.1906 iterations/second
```

Note that for some domains the work factor for this function may be lower (closer to `0.5`); this is domain- and application-specific.

## Run it yourself

### Requirements

Requires NVIDIA hardware, of course, as well as a working CUDA setup. Edit `CMakeLists.txt` as needed if the build doesn't work with your setup. The Python code should be portable enough to run anywhere, as long as `torch` and `numpy` are installed.

### Running

To run everything:

```bash
make
```

For advanced usage, see the [Makefile](Makefile).

## Profiling

I haven't had a chance to profile this yet; there are certainly a few bottlenecks to chase down.

## WTF is Rank-1?

A rank-1 update in linear algebra refers to the process of modifying a matrix by adding or subtracting a matrix that can be expressed as the outer product of two vectors.

Specifically, if you have an existing matrix $A$, a rank-1 update to this matrix would look like:

$A' = A + u \cdot v^T$

where $u$ and $v$ are column vectors, and $v^T$ is the transpose of $v$. The product $u \cdot v^T$ results in a matrix of the same dimensions as $A$, but it is a rank-1 matrix because it is formed by the outer product of two vectors.

This operation is called a rank-1 update because it changes the rank of the matrix by at most 1.
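
To make that concrete, here is a tiny NumPy sketch. The vectors and the starting matrix are arbitrary values picked for illustration, not anything from the repo.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

outer = np.outer(u, v)                       # the 3x3 matrix u v^T
outer_rank = np.linalg.matrix_rank(outer)    # every row is a multiple of v

A = np.diag([1.0, 1.0, 0.0])                 # a rank-2 starting matrix
A_updated = A + outer                        # the rank-1 update A' = A + u v^T

rank_before = np.linalg.matrix_rank(A)
rank_after = np.linalg.matrix_rank(A_updated)
print(outer_rank, rank_before, rank_after)   # 1 2 3
```

Here the update happens to raise the rank from 2 to 3. With a different `A` it could just as well leave the rank unchanged or lower it by 1.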

### How that is applied in these 2 programs

In the context of calculating a covariance matrix, a rank-1 update is particularly useful when you have a sequence of data points arriving one at a time, and you want to update the covariance matrix incrementally without recalculating it from scratch.

Given a sample covariance matrix $\Sigma$ and a mean vector $\mu$, both based on the first $n-1$ samples, and a new data point $x_{\text{new}}$ (the $n$-th sample, so $n \geq 2$), the rank-1 update can be described as follows:

1. **Calculate the new mean**:

   $\mu_{\text{new}} = \mu + \frac{x_{\text{new}} - \mu}{n}$

2. **Compute the deviation vectors**:

   $\delta_{\text{old}} = x_{\text{new}} - \mu$

   $\delta_{\text{new}} = x_{\text{new}} - \mu_{\text{new}}$

3. **Update the covariance matrix**:

   $\Sigma_{\text{new}} = \frac{n-2}{n-1} \Sigma + \frac{\delta_{\text{old}} \cdot \delta_{\text{new}}^T}{n-1}$

This update rule efficiently incorporates the new data point into the existing covariance matrix $\Sigma$ by adjusting it with the outer product of the deviation vectors $\delta_{\text{old}}$ and $\delta_{\text{new}}$.
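
For anyone wondering where the $\frac{n-2}{n-1}$ and $\frac{1}{n-1}$ factors come from, here is a short derivation sketch. The scatter matrix $S_k$ is notation introduced here for illustration only; it does not appear in the code.

Define the scatter matrix over the first $k$ samples, with $\mu_k$ their mean:

$S_k = \sum_{i=1}^{k} (x_i - \mu_k)(x_i - \mu_k)^T$

so that the sample covariance is $S_k / (k-1)$. In particular $S_{n-1} = (n-2) \Sigma$ and $S_n = (n-1) \Sigma_{\text{new}}$.

A standard running-sum identity (often attributed to Welford) says that adding one sample changes the scatter matrix by exactly one outer product:

$S_n = S_{n-1} + \delta_{\text{old}} \cdot \delta_{\text{new}}^T$

Dividing both sides by $n-1$ gives the update rule:

$\Sigma_{\text{new}} = \frac{S_n}{n-1} = \frac{n-2}{n-1} \Sigma + \frac{\delta_{\text{old}} \cdot \delta_{\text{new}}^T}{n-1}$

Substituting the new mean into $\delta_{\text{new}}$ also shows that the two deviation vectors are parallel:

$\delta_{\text{new}} = x_{\text{new}} - \mu - \frac{x_{\text{new}} - \mu}{n} = \frac{n-1}{n} \delta_{\text{old}}$

so the outer product is symmetric and $\Sigma_{\text{new}}$ stays symmetric.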

### In practice

Updating a covariance matrix with a new row is as simple as the following, taken directly from the Python reference impl:

```python
import torch


def rank1_update(cov_matrix, new_row, mean, n):
    # calculate the new mean
    new_mean = mean + (new_row - mean) / n
    # update the covariance matrix
    delta_old = new_row - mean
    delta_new = new_row - new_mean
    cov_matrix = ((n - 2) / (n - 1)) * cov_matrix + torch.outer(delta_old, delta_new) / (n - 1)
    return cov_matrix, new_mean
```

While it's a bit noisier, the CUDA kernel bears a strong resemblance:

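```c++
__global__ void rank1_update_kernel(float* d_cov_matrix, const float* d_new_row, float* d_mean, int cols, int n) {
    // one thread per row of the covariance matrix
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= cols) return;
    // calculate the new mean for this thread's column
    const float new_mean = fmaf(d_new_row[idx] - d_mean[idx], 1.0f / n, d_mean[idx]);
    // deviation of this column from its updated mean (constant across the row)
    const float delta_new = d_new_row[idx] - new_mean;
    // update the covariance matrix
    const int row_start = idx * cols;
    for (int j = 0; j < cols; ++j) {
        const float delta_old = d_new_row[j] - d_mean[j];
        d_cov_matrix[row_start + j] = fmaf(delta_old, delta_new / (n - 1), (n - 2) / static_cast<float>(n - 1) * d_cov_matrix[row_start + j]);
    }
    // NOTE: other threads may still be reading d_mean[idx] in their loops;
    // writing the new mean to a separate buffer would avoid that race
    d_mean[idx] = new_mean;
}
```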
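
As a quick sanity check of the update rule, the sketch below streams rows through a NumPy port of the Python function and compares the result against `np.cov`. The name `rank1_update_np`, the random data, and the way the state is seeded from the first row are all assumptions made for this example; the README does not show how the repo initialises the mean and covariance.

```python
import numpy as np


def rank1_update_np(cov_matrix, new_row, mean, n):
    """NumPy port of the update rule; `n` counts samples including `new_row`."""
    new_mean = mean + (new_row - mean) / n
    delta_old = new_row - mean
    delta_new = new_row - new_mean
    cov_matrix = ((n - 2) / (n - 1)) * cov_matrix + np.outer(delta_old, delta_new) / (n - 1)
    return cov_matrix, new_mean


rng = np.random.default_rng(0)
data = rng.standard_normal((50, 3))          # hypothetical stream: 50 rows, 3 columns

# Assumed seeding: after one sample the mean is that row and the covariance is zero.
mean = data[0].copy()
cov = np.zeros((3, 3))

# Feed the remaining rows one at a time; n is the sample count after each update.
for n in range(2, data.shape[0] + 1):
    cov, mean = rank1_update_np(cov, data[n - 1], mean, n)

max_abs_err = np.abs(cov - np.cov(data, rowvar=False)).max()
print(max_abs_err)
```

Each update costs one outer product over the columns, regardless of how many rows have already been seen.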
