https://github.com/gpuengineering/gputils

A C++ header-only library for parallel linear algebra on GPUs (CUDA/cuBLAS under the hood)
https://github.com/gpuengineering/gputils
cplusplus-17 cplusplus-20 cpp cuda cuda-c cuda-cpp cuda-programming header-only linear-algebra
Last synced: 5 months ago
JSON representation
A C++ header-only library for parallel linear algebra on GPUs (CUDA/cuBLAS under the hood)
Host: GitHub
URL: https://github.com/gpuengineering/gputils
Owner: GPUEngineering
License: gpl-3.0
Created: 2024-04-10T18:52:56.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-03-28T22:53:57.000Z (9 months ago)
Last Synced: 2025-07-30T10:09:41.316Z (5 months ago)
Topics: cplusplus-17, cplusplus-20, cpp, cuda, cuda-c, cuda-cpp, cuda-programming, header-only, linear-algebra
Language: Cuda
Homepage:
Size: 401 KB
Stars: 4
Watchers: 1
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Codeowners: .github/CODEOWNERS
Awesome Lists containing this project

README

          # GPUtils

## 1. DTensor

The `DTensor` class is for manipulating data on a GPU. 

It manages their memory and facilitates various algebraic operations.

A tensor has three axes: `[rows (m) x columns (n) x matrices (k)]`.

An (m,n,1)-tensor stores a _matrix_, and an (m,1,1)-tensor stores a _vector_.

We first need to decide on a data type between `float` or `double`.

We will use `float` in the following examples.

### 1.1. Vectors

The simplest way to create an empty `DTensor` object is by constructing a vector:

```c++

size_t n = 100;

DTensor myTensor(n);

```

> [!IMPORTANT] 

> This creates an n-dimensional vector as an (n,1,1)-tensor on the device.

A `DTensor` can be instantiated from host memory:

```c++

std::vector h_a{4., -5., 6., 9., 8., 5., 9., -10.2, 9., 11.};

DTensor myTensor(h_a, h_a.size());

std::cout << myTensor << "\n";

```

> [!CAUTION]

> Printing a `DTensor` to `std::cout` will slow down your program

> (it requires the data to be downloaded from the device).

> Printing was designed for quick debugging.

We will often need to create slices (or shallow copies) of a `DTensor` 

given a range of values. We can then do:

```c++

size_t axis = 0;  // rows=0, cols=1, mats=2

size_t from = 3;

size_t to = 5;

DTensor mySlice(myTensor, axis, from, to);

std::cout << mySlice << "\n";

```

Sometimes we need to reuse an already allocated `DTensor` by uploading 

new data from the host by using the method `upload`. Here is a short example:

```c++

std::vector h_a{1., 2., 3.};  // host data a

DTensor myVec(h_a, 3);  // create vector in tensor on device

std::vector h_b{4., -5., 6.};  // host data b

myVec.upload(h_b);

std::cout << myVec << "\n";

```

We can upload some host data to a particular position of a `DTensor` as follows:

```c++

std::vector hostData{1., 2., 3.};

// here, `true` tells the constructor to set all allocated elements to zero

DTensor x(7, 1, 1, true);  // x = [0, 0, 0, 0, 0, 0, 0]'

DTensor mySlice(x, 0, 3, 5); 

mySlice.upload(hostData);

std::cout << x << "\n";  // x = [0, 0, 0, 1, 2, 3, 0]'

```

If necessary, the data can be downloaded from the device to the host using 

`download`.

Very often we will also need to copy data from an existing `DTensor`

to another `DTensor` (without passing through the host).

To do this we can use `deviceCopyTo`. Here is an example:

```c++

DTensor x(10);

DTensor y(10);

x.deviceCopyTo(y);  // x ---> y (device memory to device memory)

```

The copy constructor has also been implemented; to hard-copy a `DTensor` just

do `DTensor myCopy(existingTensor)`.

Lastly, a not so efficient method that should only be used for 

debugging, if at all, is the `()` operator (e.g., `x(i, j, k)`), which fetches

one element of the `DTensor` to the host.

This cannot be used to set a value, so don't do anything like `x(0, 0, 0) = 4.5`!

> [!CAUTION]

> For the love of god, do not put this `()` operator in a loop.

### 1.2. Computation of scalar quantities

The following scalar quantities can be computed (internally, 

we use `cublas` functions):

- `.normF()`: the Frobenius norm of a tensor $x$, using `nrm2` (i.e., the 2-norm, or Euclidean norm, if $x$ is a vector)

- `.sumAbs()`: the sum of the absolute of all the elements, using `asum` (i.e., the 1-norm if $x$ is a vector)

### 1.3. Some cool operators

We can element-wise add `DTensor`s on the device as follows:

```c++

std::vector host_x{1., 2., 3., 4., 5., 6.,  7.};

std::vector host_y{1., 3., 5., 7., 9., 11., 13.};

DTensor x(host_x, host_x.size());

DTensor y(host_y, host_y.size());

x += y;  // x = [2, 5, 8, 11, 14, 17, 20]'

std::cout << x << "\n";

```

To element-wise subtract `y` from `x` we can use `x -= y`.

We can also scale a `DTensor` by a scalar with `*=` (e.g, `x *= 5.0f`). 

To negate the values of a `DTensor` we can do `x *= -1.0f`.

We can also compute the inner product (as a (1,1,1)-tensor) of two vectors as follows:

```c++

std::vector host_x{1., 2., 3., 4., 5., 6.,  7.};

std::vector host_y{1., 3., 5., 7., 9., 11., 13.};

DTensor xtr(host_x, 1, host_x.size());  // column vector

DTensor y(host_y, host_y.size());  // row vector

DTensor innerProduct = x * y;

```

If necessary, we can also use the following element-wise operations

```c++

DTensor x(host_x, host_x.size());  // row vector

auto sum = x + y;

auto diff = x - y;

auto scaledX = 3.0f * x;

```

### 1.4. Matrices

To store a matrix in a `DTensor` we need to provide the data in an array; 

we can use either column-major (default) or row-major format.

```TODO implement row-major```

Suppose we need to store the matrix 

$$A = \begin{bmatrix}

1 & 2 & 3 \\

4 & 5 & 6 \\

7 & 8 & 9 \\

10 & 11 & 12 \\

13 & 14 & 15

\end{bmatrix},$$

where this data is stored in row-major format.

Then, we do

```c++

size_t rows = 5;

size_t cols = 3;

std::vector h_data{1.0f, 2.0f, 3.0f,

                          4.0f, 5.0f, 6.0f,

                          7.0f, 8.0f, 9.0f,

                          10.0f, 11.0f, 12.0f,

                          13.0f, 14.0f, 15.0f};

DTensor myTensor(h_data, rows, cols, 1, rowMajor);

```

Choose `rowMajor` or `columnMajor` as appropriate.

We can also preallocate memory for a `DTensor` as follows:

```c++

DTensor a(rows, cols, 1);

```

Then, we can upload the data as follows:

```c++

a.upload(h_data, rowMajor);

```

The copy constructor has also been implemented; 

to hard-copy a vector just do 

`DTensor myCopy(existingTensor)`.

The number of rows and columns of a `DTensor` can be 

retrieved using the methods `.numRows()` and `.numCols()` respectively.

### 1.5. More operations

The operators `+=` are `-=` supported for device matrices.

Matrix-matrix multiplication is as simple as:

```c++

size_t m = 2, k = 3, n=5;

std::vector aData{1.0f,  2.0f,  3.0f,

                         4.0f,  5.0f,  6.0f};

std::vector bData{1.0f,  2.0f,  3.0f,  4.0f,  5.0f,

                         6.0f,  7.0f,  8.0f,  9.0f, 10.0f,

                         11.0f, 12.0f, 13.0f, 14.0f, 15.0f};

DTensor A(aData, m, k, 1, rowMajor);

DTensor B(bData, k, n, 1, rowMajor);

auto X = A * B;

std::cout << A << B << X << "\n";

```

### 1.6. Tensors

As you would expect, all operations mentioned so far are supported by actual tensors

as batched operations (that is, (m,n)-matrix-wise).

Also, we can create the transposes of a `DTensor` using `.tr()`.

This transposes each (m,n)-matrix and stores it in a new `DTensor`

at the same k-index.

Transposition in-place is not possible.

### 1.7. Least squares

The solution of least squares has been implmented as a tensor method.

Say we want to solve `A\b` using least squares.

We first create $A$ and $b$

```c++

size_t m = 4;

size_t n = 3;

std::vector aData{1.0f, 2.0f, 4.0f,

                         2.0f, 13.0f, 23.0f,

                         4.0f, 23.0f, 77.0f,

                         6.0f, 7.0f, 8.0f};

std::vector bData{1.0f, 2.0f, 3.0f, 4.0f};

DTensor A(aData, m, n, 1, rowMajor);

DTensor B(bData, m);

```

Then, we can solve the system by

```c++

A.leastSquaresBatched(B);

```

The `DTensor` `B` will be overwritten with the solution.

> [!IMPORTANT]

> This particular example demonstrates how the solution may 

> overwrite only part of the given `B`, as `B` is a

> (4,1,1)-tensor and the solution is a (3,1,1)-tensor.

### 1.8. Saving and loading tensors

Tensor data can be stored in simple text files or binary files. 

The text-based format has the following structure

```text

number_of_rows

number_of_columns

number_of_matrices

data (one entry per line)

```

To save a tensor in a file, simply call `DTensor::saveToFile(filename)`.

If the file extension is `.bt` (binary tensor), the data will be stored in binary format.

The structure of the binary encoding is similar to that of the text encoding:

the first three `uint64_t`-sized positions correspond to the number of rows, columns

and matrices, followed by the elements of the tensor.

To load a tensor from a file, the static function `DTensor::parseFromFile(filename)` can be used. For example:

```c++

auto z = DTensor::parseFromFile("path/to/my.dtensor")

```

If necessary, you can provide a second argument to `parseFromFile` to specify the order in which the data are stored (the `StorageMode`).

Soon we will release a Python API for reading and serialising (numpy) arrays to `.bt` files.  

## 2. Cholesky factorisation and system solution

> [!WARNING]

> This factorisation only works with positive-definite matrices.

Here is an example:

$$A = \begin{bmatrix}

1 & 2 & 4 \\

2 & 13 & 23 \\

4 & 23 & 77

\end{bmatrix}.$$

This is how to perform a Cholesky factorisation:

```c++

size_t n = 3;

std::vector aData{1.0f, 2.0f, 4.0f,

                         2.0f, 13.0f, 23.0f,

                         4.0f, 23.0f, 77.0f};

DTensor A(aData, n, n, 1, rowMajor);

CholeskyFactoriser cfEngine(A);

status = cfEngine.factorise();

```

Then, you can solve the system `A\b`

```c++

std::vector bData{1.0f, 2.0f, 3.0f};

DTensor B(bData, n);

cfEngine.solve(B);

```

The `DTensor` `B` will be overwritten with the solution. 

## 3. Singular Value Decomposition

> [!WARNING] 

> This implementation only works with square or tall matrices. 

Here is an example with the 4-by-3 matrix

$$B = \begin{bmatrix}

1 & 2 & 3 \\

6 & 7 & 8 \\

6 & 7 & 8 \\

6 & 7 & 8 

\end{bmatrix}.$$

Evidently, the rank of $B$ is 2, so there will be two nonzero singular values.

This is how to perform an SVD decomposition:

```c++

size_t m = 4;

size_t n = 3;

std::vector bData{1.0f, 2.0f, 3.0f,

                         6.0f, 7.0f, 8.0f,

                         6.0f, 7.0f, 8.0f,

                         6.0f, 7.0f, 8.0f};

DTensor B(bData, m, n, 1, rowMajor);

SvdFactoriser svdEngine(B);

status = svdEngine.factorise();

```

By default, `SvdFactoriser` will not compute matrix $U$. If you need it,

create an instance of `SvdFactoriser` as follows

```c++

SvdFactoriser svdEngine(B, true); // computes U

```

Note that the default behaviour of `.factorise()` is to destroy

the given matrix $B$. If you want the factoriser to keep your 

matrix, you need to set the third argument of the above constructor

to `false`.

After you have factorised the matrix, you can access $S$, $V'$ and, perhaps, $U$.

You can do:

```c++

std::cout << "S = " << svdEngine.singularValues() << "\n";

std::cout << "V' = " << svdEngine.rightSingularVectors() << "\n";

```

Note that $U$ can be obtained, if it is computed 

in the first place, by the method 

`.leftSingularVectors()` which returns an object 

of type [`std::optional>`](https://dev.to/delta456/modern-c-stdoptional-58ga).

Here is an example:

```c++

auto U = svdEngine.leftSingularVectors();

if (U) std::cout << "U = " << U.value();

```

## 4. Projection onto a nullspace

The nullspace of a matrix is computed by SVD.

The user provides a `DTensor` made of (padded) matrices.

Then, `Nullspace` computes, possibly pads, and returns the

nullspace matrices `N = (N1, ..., Nk)` in another `DTensor`.

```c++

DTensor paddedMatrices(m, n, k);

Nullspace N(paddedMatrices);  // computes N and NN'

DTensor ns = N.nullspace();  // returns N

```

Each padded nullspace matrix `Ni` is orthogonal, 

and `Nullspace` further computes and stores the

nullspace projection operators `NN' = (N1N1', ..., NkNk')`.

This allows the user to project-in-place onto the nullspace.

```c++

DTensor vectors(m, 1, k);

N.project(vectors);

std::cout << vectors << "\n";

```

## 5. Other

We can get the total allocated bytes (on the GPU) with

```c++

size_t allocatedBytes = 

    Session::getInstance().totalAllocatedBytes();

```

GPUtils supports multiple streams. By default a single stream is created, 

but you can set the number of streams you need with 

```c++

/* This needs to be the first line in your code */

Session::setStreams(4); // create 4 strems

```

Then, you can use `setStreamIdx` to select a stream to go with your instance of `DTensor`

```c++

auto a = DTensor::createRandomTensor(3, 6, 4, -1, 1).setStreamIdx(0);

auto b = DTensor::createRandomTensor(3, 6, 4, -1, 1).setStreamIdx(1);

// do stuff...

Session::getInstance().synchronizeAllStreams();

```

## Happy number crunching!
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/gpuengineering/gputils

Awesome Lists containing this project

README