{"id":30195165,"url":"https://github.com/gpuengineering/gputils","last_synced_at":"2025-08-13T03:47:46.860Z","repository":{"id":232661673,"uuid":"784886172","full_name":"GPUEngineering/GPUtils","owner":"GPUEngineering","description":"A C++ header-only library for parallel linear algebra on GPUs (CUDA/cuBLAS under the hood)","archived":false,"fork":false,"pushed_at":"2025-03-28T22:53:57.000Z","size":411,"stargazers_count":4,"open_issues_count":2,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-30T10:09:41.316Z","etag":null,"topics":["cplusplus-17","cplusplus-20","cpp","cuda","cuda-c","cuda-cpp","cuda-programming","header-only","linear-algebra"],"latest_commit_sha":null,"homepage":"","language":"Cuda","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GPUEngineering.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-04-10T18:52:56.000Z","updated_at":"2025-03-28T21:27:27.000Z","dependencies_parsed_at":"2024-05-22T12:52:49.130Z","dependency_job_id":"a565085b-3893-4d2c-b70b-5a609477be76","html_url":"https://github.com/GPUEngineering/GPUtils","commit_stats":null,"previous_names":["gpuengineering/gputils"],"tags_count":20,"template":false,"template_full_name":null,"purl":"pkg:github/GPUEngineering/GPUtils","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GPUEngineering%2FGPUtils","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GPUEngineering%2FGPUtils/tags","releases_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/GPUEngineering%2FGPUtils/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GPUEngineering%2FGPUtils/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GPUEngineering","download_url":"https://codeload.github.com/GPUEngineering/GPUtils/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GPUEngineering%2FGPUtils/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270177771,"owners_count":24540324,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-13T02:00:09.904Z","response_time":66,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cplusplus-17","cplusplus-20","cpp","cuda","cuda-c","cuda-cpp","cuda-programming","header-only","linear-algebra"],"created_at":"2025-08-13T03:47:44.921Z","updated_at":"2025-08-13T03:47:46.830Z","avatar_url":"https://github.com/GPUEngineering.png","language":"Cuda","readme":"# GPUtils\n\n## 1. DTensor\n\nThe `DTensor` class is for manipulating data on a GPU. \nIt manages their memory and facilitates various algebraic operations.\n\nA tensor has three axes: `[rows (m) x columns (n) x matrices (k)]`.\nAn (m,n,1)-tensor stores a _matrix_, and an (m,1,1)-tensor stores a _vector_.\n\nWe first need to decide on a data type between `float` or `double`.\nWe will use `float` in the following examples.\n\n### 1.1. 
### 1.1. Vectors

The simplest way to create an empty `DTensor` object is by constructing a vector:

```c++
size_t n = 100;
DTensor<float> myTensor(n);
```

> [!IMPORTANT]
> This creates an n-dimensional vector as an (n,1,1)-tensor on the device.

A `DTensor` can be instantiated from host memory:

```c++
std::vector<float> h_a{4., -5., 6., 9., 8., 5., 9., -10.2, 9., 11.};
DTensor<float> myTensor(h_a, h_a.size());
std::cout << myTensor << "\n";
```

> [!CAUTION]
> Printing a `DTensor` to `std::cout` will slow down your program
> (it requires the data to be downloaded from the device).
> Printing was designed for quick debugging.

We will often need to create slices (shallow copies) of a `DTensor`
over a range of indices along one axis. We can do:

```c++
size_t axis = 0;  // rows=0, cols=1, mats=2
size_t from = 3;
size_t to = 5;
DTensor<float> mySlice(myTensor, axis, from, to);
std::cout << mySlice << "\n";
```

Sometimes we need to reuse an already allocated `DTensor` by uploading
new data from the host with the method `upload`.
Here is a short example:

```c++
std::vector<float> h_a{1., 2., 3.};  // host data a
DTensor<float> myVec(h_a, 3);  // create vector in tensor on device
std::vector<float> h_b{4., -5., 6.};  // host data b
myVec.upload(h_b);
std::cout << myVec << "\n";
```

We can upload host data to a particular position of a `DTensor` as follows:

```c++
std::vector<float> hostData{1., 2., 3.};
// here, `true` tells the constructor to set all allocated elements to zero
DTensor<float> x(7, 1, 1, true);  // x = [0, 0, 0, 0, 0, 0, 0]'
DTensor<float> mySlice(x, 0, 3, 5);
mySlice.upload(hostData);
std::cout << x << "\n";  // x = [0, 0, 0, 1, 2, 3, 0]'
```

If necessary, the data can be downloaded from the device to the host using
`download`.

Very often we will also need to copy data from an existing `DTensor`
to another `DTensor` (without passing through the host).
To do this we can use `deviceCopyTo`. Here is an example:

```c++
DTensor<float> x(10);
DTensor<float> y(10);
x.deviceCopyTo(y);  // x ---> y (device memory to device memory)
```

The copy constructor has also been implemented; to hard-copy a `DTensor` just
do `DTensor<float> myCopy(existingTensor);`.

Lastly, a not-so-efficient method that should only be used for
debugging, if at all, is the `()` operator (e.g., `x(i, j, k)`), which fetches
a single element of the `DTensor` to the host.
It cannot be used to set a value, so don't write anything like `x(0, 0, 0) = 4.5`!

> [!CAUTION]
> For the love of god, do not put this `()` operator in a loop:
> every call triggers a separate device-to-host transfer.
### 1.2. Computation of scalar quantities

The following scalar quantities can be computed (internally,
we use `cublas` functions):

- `.normF()`: the Frobenius norm of a tensor $x$, using `nrm2` (i.e., the 2-norm, or Euclidean norm, if $x$ is a vector)
- `.sumAbs()`: the sum of the absolute values of all the elements, using `asum` (i.e., the 1-norm if $x$ is a vector)

### 1.3. Some cool operators

We can element-wise add `DTensor`s on the device as follows:

```c++
std::vector<float> host_x{1., 2., 3., 4., 5., 6.,  7.};
std::vector<float> host_y{1., 3., 5., 7., 9., 11., 13.};
DTensor<float> x(host_x, host_x.size());
DTensor<float> y(host_y, host_y.size());
x += y;  // x = [2, 5, 8, 11, 14, 17, 20]'
std::cout << x << "\n";
```

To element-wise subtract `y` from `x` we can use `x -= y`.

We can also scale a `DTensor` by a scalar with `*=` (e.g., `x *= 5.0f`).
To negate the values of a `DTensor` we can do `x *= -1.0f`.

We can also compute the inner product (as a (1,1,1)-tensor) of two vectors as follows:

```c++
std::vector<float> host_x{1., 2., 3., 4., 5., 6.,  7.};
std::vector<float> host_y{1., 3., 5., 7., 9., 11., 13.};
DTensor<float> xtr(host_x, 1, host_x.size());  // row vector (1-by-7)
DTensor<float> y(host_y, host_y.size());       // column vector
DTensor<float> innerProduct = xtr * y;
```

If necessary, we can also use the following element-wise operations:

```c++
DTensor<float> x(host_x, host_x.size());  // column vector
auto sum = x + y;
auto diff = x - y;
auto scaledX = 3.0f * x;
```
### 1.4. Matrices

To store a matrix in a `DTensor` we need to provide the data in an array;
we can use either column-major (the default) or row-major format.
Suppose we need to store the matrix

$$A = \begin{bmatrix}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 9 \\
10 & 11 & 12 \\
13 & 14 & 15
\end{bmatrix},$$

where this data is stored in row-major format.
Then, we do

```c++
size_t rows = 5;
size_t cols = 3;
std::vector<float> h_data{1.0f, 2.0f, 3.0f,
                          4.0f, 5.0f, 6.0f,
                          7.0f, 8.0f, 9.0f,
                          10.0f, 11.0f, 12.0f,
                          13.0f, 14.0f, 15.0f};
DTensor<float> myTensor(h_data, rows, cols, 1, rowMajor);
```

Choose `rowMajor` or `columnMajor` as appropriate.

We can also preallocate memory for a `DTensor` as follows:

```c++
DTensor<float> a(rows, cols, 1);
```

Then, we can upload the data as follows:

```c++
a.upload(h_data, rowMajor);
```

The copy constructor has also been implemented;
to hard-copy a `DTensor` just do
`DTensor<float> myCopy(existingTensor);`.

The number of rows and columns of a `DTensor` can be
retrieved using the methods `.numRows()` and `.numCols()` respectively.
### 1.5. More operations

The operators `+=` and `-=` are supported for device matrices.

Matrix-matrix multiplication is as simple as:

```c++
size_t m = 2, k = 3, n = 5;
std::vector<float> aData{1.0f,  2.0f,  3.0f,
                         4.0f,  5.0f,  6.0f};
std::vector<float> bData{1.0f,  2.0f,  3.0f,  4.0f,  5.0f,
                         6.0f,  7.0f,  8.0f,  9.0f, 10.0f,
                         11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
DTensor<float> A(aData, m, k, 1, rowMajor);
DTensor<float> B(bData, k, n, 1, rowMajor);
auto X = A * B;
std::cout << A << B << X << "\n";
```

### 1.6. Tensors

As you would expect, all operations mentioned so far are supported by actual tensors
as batched operations (that is, (m,n)-matrix-wise).

Also, we can create the transpose of a `DTensor` using `.tr()`.
This transposes each (m,n)-matrix and stores it in a new `DTensor`
at the same k-index.
In-place transposition is not possible.

### 1.7. Least squares

The solution of least-squares problems has been implemented as a tensor method.
Say we want to solve `A\b` in the least-squares sense.
We first create $A$ and $b$:

```c++
size_t m = 4;
size_t n = 3;
std::vector<float> aData{1.0f, 2.0f, 4.0f,
                         2.0f, 13.0f, 23.0f,
                         4.0f, 23.0f, 77.0f,
                         6.0f, 7.0f, 8.0f};
std::vector<float> bData{1.0f, 2.0f, 3.0f, 4.0f};
DTensor<float> A(aData, m, n, 1, rowMajor);
DTensor<float> B(bData, m);
```

Then, we can solve the system by

```c++
A.leastSquaresBatched(B);
```

The `DTensor` `B` will be overwritten with the solution.

> [!IMPORTANT]
> This particular example demonstrates how the solution may
> overwrite only part of the given `B`: `B` is a
> (4,1,1)-tensor, while the solution is a (3,1,1)-tensor.
### 1.8. Saving and loading tensors

Tensor data can be stored in simple text files or binary files.
The text-based format has the following structure:

```text
number_of_rows
number_of_columns
number_of_matrices
data (one entry per line)
```

To save a tensor to a file, simply call `DTensor::saveToFile(filename)`.

If the file extension is `.bt` (binary tensor), the data will be stored in binary format.
The structure of the binary encoding is similar to that of the text encoding:
the first three `uint64_t`-sized positions correspond to the number of rows, columns
and matrices, followed by the elements of the tensor.

To load a tensor from a file, the static function `DTensor<T>::parseFromFile(filename)` can be used. For example:

```c++
auto z = DTensor<double>::parseFromFile("path/to/my.dtensor");
```

If necessary, you can provide a second argument to `parseFromFile` to specify the order in which the data are stored (the `StorageMode`).

Soon we will release a Python API for reading and serialising (numpy) arrays to `.bt` files.

## 2. Cholesky factorisation and system solution

> [!WARNING]
> This factorisation only works with positive-definite matrices.

Here is an example:

$$A = \begin{bmatrix}
1 & 2 & 4 \\
2 & 13 & 23 \\
4 & 23 & 77
\end{bmatrix}.$$

This is how to perform a Cholesky factorisation:

```c++
size_t n = 3;
std::vector<float> aData{1.0f, 2.0f, 4.0f,
                         2.0f, 13.0f, 23.0f,
                         4.0f, 23.0f, 77.0f};
DTensor<float> A(aData, n, n, 1, rowMajor);
CholeskyFactoriser<float> cfEngine(A);
auto status = cfEngine.factorise();
```

Then, you can solve the system `A\b`:

```c++
std::vector<float> bData{1.0f, 2.0f, 3.0f};
DTensor<float> B(bData, n);
cfEngine.solve(B);
```

The `DTensor` `B` will be overwritten with the solution.
## 3. Singular Value Decomposition

> [!WARNING]
> This implementation only works with square or tall matrices.

Here is an example with the 4-by-3 matrix

$$B = \begin{bmatrix}
1 & 2 & 3 \\
6 & 7 & 8 \\
6 & 7 & 8 \\
6 & 7 & 8
\end{bmatrix}.$$

Evidently, the rank of $B$ is 2, so there will be two nonzero singular values.

This is how to perform an SVD:

```c++
size_t m = 4;
size_t n = 3;
std::vector<float> bData{1.0f, 2.0f, 3.0f,
                         6.0f, 7.0f, 8.0f,
                         6.0f, 7.0f, 8.0f,
                         6.0f, 7.0f, 8.0f};
DTensor<float> B(bData, m, n, 1, rowMajor);
SvdFactoriser<float> svdEngine(B);
auto status = svdEngine.factorise();
```

By default, `SvdFactoriser` will not compute the matrix $U$. If you need it,
create an instance of `SvdFactoriser` as follows:

```c++
SvdFactoriser<float> svdEngine(B, true); // computes U
```

Note that the default behaviour of `.factorise()` is to destroy
the given matrix $B$. If you want the factoriser to keep your
matrix, you need to set the third argument of the above constructor
to `false`.

After you have factorised the matrix, you can access $S$, $V'$ and, perhaps, $U$.
You can do:

```c++
std::cout << "S = " << svdEngine.singularValues() << "\n";
std::cout << "V' = " << svdEngine.rightSingularVectors() << "\n";
```

Note that $U$, if it was computed in the first place, can be obtained
with the method `.leftSingularVectors()`, which returns an object
of type [`std::optional<DeviceMatrix<TElement>>`](https://dev.to/delta456/modern-c-stdoptional-58ga).
Here is an example:

```c++
auto U = svdEngine.leftSingularVectors();
if (U) std::cout << "U = " << U.value();
```
## 4. Projection onto a nullspace

The nullspace of a matrix is computed by SVD.
The user provides a `DTensor` made of (padded) matrices.
Then, `Nullspace` computes, possibly pads, and returns the
nullspace matrices `N = (N1, ..., Nk)` in another `DTensor`.

```c++
DTensor<float> paddedMatrices(m, n, k);
Nullspace N(paddedMatrices);  // computes N and NN'
DTensor<float> ns = N.nullspace();  // returns N
```

Each padded nullspace matrix `Ni` is orthogonal,
and `Nullspace` further computes and stores the
nullspace projection operators `NN' = (N1N1', ..., NkNk')`.
This allows the user to project in place onto the nullspace:

```c++
DTensor<float> vectors(m, 1, k);
N.project(vectors);
std::cout << vectors << "\n";
```

## 5. Other

We can get the total number of allocated bytes (on the GPU) with

```c++
size_t allocatedBytes =
    Session::getInstance().totalAllocatedBytes();
```

GPUtils supports multiple streams. By default a single stream is created,
but you can set the number of streams you need with

```c++
/* This needs to be the first line in your code */
Session::setStreams(4); // create 4 streams
```

Then, you can use `setStreamIdx` to select a stream to go with your instance of `DTensor`:

```c++
auto a = DTensor<double>::createRandomTensor(3, 6, 4, -1, 1).setStreamIdx(0);
auto b = DTensor<double>::createRandomTensor(3, 6, 4, -1, 1).setStreamIdx(1);
// do stuff...
Session::getInstance().synchronizeAllStreams();
```

## Happy number crunching!