# portBLAS Implementation

[![OpenSSF Scorecard](https://api.scorecard.dev/projects/github.com/codeplaysoftware/portBLAS/badge)](https://scorecard.dev/viewer/?uri=github.com/codeplaysoftware/portBLAS)

portBLAS implements BLAS - [Basic Linear Algebra Subroutines](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) - using [SYCL](https://www.khronos.org/sycl/).

**Important Note:** The portBLAS project has been transferred to the UXL Foundation under
[generic-sycl-components](https://github.com/uxlfoundation/generic-sycl-components).
For an equivalent of portBLAS, please use
[oneMath](https://github.com/uxlfoundation/oneMath) and enable only the generic
SYCL BLAS backend.
portBLAS is an ongoing collaboration with the *High Performance Computing
& Architectures (HPCA) group* from the Universitat Jaume I [UJI](http://www.hpca.uji.es/).

portBLAS is written using modern C++. The current implementation uses C++11
features.
See [Roadmap](Roadmap.md) for details on the current status and plans for
the project.

## Table of Contents

- [portBLAS Implementation](#portblas-implementation)
  - [Table of Contents](#table-of-contents)
  - [Motivation](#motivation)
  - [Basic Concepts](#basic-concepts)
    - [Views](#views)
    - [Operations](#operations)
    - [SB\_Handle](#sb_handle)
    - [Interface](#interface)
  - [API description](#api-description)
    - [BLAS 1](#blas-1)
    - [BLAS 2](#blas-2)
    - [BLAS 3](#blas-3)
    - [EXTENSION](#extension)
    - [Experimental Joint Matrix Support](#experimental-joint-matrix-support)
  - [Requirements](#requirements)
  - [Setup](#setup)
    - [Compile with DPC++](#compile-with-dpc)
    - [Compile with AdaptiveCpp *(Formerly hipSYCL)*](#compile-with-adaptivecpp-formerly-hipsycl)
    - [Installing portBLAS](#installing-portblas)
    - [Doxygen](#doxygen)
    - [CMake options](#cmake-options)
  - [Tests and benchmarks](#tests-and-benchmarks)
  - [Contributing to the project](#contributing-to-the-project)
    - [Guides and Other Documents](#guides-and-other-documents)

## Motivation

The same numerical operations are computed to solve many scientific problems
and engineering applications, such as image and signal processing,
telecommunications, computational finance, materials science simulations,
structural biology, data mining, bio-informatics, fluid dynamics, and many other
areas. It has long been observed that around 90% of the computational cost is
spent in roughly 10% of the code, so any improvement in that 10% of the code
has a great impact on the performance of applications.
Numerical Linear Algebra is the field in charge of identifying the most
common operations and seeking their best implementation. To do this,
researchers must consider the numerical stability of the selected algorithm
and the platform on which the operation will be executed. The first analysis
studies the accuracy of the solution, while the second compares the
performance of the different implementations in order to select the best one.

Nowadays, numerical computations are based on a set of standard libraries in
which the most common operations are implemented. These libraries
are different for dense matrices (BLAS, LAPACK, ScaLAPACK, ...) and for sparse
matrices (SparseBLAS, ...). Moreover, there are vendor implementations which
are tuned to the features of each platform:
  - For multicores: ACML (AMD), ATLAS, Intel-MKL, OpenBLAS, ...
  - For GPUs: cuBLAS (Nvidia), clBLAS, CLBlast, MAGMA, ...

In any case, BLAS is always the lowest level in the hierarchy
of numerical libraries, so a good BLAS implementation improves the
performance of all the other libraries. The development of numerical
libraries on SYCL is one of the most important objectives, because it will
improve the performance of other SYCL applications. It therefore made sense
for portBLAS to be the first step in this task.

On GPUs, the data communication to/from the device and the granularity of the
kernels play an important role in the performance of an implementation.
On the one hand, to reduce the communication cost, most of the data should be
kept on the device, even scalars. On the other hand, enlarging the kernels
allows the CPU to complete other tasks while the GPU is computing, or to
enter an energy-efficient C-state, reducing energy consumption.

Enlarging the grain of the kernels is a complex task in which many aspects
must be considered, such as the dependencies between kernels, the grid
topology, the grid sizes, etc. This complexity is why fused kernels are
usually written manually. An alternative that simplifies this task is to
build an expression tree containing all the single operations required to
solve a problem. This structure can be analysed by the compiler to decide how
to merge the different kernels and choose the best grid topology on which to
execute the fused kernel. The use of expression trees is one of the most
important features of portBLAS.

## Basic Concepts

portBLAS uses C++ Expression Tree templates to generate SYCL kernels via
kernel composition.
Expression Tree templates are a widely used technique for implementing
expressions in C++ that facilitates the development and composition of
operations.
In particular,
[kernel composition in SYCL](http://dl.acm.org/citation.cfm?id=2791332) has
been used in various projects to create efficient domain-specific embedded
languages that enable users to easily fuse GPU kernels.

portBLAS can be used
- either as a header-only framework, by including `portblas.hpp` in
an application and passing the `src` folder in the list of include directories,
- or as a library, by including `portblas.h` in an application.

All the relevant files can be found in
the `include` directory.

There are four components in portBLAS: the *Views*, the *Operations*,
the *SB_Handle* and the *Interface* itself.

### Views

The input data for all the operations in portBLAS is passed to the library
using *Views*.
A *View* represents data on top of a container, passed by reference.
Views *do not store data*; they only map a visualization of the data on top
of a container.
This enables the library to implement the different indexing modes of the
BLAS API, such as strides.
Note that a view can have a different size than its container.

All views derive from the base view class or the base matrix view class, which
represent a view of a container as a vector or as a matrix respectively.
The container does not need to be multi-dimensional to store a matrix.
The current restriction is that the container must satisfy the
[*LegacyRandomAccessIterator*](https://en.cppreference.com/w/cpp/named_req/RandomAccessIterator)
requirements of the C++11 standard.

### Operations

Operations among elements of vectors (or matrices) are expressed in the
set of Operation classes.
Operations are templated classes that take templated types as input.
Operations form the nodes of the portBLAS expression tree.
Refer to the documentation of each node type for details.

Composing these operations is how the compile-time expression tree is created:
given an operation node, its children are other Operations.
The leaf nodes of an expression tree are Views or scalar types (data).
The intermediate nodes of the expression tree are operations (e.g.,
binary operations, unary operations, etc).
### SB_Handle

An SB_Handle traverses the expression tree to evaluate the operations that it
defines.
The SB_Handle uses different techniques to evaluate the expression tree.
The SYCL evaluator transforms the tree into a device tree (i.e., converting
buffers into accessors) and then evaluates the expression tree on the device.

### Interface

The different headers in the interface directory implement the traditional
BLAS interface.
Files are organised per BLAS level (1, 2, 3).

When the portBLAS BLAS interface is called, the expression tree for each
operation is constructed and then executed.
Some API calls may execute several kernels (e.g., when a reduction is required).
The expression trees in the API allow operations to be fused at compile time.

Note that, although this library features a BLAS interface, users can also
directly compose their own expression trees to combine multiple operations.
The CG example shows an implementation of the Conjugate Gradient method that
uses various expression trees to demonstrate how to achieve compile-time
kernel fusion of multiple BLAS operations.

## API description

This section references all the supported operations and their interface. The
library follows the [oneAPI MKL BLAS specification](https://spec.oneapi.io/versions/latest/elements/oneMKL/source/domains/blas/blas.html)
as the reference for the API. Both the USM and buffer APIs are supported;
however, the group APIs for USM are not. Mixing USM and buffer arguments
together when compiling the library is not supported; we stick to the
aforementioned reference specification.

All operations take as their first argument a reference to the SB_Handle, a
`blas::SB_Handle` created with a `sycl::queue`. The last argument for all
operators is a vector of dependencies of type `sycl::event` (empty by default).
The return value is usually an array of SYCL events (except for some
operations that can return a scalar or a tuple). The containers for the
vectors and matrices (and scalars written by the BLAS operations) can either
be raw USM pointers or iterator buffers, created with a call to
`sycl::malloc_device` or `make_sycl_iterator_buffer` respectively.

The USM support in portBLAS is limited to `device` allocated memory only; we
do not support `shared` or `host` USM allocations.
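
As an illustration of these conventions, here is a minimal sketch of calling a
BLAS 1 operator through the USM path. It assumes the header-only setup, a SYCL
2020 compiler, and the `_axpy` argument order listed in the BLAS 1 table
below; waiting on each returned event is one possible way to synchronise, not
the only one.

```cpp
#include <sycl/sycl.hpp>
#include <vector>
#include "portblas.hpp"  // header-only usage; `src` must be on the include path

int main() {
  sycl::queue q;
  blas::SB_Handle sb_handle(q);  // every operation takes the handle first

  constexpr int n = 1024;
  const float alpha = 2.0f;
  std::vector<float> x(n, 1.0f), y(n, 1.0f);

  // USM support is limited to device allocations (no shared/host USM).
  float *d_x = sycl::malloc_device<float>(n, q);
  float *d_y = sycl::malloc_device<float>(n, q);
  q.copy(x.data(), d_x, n).wait();
  q.copy(y.data(), d_y, n).wait();

  // y = alpha * x + y; the call returns SYCL events for the submitted kernels.
  auto events = blas::_axpy(sb_handle, n, alpha, d_x, 1, d_y, 1);
  for (auto &e : events) e.wait();

  q.copy(d_y, y.data(), n).wait();  // y[i] == 3.0f for all i
  sycl::free(d_x, q);
  sycl::free(d_y, q);
}
```
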
We recommend checking the [samples](samples) to get started with portBLAS. It
also helps to be familiar with BLAS:

- [Wikipedia](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms)
- [Netlib reference](http://www.netlib.org/lapack/explore-html/d1/df9/group__blas.html)

### BLAS 1

The following table sums up the interface that can be found in
[blas1_interface.h](include/interface/blas1_interface.h).

For all these operations:

* `vx` and `vy` are containers for vectors `x` and `y`.
* `incx` and `incy` are their increments *(number of steps to jump to the next
   value, 1 for contiguous values)*.
* `N`, an integer, is the size of the vectors *(less than or equal to the size
  of the containers)*.
* `alpha` is a scalar.
* `rs` is a container of size 1, containing either a scalar, an integer, or an
  index-value tuple.
* `c` and `s` for `_rot` are scalars *(cosine and sine)*.
* `sb` for `_sdsdot` is a single precision scalar to be added to the output.

| operation | arguments | description |
|---|---|---|
| `_asum` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | Absolute sum of the vector `x`; written to `rs` if passed, else returned |
| `_axpy` | `sb_handle`, `N`, `alpha`, `vx`, `incx`, `vy`, `incy` | Vector multiply-add: `y = alpha * x + y` |
| `_copy` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy` | Copies a vector to another: `y = x` |
| `_dot` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy` [, `rs`] | Dot product of two vectors `x` and `y`; written to `rs` if passed, else returned |
| `_sdsdot` | `sb_handle`, `N`, `sb`, `vx`, `incx`, `vy`, `incy` [, `rs`] | Computes the sum of a constant `sb` with the double precision dot product of two single precision vectors `x` and `y`; written to `rs` if passed, else returned |
| `_nrm2` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | Euclidean norm of the vector `x`; written to `rs` if passed, else returned |
| `_rot` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy`, `c`, `s` | Applies a plane rotation to `x` and `y` with a cosine `c` and a sine `s` |
| `_rotg` | `sb_handle`, `a`, `b`, `c`, `s` | Given the Cartesian coordinates (`a`, `b`) of a point, returns the parameters `c`, `s`, `r`, and `z` associated with the Givens rotation |
| `_rotm` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy`, `param` | Applies a modified Givens rotation to `x` and `y` |
| `_rotmg` | `sb_handle`, `d1`, `d2`, `x1`, `y1`, `param` | Given the Cartesian coordinates (`x1`, `y1`) of a point, returns the components of a modified Givens transformation matrix that zeroes the y-component of the resulting point |
| `_scal` | `sb_handle`, `N`, `alpha`, `vx`, `incx` | Scalar product of a vector: `x = alpha * x` |
| `_swap` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy` | Interchanges two vectors: `y = x` and `x = y` |
| `_iamax` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | Index of the first occurrence of the maximum element in `x`; written to `rs` if passed, else returned |
| `_iamin` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | Index of the first occurrence of the minimum element in `x`; written to `rs` if passed, else returned |
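
For reductions such as `_dot`, the optional `rs` argument receives the result
in device memory. Below is a small sketch under the same assumptions as the
USM example above (`d_x` and `d_y` are device pointers of length `n`; the
wait-then-copy pattern is one possible way to read the result back):

```cpp
#include <sycl/sycl.hpp>
#include "portblas.hpp"

// Sketch: dot product written to a device-side result container of size 1.
float dot_on_device(blas::SB_Handle &sb_handle, sycl::queue &q,
                    float *d_x, float *d_y, int n) {
  float *d_res = sycl::malloc_device<float>(1, q);
  auto events = blas::_dot(sb_handle, n, d_x, 1, d_y, 1, d_res);
  for (auto &e : events) e.wait();  // make sure the reduction has finished

  float result = 0.0f;
  q.copy(d_res, &result, 1).wait();
  sycl::free(d_res, q);
  return result;
}
```
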
### BLAS 2

The following table sums up the interface that can be found in
[blas2_interface.h](include/interface/blas2_interface.h).

For all these operations:

* `trans` is a `char` representing the transpose mode of the matrix: `'n'`,
  `'t'`, or `'c'`; respectively identity, transpose and Hermitian transpose
  *(note: the latter is not relevant yet, as complex numbers are not
  supported)*.
* `uplo` is a `char` that provides information about triangular matrices: `u`
  for upper triangular and `l` for lower triangular matrices.
* `diag` is a `char` that provides information about the diagonal elements of
  a triangular matrix: `u` if the matrix is unit triangular (all diagonal
  elements are 1), else `n`.
* `M` and `N` are the numbers of rows and columns of the matrix. They also
  determine the sizes of the vectors so that dimensions match, depending on
  the BLAS operation. For operations on square matrices, only `N` is given.
* `alpha` and `beta` are scalars.
* `mA` is a container for a column-major matrix `A`.
* `lda` is the leading dimension of `mA`, i.e. the step between an element and
  its neighbour in the next column and same row. `lda` must be at least `M`.
* `vx` and `vy` are containers for vectors `x` and `y`.
* `incx` and `incy` are their increments (cf. BLAS 1).
* `K` is the number of sub/super-diagonals of the matrix *(for `_gbmv`, `KL`
  and `KU` are the numbers of sub- and super-diagonals respectively)*.

| operation | arguments | description |
|---|---|---|
| `_gbmv` | `sb_handle`, `trans`, `M`, `N`, `KL`, `KU`, `alpha`, `mA`, `lda`, `vx`, `incx`, `beta`, `vy`, `incy` | Generalised band matrix-vector product followed by a vector sum: `y = alpha * A * x + beta * y`. *Note: the dimensions of the vectors depend on the transpose mode (`x`: `N` and `y`: `M` for mode `'n'`; `x`: `M` and `y`: `N` otherwise)* |
| `_gemv` | `sb_handle`, `trans`, `M`, `N`, `alpha`, `mA`, `lda`, `vx`, `incx`, `beta`, `vy`, `incy` | Generalised matrix-vector product followed by a vector sum: `y = alpha * A * x + beta * y`. *Note: the dimensions of the vectors depend on the transpose mode (`x`: `N` and `y`: `M` for mode `'n'`; `x`: `M` and `y`: `N` otherwise)* |
| `_ger` | `sb_handle`, `M`, `N`, `alpha`, `vx`, `incx`, `vy`, `incy`, `mA`, `lda` | Generalised vector-vector product followed by a matrix sum: `A = alpha * x * yT + A` |
| `_sbmv` | `sb_handle`, `uplo`, `N`, `K`, `alpha`, `mA`, `lda`, `vx`, `incx`, `beta`, `vy`, `incy` | Computes a scalar-matrix-vector product and adds the result to a scalar-vector product, with a symmetric band matrix: `y = alpha * A * x + beta * y` |
| `_spmv` | `sb_handle`, `uplo`, `N`, `alpha`, `mA`, `vx`, `incx`, `beta`, `vy`, `incy` | Symmetric packed matrix-vector product: `y = alpha * A * x + beta * y` |
| `_spr` | `sb_handle`, `uplo`, `N`, `alpha`, `vx`, `incx`, `mPA` | Symmetric vector-vector product followed by a matrix sum: `mPA = alpha * x * xT + mPA` |
| `_spr2` | `sb_handle`, `uplo`, `N`, `alpha`, `vx`, `incx`, `vy`, `incy`, `mPA` | Computes two scalar-vector-vector products and adds them to a symmetric packed matrix: `mPA = alpha * x * yT + alpha * y * xT + mPA` |
| `_symv` | `sb_handle`, `uplo`, `N`, `alpha`, `mA`, `lda`, `vx`, `incx`, `beta`, `vy`, `incy` | Variant of GEMV for a symmetric matrix (`y = alpha * A * x + beta * y`). *Note: `uplo` specifies which side of the matrix will be read* |
| `_syr` | `sb_handle`, `uplo`, `N`, `alpha`, `vx`, `incx`, `mA`, `lda` | Generalised vector squaring followed by a sum with a symmetric matrix: `A = alpha * x * xT + A` |
| `_syr2` | `sb_handle`, `uplo`, `N`, `alpha`, `vx`, `incx`, `vy`, `incy`, `mA`, `lda` | Generalised vector products followed by a sum with a symmetric matrix: `A = alpha * x * yT + alpha * y * xT + A` |
| `_tbmv` | `sb_handle`, `uplo`, `trans`, `diag`, `N`, `K`, `mA`, `lda`, `vx`, `incx` | Computes a matrix-vector product with a triangular band matrix: `x = A * x` |
| `_tbsv` | `sb_handle`, `uplo`, `trans`, `diag`, `N`, `K`, `mA`, `lda`, `vx`, `incx` | Solves a system of linear equations whose coefficients are in a triangular band matrix: `A * x = b` |
| `_tpmv` | `sb_handle`, `uplo`, `trans`, `diag`, `N`, `mA`, `vx`, `incx` | Triangular packed matrix-vector product: `x = A * x` |
| `_tpsv` | `sb_handle`, `uplo`, `trans`, `diag`, `N`, `mA`, `vx`, `incx` | Solves a system of linear equations whose coefficients are in a triangular packed matrix: `A * x = b` |
| `_trmv` | `sb_handle`, `uplo`, `trans`, `diag`, `N`, `mA`, `lda`, `vx`, `incx` | Matrix-vector product for a triangular matrix: `x = A * x` |
| `_trsv` | `sb_handle`, `uplo`, `trans`, `diag`, `N`, `mA`, `lda`, `vx`, `incx` | Solves a system of linear equations whose coefficients are in a triangular matrix: `A * x = b` |
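
As with the BLAS 1 operators, the BLAS 2 entry points follow the argument
order in the table above. Below is a minimal sketch of a non-transposed
`_gemv` call on device USM pointers (the tight `lda == M` layout and the
event waits are assumptions for the sake of the example):

```cpp
#include <sycl/sycl.hpp>
#include "portblas.hpp"

// Sketch: y = alpha * A * x + beta * y for a column-major M x N matrix A.
// With trans == 'n', x must hold N elements and y must hold M elements.
void gemv_n(blas::SB_Handle &sb_handle, float *d_A, float *d_x, float *d_y,
            int m, int n) {
  const float alpha = 1.0f;
  const float beta = 0.0f;
  const int lda = m;  // leading dimension of A; must be >= M
  auto events =
      blas::_gemv(sb_handle, 'n', m, n, alpha, d_A, lda, d_x, 1, beta, d_y, 1);
  for (auto &e : events) e.wait();
}
```
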
### BLAS 3

The following table sums up the interface that can be found in
[blas3_interface.h](include/interface/blas3_interface.h).

For all these operations:

* `mA`, `mB` and `mC` are containers for the column-major matrices A, B and C.
* `lda`, `ldb` and `ldc` are the leading dimensions of the matrices A, B and C
  (cf. BLAS 2). The leading dimension of a matrix must be greater than or
  equal to its number of rows.
* `transa` and `transb` are the transpose modes of the matrices A and B
  (cf. BLAS 2).
* `M`, `N` and `K` are the dimensions of the matrices. The dimensions
  **after transposition** are A: `M`x`K`, B: `K`x`N`, C: `M`x`N`.
* `alpha` and `beta` are scalars.
* `batch_size` is an integer.
* `side` is `l` for left or `r` for right.
* `uplo` is a `char` that provides information about triangular matrices: `u`
  for upper triangular and `l` for lower triangular matrices.
* `diag` is a `char` that provides information about the diagonal elements of
  a triangular matrix: `u` if the matrix is unit triangular *(all diagonal
  elements are 1)*, else `n`.
* `stride_a`, `stride_b` and `stride_c` are the stride sizes between
  consecutive matrices in a batched-strided entry for the inputs/outputs.
* `batch_type` for `_gemm_batched` is either `strided` *(the default)* or
  `interleaved` *(more details in [Gemm.md](doc/Gemm.md))*.
| operation | arguments | description |
|---|---|---|
| `_gemm` | `sb_handle`, `transa`, `transb`, `M`, `N`, `K`, `alpha`, `mA`, `lda`, `mB`, `ldb`, `beta`, `mC`, `ldc` | Generalised matrix-matrix multiplication followed by matrix addition: `C = alpha * A * B + beta * C` |
| `_gemm_batched` | `sb_handle`, `transa`, `transb`, `M`, `N`, `K`, `alpha`, `mA`, `lda`, `mB`, `ldb`, `beta`, `mC`, `ldc`, `batch_size`, `batch_type` | Same as `_gemm`, but the containers hold `batch_size` end-to-end matrices. GEMM operations are performed independently on matching matrices. |
| `_gemm_strided_batched` | `sb_handle`, `transa`, `transb`, `M`, `N`, `K`, `alpha`, `mA`, `lda`, `stride_a`, `mB`, `ldb`, `stride_b`, `beta`, `mC`, `ldc`, `stride_c`, `batch_size` | Same as `_gemm`, but the containers hold `batch_size` end-to-end matrices. GEMM operations are performed independently on matching matrices. |
| `_symm` | `sb_handle`, `side`, `uplo`, `M`, `N`, `alpha`, `mA`, `lda`, `mB`, `ldb`, `beta`, `mC`, `ldc` | Computes a scalar-matrix-matrix product and adds the result to a scalar-matrix product, where one of the matrices in the multiplication is symmetric. |
| `_trsm` | `sb_handle`, `side`, `uplo`, `trans`, `diag`, `M`, `N`, `alpha`, `mA`, `lda`, `mB`, `ldb` | Triangular solve with multiple right-hand sides. |
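
A minimal `_gemm` sketch following the table above (column-major, no
transposition; the tight leading dimensions and the event waits are
assumptions for the sake of the example):

```cpp
#include <sycl/sycl.hpp>
#include "portblas.hpp"

// Sketch: C = alpha * A * B + beta * C with A: MxK, B: KxN, C: MxN,
// all column-major and untransposed ('n', 'n').
void gemm_nn(blas::SB_Handle &sb_handle, float *d_A, float *d_B, float *d_C,
             int m, int n, int k, float alpha, float beta) {
  const int lda = m;  // >= number of rows of A
  const int ldb = k;  // >= number of rows of B
  const int ldc = m;  // >= number of rows of C
  auto events = blas::_gemm(sb_handle, 'n', 'n', m, n, k, alpha, d_A, lda,
                            d_B, ldb, beta, d_C, ldc);
  for (auto &e : events) e.wait();
}
```
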
### EXTENSION

The following table sums up the interface that can be found in
[extension_interface.h](include/interface/extension_interface.h).

For all these operations:

* `A`, `B` and `C` are containers for the column-major matrices A, B and C.
* `lda`, `ldb` and `ldc` are the leading dimensions of the matrices A, B and C
  (cf. BLAS 2). The leading dimension of a matrix must be greater than or
  equal to its number of rows. In the case of in-place copy/transpose, the
  same matrix `A` is used with two different leading dimensions for input
  & output.
* `stride_a`, `stride_b` and `stride_c` are the stride sizes between
  consecutive matrices in a batched entry for the inputs/outputs.
* `inc_a` and `inc_b` are the jump counts between consecutive elements in the
  A & B matrices.
* `transa` and `transb` are the transpose modes of the matrices A and B
  (cf. BLAS 2).
* `M` and `N` are the dimensions of the matrices (rows and columns
  respectively).
* `alpha` and `beta` are scaling scalars.
* `batch_size` is the number of matrices in a batch.

| operation | arguments | description |
|---|---|---|
| `_axpy_batch` | `sb_handle`, `N`, `alpha`, `vx`, `incx`, `stride_x`, `vy`, `incy`, `stride_y`, `batch_size` | Performs multiple axpy operations in a batch |
| `_omatcopy` | `sb_handle`, `transa`, `M`, `N`, `alpha`, `A`, `lda`, `B`, `ldb` | Performs an out-of-place scaled matrix transpose or copy operation using a general dense matrix. |
| `_omatcopy2` | `sb_handle`, `transa`, `M`, `N`, `alpha`, `A`, `lda`, `inc_a`, `B`, `ldb`, `inc_b` | Computes two-strided scaling and out-of-place transposition or copying of general dense matrices. |
| `_omatadd` | `sb_handle`, `transa`, `transb`, `M`, `N`, `alpha`, `A`, `lda`, `beta`, `B`, `ldb`, `C`, `ldc` | Computes scaled general dense matrix addition with possibly transposed arguments. |
| `_omatcopy_batch` | `sb_handle`, `transa`, `M`, `N`, `alpha`, `A`, `lda`, `stride_a`, `B`, `ldb`, `stride_b`, `batch_size` | Performs an out-of-place scaled batched-strided matrix transpose or copy operation using a general dense matrix. |
| `_imatcopy_batch` | `sb_handle`, `transa`, `M`, `N`, `alpha`, `A`, `lda`, `ldb`, `stride`, `batch_size` | Performs an in-place scaled batched-strided matrix transpose* or copy operation using a general dense matrix *(\*: currently the transpose case is not supported)*. |
| `_omatadd_batch` | `sb_handle`, `transa`, `transb`, `M`, `N`, `alpha`, `A`, `lda`, `stride_a`, `beta`, `B`, `ldb`, `stride_b`, `C`, `ldc`, `stride_c`, `batch_size` | Computes a batch of scaled general dense matrix additions with optionally transposed arguments. |

Other non-official extension operators:

| operation | arguments | description |
|---|---|---|
| `_transpose` | `sb_handle`, `M`, `N`, `A`, `lda`, `B`, `ldb` | Computes an out-of-place matrix transpose operation using a general dense matrix. |
| `_transpose*` | `sb_handle`, `M`, `N`, `A`, `lda`, `ldb` | Computes an in-place matrix transpose operation using a general dense matrix, `lda` & `ldb` being the input and output leading dimensions of A respectively *(\*not implemented)*. |
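
For instance, an out-of-place scaled transpose with `_omatcopy` could look
like the following sketch (the `'t'` transpose mode and the leading-dimension
choices are assumptions based on the column-major conventions above):

```cpp
#include <sycl/sycl.hpp>
#include "portblas.hpp"

// Sketch: B = alpha * A^T, where A is MxN column-major, so B is NxM.
void scaled_transpose(blas::SB_Handle &sb_handle, float *d_A, float *d_B,
                      int m, int n, float alpha) {
  const int lda = m;  // leading dimension of A (>= M)
  const int ldb = n;  // leading dimension of B; B has N rows after transpose
  auto events =
      blas::_omatcopy(sb_handle, 't', m, n, alpha, d_A, lda, d_B, ldb);
  for (auto &e : events) e.wait();
}
```
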
### Experimental Joint Matrix Support

portBLAS now supports a sub-group based collective GEMM operation using the
experimental
[`joint_matrix`](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc)
extension provided by DPC++. This support is only available on NVIDIA Ampere
GPUs and newer. The requirements for using this experimental support are:
```bash
DPCPP_SYCL_TARGET = "nvptx64-nvidia-cuda"
DPCPP_SYCL_ARCH = "sm_80" | "sm_90"
```
To invoke the `joint_matrix` based GEMM, you need to set the following
environment variable:
```bash
export SB_ENABLE_JOINT_MATRIX=1
```
Expect erroneous behaviour from the code if either of these requirements is
not met.

## Requirements

portBLAS is designed to work with any SYCL implementation.
We do not use any OpenCL interoperability; hence, the code is pure C++.
The project is developed using the [open source DPC++](https://github.com/intel/llvm)
compiler or a [oneAPI release](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html#gs.2iaved),
using Ubuntu 22.04 on Intel OpenCL CPU, Intel GPU, NVIDIA GPU and AMD GPU.
The build system requires CMake version 3.4.3 or higher.

A BLAS library, such as [OpenBLAS](https://github.com/xianyi/OpenBLAS), is also
required to build and verify the test results.
Instructions for building and installing OpenBLAS can be found
[on this page](https://github.com/xianyi/OpenBLAS/wiki/User-Manual).
Please note that although some distributions may provide packages for OpenBLAS,
these versions are typically quite old and may have issues with the TRMV
implementation which can cause random test failures. Any version of OpenBLAS
`>= 0.3.0` does not suffer from these issues.

When using OpenBLAS or any other BLAS library, the installation directory must
be added to the `CMAKE_PREFIX_PATH` when building portBLAS (see
[below](#cmake-options)).

## Setup

**IMPORTANT NOTE:** The `TARGET` CMake variable is no longer supported. It has
been replaced by `TUNING_TARGET`, which accepts the same options.
`TUNING_TARGET` affects only the tuning configuration and has no effect on the
target triplet for DPC++ or the AdaptiveCpp/hipSYCL target. Please refer to
the sections below for setting them.

1. Clone the portBLAS repository, making sure to pass the `--recursive`
   option, in order to clone the submodule(s).
2. Create a build directory.
3. Run `CMake` from the build directory *(see options in the section below)*:

### Compile with DPC++
```bash
export CXX=[path/to/intel/icpx]
cd build
cmake -GNinja ../ -DSYCL_COMPILER=dpcpp
ninja
```
The target triplet can be set by adding `-DDPCPP_SYCL_TARGET=<triplet>`. If it
is not set, the default value is `spir64`, which compiles for generic SPIR-V
targets.

Other possible triplets are `nvptx64-nvidia-cuda` and `amdgcn-amd-amdhsa`, for
compiling for NVIDIA and AMD GPUs respectively. In this case, it is advisable
for NVIDIA, and **mandatory for AMD**, to provide the specific device
architecture through `-DDPCPP_SYCL_ARCH=<arch>`; e.g., `<arch>` can be `sm_80`
for NVIDIA or `gfx908` for AMD.

It is possible to use the `DEFAULT` tuning target even for AMD and NVIDIA
GPUs, but defining `-DDPCPP_SYCL_TARGET` and `-DDPCPP_SYCL_ARCH` is mandatory.
The rules mentioned above also apply in this case.
Using `DEFAULT` as the tuning target will speed up compilation at the expense
of runtime performance. Additionally, some operators will be disabled.
For full compatibility and best performance, set the `TUNING_TARGET`
appropriately.

#### DPC++ Compiler Support

The project is fully compatible with the `icpx` DPC++ compiler provided by the
Intel [oneAPI base-toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html#gs.7t6x52),
which is the recommended compiler.
portBLAS can also be compiled with the
[open source intel/llvm](https://github.com/intel/llvm) compiler, but not all
of the latest changes are tested.

### Compile with AdaptiveCpp *(Formerly hipSYCL)*
The following instructions concern the **generic** *(clang-based)* flow
supported by AdaptiveCpp.

```bash
cd build
export CC=[path/to/system/clang]
export CXX=[path/to/AdaptiveCpp/install/bin/acpp]
export ACPP_TARGETS=[compilation_flow:target] # (e.g. cuda:sm_75)
cmake -GNinja ../ -DAdaptiveCpp_DIR=/path/to/AdaptiveCpp/install/lib/cmake/AdaptiveCpp \
      -DSYCL_COMPILER=adaptivecpp -DACPP_TARGETS=$ACPP_TARGETS
ninja
```
To build for a backend other than the default one *(host CPU through `omp`)*,
set the `ACPP_TARGETS` environment variable or specify `-DACPP_TARGETS` as
[documented](https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/using-hipsycl.md).
The available backends are the ones AdaptiveCpp was built with in the first
place.

Similarly to DPC++'s `sycl-ls`, AdaptiveCpp's `acpp-info` helps display
information about the available backends. When AdaptiveCpp is built against
LLVM *(generic flow)*, the `llvm-to-xxx.so` library files must be visible to
the runtime in order to target the appropriate device, which can be ensured by
setting the following environment variables:

```bash
export LD_LIBRARY_PATH=[path/to/AdaptiveCpp/install/lib/hipSYCL:$LD_LIBRARY_PATH]
export LD_LIBRARY_PATH=[path/to/AdaptiveCpp/install/lib/hipSYCL/llvm-to-backend:$LD_LIBRARY_PATH]
```

*Notes:*
- Some operator kernels are implemented using extensions / SYCL 2020 features
not yet implemented in AdaptiveCpp, and are not supported when portBLAS is
built with it. These operators include `asum`, `nrm2`, `dot`, `sdsdot`, `rot`,
`trsv`, `tbsv` and `tpsv`.
- The default `omp` host CPU backend *(as well as its optimized variant
`omp.accelerated`)* has not been fully integrated into the library and
currently causes some tests to fail *(interleaved batched gemm in
particular)*. It is thus advised to use the LLVM/OpenCL generic flow when
targeting CPUs.

### Installing portBLAS
To install the portBLAS library (see `CMAKE_INSTALL_PREFIX` below):

```bash
ninja install
```

### Doxygen

Doxygen documentation can be generated by running:

```bash
doxygen doc/Doxyfile
```

### CMake options

CMake options are given using `-D` immediately followed by the option name,
the symbol `=` and a value (`ON` and `OFF` can be used for boolean options and
are equivalent to 1 and 0). Example: `-DBLAS_ENABLE_TESTING=OFF`

Some of the supported options are:

| name | value | description |
|---|---|---|
| `BLAS_ENABLE_TESTING` | `ON`/`OFF` | Set it to `OFF` to avoid building the tests (`ON` is the default value) |
| `BLAS_ENABLE_BENCHMARK` | `ON`/`OFF` | Set it to `OFF` to avoid building the benchmarks (`ON` is the default value) |
| `SYCL_COMPILER` | name | Used to determine which SYCL implementation to use. By default, the first implementation found is used. Supported values are: `dpcpp` and `adaptivecpp`. |
| `TUNING_TARGET` | name | By default, this flag is set to `DEFAULT`, which restricts device-specific compiler optimizations. Use this flag to tune the code for a target (**highly recommended** for performance). The supported targets are: `INTEL_GPU`, `NVIDIA_GPU`, `AMD_GPU` |
| `CMAKE_PREFIX_PATH` | path | List of paths to check when searching for dependencies |
| `CMAKE_INSTALL_PREFIX` | path | Specify the install location, used when invoking `ninja install` |
| `BUILD_SHARED_LIBS` | `ON`/`OFF` | Build as a shared library (`ON` by default) |
| `ENABLE_EXPRESSION_TESTS` | `ON`/`OFF` | Build additional tests that use the header-only framework (e.g. to test expression trees); `OFF` by default |
| `ENABLE_JOINTMATRIX_TESTS` | `ON`/`OFF` | Build additional tests that use the joint_matrix extension; `OFF` by default |
| `BLAS_VERIFY_BENCHMARK` | `ON`/`OFF` | Verify the results of the benchmarks instead of only measuring the performance. See the documentation of the benchmarks for more details. `ON` by default |
| `BLAS_MEMPOOL_BENCHMARK` | `ON`/`OFF` | Determines whether to enable the scratchpad memory pool for benchmark execution. `OFF` by default |
| `BLAS_ENABLE_CONST_INPUT` | `ON`/`OFF` | Determines whether to enable kernel instantiation with const input buffers (`ON` by default) |
| `BLAS_ENABLE_EXTENSIONS` | `ON`/`OFF` | Determines whether to enable portBLAS extensions (`ON` by default) |
| `BLAS_DATA_TYPES` | `float;double` | Determines the floating-point types to instantiate BLAS operations for. Default is `float`. Enabling other types such as complex or half requires setting their respective options *(below)*. |
| `BLAS_ENABLE_COMPLEX` | `ON`/`OFF` | Determines whether to enable complex data type support *(GEMM operators only)* (`OFF` by default) |
| `BLAS_ENABLE_HALF` | `ON`/`OFF` | Determines whether to enable half data type support *(support is limited to some BLAS 1 operators and GEMM)* (`OFF` by default) |
| `BLAS_INDEX_TYPES` | `int32_t;int64_t` | Determines the type(s) to use for `index_t` and `increment_t`. Default is `int` |

## Tests and benchmarks

The tests and benchmarks have their own documentation:

- [Documentation of the tests](test/README.md)
- [Documentation of the benchmarks](benchmark/README.md)

## Contributing to the project

portBLAS is an Open Source project maintained by the HPCA group and
Codeplay Software Ltd.
Feel free to create an issue on the GitHub tracker to request features or
report bugs.

### Guides and Other Documents

- [How to add a new operation](doc/AddingBlas3Op.md)
- [Autotuner Developer Guide](doc/Autotuner.md)
- [Missing Features](doc/MissingFeatures.md)