{"id":27011659,"url":"https://github.com/eth-cscs/costa","last_synced_at":"2025-04-04T11:36:24.723Z","repository":{"id":41354085,"uuid":"308129692","full_name":"eth-cscs/COSTA","owner":"eth-cscs","description":"Distributed Communication-Optimal Shuffle and Transpose Algorithm","archived":false,"fork":false,"pushed_at":"2023-12-22T10:32:28.000Z","size":1318,"stargazers_count":11,"open_issues_count":2,"forks_count":2,"subscribers_count":9,"default_branch":"master","last_synced_at":"2024-05-01T18:15:44.002Z","etag":null,"topics":["distributed","hpc","mpi","openmp","pdgemr2d","pdtran","pztranc","pztranu","redistribute","scalapack","transpose"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/eth-cscs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2020-10-28T20:11:41.000Z","updated_at":"2023-10-05T03:19:18.000Z","dependencies_parsed_at":"2023-12-30T15:48:53.045Z","dependency_job_id":null,"html_url":"https://github.com/eth-cscs/COSTA","commit_stats":{"total_commits":124,"total_committers":8,"mean_commits":15.5,"dds":0.2661290322580645,"last_synced_commit":"bb84528d023db9a6b00ad729fb44b8c3cef8c981"},"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eth-cscs%2FCOSTA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eth-cscs%2FCOSTA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eth-cscs%2FCOSTA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/repositories/eth-cscs%2FCOSTA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/eth-cscs","download_url":"https://codeload.github.com/eth-cscs/COSTA/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247173256,"owners_count":20896053,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["distributed","hpc","mpi","openmp","pdgemr2d","pdtran","pztranc","pztranu","redistribute","scalapack","transpose"],"created_at":"2025-04-04T11:36:24.049Z","updated_at":"2025-04-04T11:36:24.715Z","avatar_url":"https://github.com/eth-cscs.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\u003cimg src=\"./docs/costa-logo.svg\" width=\"55%\"\u003e\u003c/p\u003e\n\n## Table of Contents\n- [Overview](#overview)\n- [Publication](#publication)\n- [Features](#features)\n- [Installing in 30 seconds](#installing-in-30-seconds)\n- [Examples](#examples)\n    - [Block-cyclic (Scalapack) Layout](#block-cyclic-scalapack-layout)\n    - [Custom (Arbitrary) Layout](#custom-arbitrary-layout)\n    - [Initializing Layouts](#initializing-layouts)\n    - [Transforming Matrix Layouts](#transforming-matrix-layouts)\n    - [Scalapack Wrappers](#scalapack-wrappers)\n- [Advanced Features](#advanced-features)\n    - [Transforming Multiple Layouts](#transforming-multiple-layouts)\n    - [Achieving Communication-Optimality](#achieving-communication-optimality)\n- [Performance Results](#performance-results)\n- [COSTA in Production](#costa-in-production)\n- [Miniapps 
(for testing and benchmarking)](#miniapps-for-testing-and-benchmarking)\n    - [Data-redistribution with pxgemr2d](#data-redistribution-with-pxgemr2d)\n    - [Scale and Transpose with pxtran and pxtranu](#scale-and-transpose-with-pxtran-and-pxtranu)\n    - [Communication Volume Reduction](#communication-volume-reduction)\n- [Questions?](#questions)\n- [Acknowledgements](#acknowledgements)\n\n## Overview\n\nCOSTA is a communication-optimal, highly-optimised algorithm for data redistribution across multiple processors, using `MPI` and `OpenMP` and offering the possibility to transpose and scale some or all data. It implements scalapack routines for matrix scale \u0026 transpose operations (`sub(C) = alpha * sub(A)^T + beta * sub(C)`, provided by `pxtran(u)`) and data redistribution (`sub(C) = sub(A)`, provided by `pxgemr2d`) and outperforms other scalapack implementations by orders of magnitude in some cases. Unlike previous redistribution algorithms, COSTA will also propose the relabelling of MPI ranks that minimizes the data reshuffling cost, leaving it to users to decide whether they want to use it. This way, if the initial and the target data distributions differ only by a rank permutation, COSTA will perform no communication, whereas other algorithms will reshuffle all the data. Thanks to its optimizations, significant speedups will be achieved even if the proposed rank relabelling is not used.\n\nWhat makes COSTA more general than scalapack routines is that it is not limited to block-cyclic data distributions, but can deal with completely arbitrary and irregular matrix distributions and can be easily generalized to n-dimensional tensors. 
\n\nThanks to its scalapack wrappers, scalapack users do not need to change their code in order to use COSTA: it is enough to link your library to COSTA before linking to scalapack and all `pxtran`, `pxtranu` and `pxgemr2d` routines will automatically use the COSTA algorithm.\n\n## Publication\n\nThis work is published in the **Proceedings of the International Conference on High Performance Computing (ISC21)** and is available under the following links:\n- **published version:** https://link.springer.com/chapter/10.1007/978-3-030-78713-4_12\n- **arxiv preprint:** https://arxiv.org/abs/2106.06601\n\nIt can be cited as:\n```\n@InProceedings{costa_algorithm_2021,\n    author=\"Kabi{\\'{c}}, Marko and Pintarelli, Simon and Kozhevnikov, Anton and VandeVondele, Joost\",\n    editor=\"Chamberlain, Bradford L. and Varbanescu, Ana-Lucia and Ltaief, Hatem and Luszczek, Piotr\",\n    title=\"COSTA: Communication-Optimal Shuffle and Transpose Algorithm with Process Relabeling\",\n    booktitle=\"High Performance Computing\",\n    year=\"2021\",\n    publisher=\"Springer International Publishing\",\n    address=\"Cham\",\n    pages=\"217--236\",\n    isbn=\"978-3-030-78713-4\"\n}\n```\n\n## Features\n\nCOSTA has the following features:\n- **scale, transpose \\\u0026 reshuffle:** apart from redistribution, can also transpose, scale and sum initial and final layouts:\n```\nsub(B) = beta * sub(B) + alpha * sub(op(A)) ; op=N, T or C; sub = submatrix \n```\n- **Arbitrary Layouts:** COSTA is not limited to block-cyclic matrix layouts and can handle completely irregular and arbitrary matrix distributions.\n- **Multiple Layouts:** can transform multiple layouts at once (in the same communication round).\n- **Highly-optimized:** it is highly-optimized in distributed and multithreaded settings.\n- **Communication-Optimal:** proposes (but does not enforce) the optimal rank relabelling to minimize communication.\n- **SCALAPACK wrappers:** provides scalapack wrappers for `pxgemr2d` and 
`pxtran(u)`.\n- **Well Documented:** detailed documentation is provided in this README.\n\n## Installing in 30 seconds\n\nPlease refer to [INSTALL.md](INSTALL.md).\n\n## Examples\n\n### Block-cyclic (Scalapack) Layout\n\nTo represent an arbitrary block-cyclic (scalapack) layout, we can use the following function defined in the `costa/layout.hpp` header:\n```cpp\n#include \u003ccosta/layout.hpp\u003e\n// ...\ntemplate \u003ctypename T\u003e\ngrid_layout\u003cT\u003e costa::block_cyclic_layout\u003cdouble\u003e(\n                   const int m, const int n,         // global matrix dimensions\n                   const int b_m, const int b_n,     // block dimensions\n                   const int i, const int j,         // submatrix start\n                                                     // (1-based, scalapack-compatible)\n                   const int sub_m, const int sub_n, // submatrix size\n                   const int p_m, const int p_n,     // processor grid dimension\n                   const char order,                 // rank grid ordering ('R' or 'C')\n                   const int rsrc, const int csrc,   // coordinates of ranks owning \n                                                     // the first row (0-based)\n                   T* ptr,                           // local data of matrix A \n                                                     // (not the submatrix)\n                   const int lld,                    // local leading dimension\n                   const char data_ordering,         // 'R' or 'C' depending on whether \n                                                     // each local block\n                                                     // is given in row- or col-major\n                                                     // ordering\n                   const int rank                    // processor rank\n               );\n```\nThe arguments can be nicely visualized with the following figure, where the red submatrix is 
represented:\n\u003cp align=\"center\"\u003e\u003cimg src=\"./docs/block-cyclic.svg\" width=\"100%\"\u003e\u003c/p\u003e\n\nIn case we want to represent the full matrix (instead of a submatrix), it suffices to put:\n```cpp\n// start of the submatrix is the start of the full matrix\nint i = 1; int j = 1; // 1-based due to scalapack-compatibility\n// size of the submatrix is the size of the full matrix\nint sub_m = m; int sub_n = n;\n```\n\nFor a complete example of transforming between two block-cyclic matrix layouts, please refer to `examples/example0.cpp`.\n\n### Custom (Arbitrary) Layout\n\nTo represent an arbitrary grid-like layout, we can use the following function defined in the `costa/layout.hpp` header:\n```cpp\n#include \u003ccosta/layout.hpp\u003e\n// ...\ntemplate \u003ctypename T\u003e\ngrid_layout\u003cT\u003e costa::custom_layout(\n                   int rowblocks,           // number of global block rows (N_rb)\n                   int colblocks,           // number of global block columns (N_cb)\n                   int* rowsplit,           // [rowsplit[i], rowsplit[i+1]) is range of rows of block i\n                   int* colsplit,           // [colsplit[i], colsplit[i+1]) is range of columns of block i\n                   int* owners,             // owners[i][j] is the rank owning block (i,j). 
\n                                            // Owners are given in row-major order as assumed by C++.\n                   int nlocalblocks,        // number of blocks owned by the current rank (N_lb)\n                   block_t* localblocks,    // an array of block descriptions for the current rank\n                   const char data_ordering // 'R' or 'C' depending on whether each\n                                            // local block is given in row- or col-major\n                                            // order\n               );\n```\nwhere `block_t` is a simple struct defined in the same header:\n```cpp\n// each local block is assumed to be stored in col-major order\nstruct costa::block_t {\n    void *data; // a pointer to the start of the local matrix\n    int ld;     // leading dimension or distance between two consecutive local columns\n    int row;    // the global block row index\n    int col;    // the global block column index\n};\n```\n\nThe arguments can be nicely visualized with the following figure:\n\u003cp align=\"center\"\u003e\u003cimg src=\"./docs/custom-layout.svg\" width=\"90%\"\u003e\u003c/p\u003e\n\nFor a complete example of transforming between a block-cyclic and a custom matrix layout, please refer to `examples/example1.cpp`.\n\n### Initializing Layouts\n\nOnce the layouts are created as previously described, we can initialize them by providing a simple lambda function that maps global element coordinates `(i,j)` to the value to which the element should be initialized:\n\n```cpp\n// some user-defined layout\ngrid_layout\u003cdouble\u003e layout; \n\n// function f(i, j) := value of element (i, j) in the global matrix\n// an arbitrary function\nauto f = [](int i, int j) -\u003e double {\n    return i + j; \n};\n\n// initialize it\nlayout.initialize(f);\n```\n\nIn exactly the same way, we can check if the elements in the layout are equal to the values provided by the lambda function:\n```cpp\n// check if the values in the final 
layout correspond to function f\n// that was used for the initialization of the initial layout\nbool ok = layout.validate(f, 1e-12); // the second argument is the tolerance\n```\n\n### Transforming Matrix Layouts\n\nOnce the layouts are created as previously described, we can transform between two layouts in the following way (defined in header `\u003ccosta/grid2grid/transform.hpp\u003e`):\n\n- Redistribute with optional scaling and/or transpose. Performs `B = beta * B + alpha * op(A)`, where `op` can be transpose, conjugate or none (i.e. identity).\n```cpp\n#include \u003ccosta/grid2grid/transform.hpp\u003e\n// ...\n// redistribute a single layout with scaling:\n// final_layout = beta * final_layout + alpha * op(initial_layout)\ntemplate \u003ctypename T\u003e\nvoid transform(grid_layout\u003cT\u003e \u0026initial_layout, // initial layout, A\n               grid_layout\u003cT\u003e \u0026final_layout,   // final layout, B\n               const char trans,               // defines operation on A, op(A), can be:\n                                               // 'N' for none, i.e. identity, \n                                               // 'T' for transpose\n                                               // 'C' for conjugate\n               const T alpha, const T beta,    // defines the scaling parameters alpha and beta\n               MPI_Comm comm);                 // MPI communicator containing at least \n                                               // a union of communicators of both (initial and final) layouts\n```\n\nObserve that the two matrices do not necessarily have to be distributed over the same communicator. However, the communicator passed to this function must contain at least the union of the communicators over which the initial and final matrices are distributed.\n\nFor complete examples please refer to `examples`.\n\n### Scalapack Wrappers\n\nIf installed with cmake option `COSTA_SCALAPACK` (e.g. 
with `cmake -DCOSTA_SCALAPACK=MKL ..`, which can also have values `CRAY_LIBSCI` or `CUSTOM`), the scalapack wrappers will also be available for the `pxgemr2d` (redistribute), `pxtran(u)` (transpose) and `pxtran(c)` (conjugate-transpose) routines. \n\nFor this purpose, the following libraries are available:\n- **`costa_scalapack`:** implements the following scalapack routines:\n    \u003e `pdgemr2d`, `psgemr2d`, `pcgemr2d`, `pzgemr2d`\n\n    \u003e `pstran`, `pdtran`, `pctranu`, `pztranu`, `pctranc`, `pztranc`\n\n    In this case, it is enough to link your library to **`costa_scalapack`** before linking to scalapack and these functions will be overridden by the COSTA implementation. Therefore, if your code is already using scalapack, there is no need to change it: just linking is enough!\n\n- **`prefixed_costa_scalapack`:** implements the following routines (same as above, but with a `costa_` prefix):\n    \u003e `costa_pdgemr2d`, `costa_psgemr2d`, `costa_pcgemr2d`, `costa_pzgemr2d`\n\n    \u003e `costa_pstran`, `costa_pdtran`, `costa_pctranu`, `costa_pztranu`, `costa_pctranc`, `costa_pztranc`\n\n    This way, you can keep the scalapack implementation and at the same time have the COSTA implementation available as well!\n\n## Advanced Features\n\n### Transforming Multiple Layouts\n\nIf multiple layouts should be transformed, COSTA is able to transform all of them at once, in the same communication round! 
This can be done using the `transformer` class defined in `\u003ccosta/grid2grid/transformer.hpp\u003e`, as illustrated below:\n\n```cpp\n#include \u003ccosta/grid2grid/transformer.hpp\u003e\n#include \u003ccosta/layout.hpp\u003e\n// ...\n// a user-defined MPI communicator\nMPI_Comm comm = MPI_COMM_WORLD;\n\n// *******************************\n// user-defined layouts\n// *******************************\ngrid_layout\u003cdouble\u003e A1, B1;\ngrid_layout\u003cdouble\u003e A2, B2;\n\n// *******************************\n// transforming A1-\u003eB1 and A2-\u003eB2\n// *******************************\nchar trans = 'N'; // do not transpose\ndouble alpha = 1.0; // do not scale initial layouts\ndouble beta = 0.0; // ignore previous values of final layouts (beta = 0)\n\n// create the transformer class\ncosta::transformer\u003cdouble\u003e transf(comm);\n\n// schedule A1-\u003eB1\ntransf.schedule(A1, B1, trans, alpha, beta);\n\n// schedule A2-\u003eB2\ntransf.schedule(A2, B2, trans, alpha, beta);\n\n// trigger the transformations\ntransf.transform();\n```\n\nThis is more efficient than transforming each of those separately, because all layouts are transformed within the same communication round. However, it might use more memory because the messages might be larger. \n\n### Achieving Communication-Optimality\n\nIn order to achieve communication-optimality, we need to use the rank relabelling, which is described step-by-step below.\n\nSo far, we have only been using `costa::grid_layout\u003cT\u003e` objects and we showed how we can transform between different layouts. This object contains two important pieces of information: the global matrix grid and the local blocks for the current rank. The global matrix grid object is called `costa::assigned_grid2D` and is a simpler object than the layout, since it does not contain any information about the local data. 
For illustration purposes, we could write: `layout = grid + local_data`, or translated to classes, we could write: `grid_layout\u003cT\u003e = assigned_grid2D + local_data\u003cT\u003e`.\n\nThe global matrix grid (`costa::assigned_grid2D`) can be created in the same way as the layout object; we only need to exclude the information about the local data:\n- block-cyclic grid can be created using the function:\n```cpp\n#include \u003ccosta/layout.hpp\u003e\n// ...\ntemplate \u003ctypename T\u003e\nassigned_grid2D costa::block_cyclic_grid\u003cdouble\u003e(\n                    const int m, const int n,         // global matrix dimensions\n                    const int b_m, const int b_n,     // block dimensions\n                    const int i, const int j,         // submatrix start\n                                                      // (1-based, scalapack-compatible)\n                    const int sub_m, const int sub_n, // submatrix size\n                    const int p_m, const int p_n,     // processor grid dimension\n                    const char order,                 // rank grid ordering ('R' or 'C')\n                    const int rsrc, const int csrc    // coordinates of ranks owning \n                                                      // the first row (0-based)\n                );\n```\nObserve that this is the same as the `block_cyclic_layout` function, where the trailing parameters describing the local data (`ptr`, `lld`, `data_ordering` and `rank`) are omitted. 
\n\n- custom grid can be created using the function:\n```cpp\n#include \u003ccosta/layout.hpp\u003e\n// ...\n// contains only the global grid, without local data\ntemplate \u003ctypename T\u003e\nassigned_grid2D costa::custom_grid(\n                    int rowblocks, // number of global block rows (N_rb)\n                    int colblocks, // number of global block columns (N_cb)\n                    int* rowsplit, // [rowsplit[i], rowsplit[i+1]) is range of rows of block i\n                    int* colsplit, // [colsplit[i], colsplit[i+1]) is range of columns of block i\n                    int* owners    // owners[i][j] is the rank owning block (i,j). \n                                   // Owners are given in row-major order as assumed by C++.\n               );\n```\nObserve that this is the same as the `custom_layout` function, where the trailing parameters describing the local data (`nlocalblocks`, `localblocks` and `data_ordering`) are omitted.\n\nIn order to propose the communication-optimal rank relabelling, COSTA first has to analyse the global grids in all transformations we want to perform. Therefore, the first step is to create the grid objects. \n\nAssume we want to transform `A1-\u003eB1` and `A2-\u003eB2`. 
In the first step, we create the grid objects:\n\n```cpp\n#include \u003ccosta/layout.hpp\u003e\n\n// create grids (arbitrary, user-defined)\nauto A1_grid = costa::custom_grid(...);\nauto B1_grid = costa::block_cyclic_grid(...);\n\nauto A2_grid = costa::block_cyclic_grid(...);\nauto B2_grid = costa::custom_grid(...);\n```\nNow we want COSTA to analyse these grids by computing the necessary communication volume:\n```cpp\n// compute the comm volume for A1-\u003eB1\nauto comm_vol = costa::communication_volume(A1_grid, B1_grid);\n\n// add the comm volume for A2-\u003eB2\ncomm_vol += costa::communication_volume(A2_grid, B2_grid);\n```\n\nNext, we can get the optimal rank relabelling:\n```cpp\n#include \u003ccosta/grid2grid/ranks_reordering.hpp\u003e\n// ...\nbool reordered = false;\n// input parameters:\n// - comm_vol := communication volume object, created previously\n// - P := communicator size\n// output parameters:\n// - rank_relabelling: ranks permutation yielding communication-optimal transformation\n// - reordered: if true, the returned rank relabelling is not the identity permutation\nstd::vector\u003cint\u003e rank_relabelling = costa::optimal_reordering(comm_vol, P, \u0026reordered);\n```\n\nFinally, we can use this rank relabelling as follows:\n```cpp\n#include \u003ccosta/grid2grid/transformer.hpp\u003e\n// ...\n// get the current rank\nint rank;\nMPI_Comm_rank(comm, \u0026rank);\n\n// create the transformer object:\ncosta::transformer\u003cT\u003e transf(comm);\n\n// create full layout objects\nauto A1 = costa::custom_layout(...); // local blocks should correspond to rank `rank`\nauto B1 = costa::block_cyclic_layout(...); // local blocks should correspond to rank `rank_relabelling[rank]`\n\n// schedule A1-\u003eB1\ntransf.schedule(A1, B1); // trans, alpha and beta parameters are optional\n\nauto A2 = costa::block_cyclic_layout(...); // local blocks should correspond to rank `rank`\nauto B2 = costa::custom_layout(...); // local blocks should correspond to 
rank `rank_relabelling[rank]`\n\n// schedule A2-\u003eB2\ntransf.schedule(A2, B2); // trans, alpha and beta parameters are optional\n\n// trigger the transformations which are now communication optimal\ntransf.transform();\n```\n\n## Performance Results\n\nThe performance of COSTA was compared with MKL SCALAPACK v19.1 on the [Piz Daint supercomputer](https://www.cscs.ch/computers/piz-daint/) (Cray XC40) at the Swiss National Supercomputing Centre (CSCS). To make a fair comparison, we benchmarked the scalapack routine `pdgemr2d`, which redistributes matrices between different layouts and is also provided by COSTA. In addition, we used neither the communication-optimal rank relabelling in COSTA nor hidden memory pools or memory reuse between the calls. The benchmark code is available in the provided [miniapp](#data-redistribution-with-pxgemr2d).\n\nWe ran the benchmark on `8` nodes (each having 36 cores) and `16=4x4` MPI ranks. Square matrices of varying sizes were used. When both the initial and final matrices had exactly the same block-cyclic layout, with a block size of `128x128`, the following results were achieved:\n\u003cp align=\"center\"\u003e\u003cimg src=\"./docs/costa-same.svg\" width=\"70%\"\u003e\u003c/p\u003e\n\nWhen the initial and final layouts had different block sizes, i.e. an initial block size of `36x36` and a final block size of `128x128`, the following results were achieved:\n\u003cp align=\"center\"\u003e\u003cimg src=\"./docs/costa-diff.svg\" width=\"70%\"\u003e\u003c/p\u003e\n\nTherefore, COSTA is highly-optimised even when no rank relabelling is used. 
If rank relabelling were used, even further speedups would be possible.\n\n## COSTA in Production\n\nCOSTA is used by the communication-optimal matrix-multiplication algorithm [COSMA](https://github.com/eth-cscs/COSMA), which is used in the Quantum Chemistry Simulator [CP2K](https://www.cp2k.org).\n\n## Miniapps (for testing and benchmarking)\n\n### Data-redistribution with pxgemr2d\n\nCOSTA implements the ScaLAPACK `pxgemr2d` routines that transform a matrix between two block-cyclic data layouts (`sub(C) = sub(A)`), where the two matrices do not necessarily have to belong to the same MPI communicator. In addition, COSTA will propose the MPI rank relabelling that minimizes the data reshuffling cost; the user is free to choose whether to use it. \n\nThe miniapp consists of an executable [`./build/miniapps/pxgemr2d_miniapp`](https://github.com/eth-cscs/COSTA/blob/master/miniapps/pxgemr2d_miniapp.cpp) which can be run as follows (assuming we are in the root folder of the project):\n\n```bash\n# set the number of threads to be used by each MPI rank\nexport OMP_NUM_THREADS=18\n# if using CPU version with MKL backend, set MKL_NUM_THREADS as well\nexport MKL_NUM_THREADS=18\n# run the miniapp\nmpirun -np 4 ./build/miniapps/pxgemr2d_miniapp -m 1000 -n 1000 \\\n                                               --block_a=128,128 \\\n                                               --block_c=128,128 \\\n                                               --p_grid_a=2,2 \\\n                                               --p_grid_c=2,2 \\\n                                               --type=double \\\n                                               --algorithm=costa\n```\n\nThe overview of all supported options is given below:\n- `-m (--m_dim)` (default: `1000`): number of rows of matrices `A` and `C`.\n- `-n (--n_dim)` (default: `1000`): number of columns of matrices `A` and `C`. \n- `--block_a` (optional, default: `128,128`): 2D-block size for matrix A. 
\n- `--block_c` (optional, default: `128,128`): 2D-block size for matrix C.\n- `-p (--p_grid_a)` (optional, default: `1,P`): 2D-processor grid for matrix A. By default `1xP` where `P` is the total number of MPI ranks.\n- `-q (--p_grid_c)` (optional, default: `1,P`): 2D-processor grid for matrix C. By default `1xP` where `P` is the total number of MPI ranks.\n- `-r (--n_rep)` (optional, default: 2): number of repetitions.\n- `-t (--type)` (optional, default: `double`): data type of matrix entries. Can be one of: `float`, `double`, `zfloat` and `zdouble`. The last two correspond to complex numbers.\n- `--test` (optional): if present, the result of COSTA will be verified against the result of the available SCALAPACK.\n- `--algorithm` (optional, default: `both`): defines which algorithm (`costa`, `scalapack` or `both`) to run.\n- `-h (--help)` (optional): print available options.\n\n### Scale and Transpose with pxtran and pxtranu\n\nCOSTA implements the ScaLAPACK `pxtran` and `pxtranu` routines that perform the scale and transpose operation, given by:\n```sub(C) = alpha * sub(A)^T + beta * sub(C)```\nIn addition, COSTA will propose the MPI rank relabelling that minimizes the data reshuffling cost; the user is free to choose whether to use it. 
\n\nThe miniapp consists of an executable [`./build/miniapps/pxtran_miniapp`](https://github.com/eth-cscs/COSTA/blob/master/miniapps/pxtran_miniapp.cpp) which can be run as follows (assuming we are in the root folder of the project):\n\n```bash\n# set the number of threads to be used by each MPI rank\nexport OMP_NUM_THREADS=18\n# if using CPU version with MKL backend, set MKL_NUM_THREADS as well\nexport MKL_NUM_THREADS=18\n# run the miniapp\nmpirun -np 4 ./build/miniapps/pxtran_miniapp -m 1000 -n 1000 -k 1000 \\\n                                             --block_a=128,128 \\\n                                             --block_c=128,128 \\\n                                             --p_grid=2,2 \\\n                                             --alpha=1 \\\n                                             --beta=1 \\\n                                             --type=double \\\n                                             --algorithm=costa\n```\n\nThe overview of all supported options is given below:\n- `-m (--m_dim)` (default: `1000`): number of rows of matrices `A` and `C`.\n- `-n (--n_dim)` (default: `1000`): number of columns of matrices `A` and `C`. \n- `--block_a` (optional, default: `128,128`): 2D-block size for matrix A. \n- `--block_c` (optional, default: `128,128`): 2D-block size for matrix C.\n- `-p (--p_grid)` (optional, default: `1,P`): 2D-processor grid. By default `1xP` where `P` is the total number of MPI ranks.\n- `--alpha` (optional, default: 1): alpha parameter in `sub(C) = alpha*sub(A)^T + beta*sub(C)`.\n- `--beta` (optional, default: 0): beta parameter in `sub(C) = alpha*sub(A)^T + beta*sub(C)`.\n- `-r (--n_rep)` (optional, default: 2): number of repetitions.\n- `-t (--type)` (optional, default: `double`): data type of matrix entries. Can be one of: `float`, `double`, `zfloat` and `zdouble`. 
The last two correspond to complex numbers.\n- `--test` (optional): if present, the result of COSTA will be verified against the result of the available SCALAPACK.\n- `--algorithm` (optional, default: `both`): defines which algorithm (`costa`, `scalapack` or `both`) to run.\n- `-h (--help)` (optional): print available options.\n\n### Communication Volume Reduction\n\nThe total communication volume reduction (in \\%) that can be achieved by process relabeling can be measured by running the [./build/miniapps/comm_volume](https://github.com/eth-cscs/COSTA/blob/master/miniapps/comm_volume.cpp) miniapp, without using `MPI`. The miniapp assumes that a matrix with dimensions `m x n` is transformed between two block-cyclic layouts which are specified by block sizes and process grids. The suffix `_a` refers to the initial layout and the suffix `_c` refers to the target layout.\n\n```bash\n./build/miniapps/comm_volume -m 100000 -n 100000 \\\n                             --block_a=100,100 --p_grid_a=2,4 \\\n                             --block_c=100,100 --p_grid_c=4,2\n# output:\n# Comm volume reduction [%] = 33.3333\n```\n\n## Questions?\n\nFor questions, feel free to contact us at (marko.kabic@cscs.ch), and we will get back to you soon. \n\n## Acknowledgements\n\nThis work was funded in part by:  \n\n\u003cimg align=\"left\" height=\"50\" src=\"./docs/eth-logo.svg\"\u003e | [**ETH Zurich**](https://ethz.ch/en.html)**: Swiss Federal Institute of Technology in Zurich**\n| :------------------- | :------------------- |\n\u003cimg align=\"left\" height=\"50\" src=\"./docs/cscs-logo.jpg\"\u003e | [**CSCS**](https://www.cscs.ch)**: Swiss National Supercomputing Centre**\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feth-cscs%2Fcosta","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feth-cscs%2Fcosta","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feth-cscs%2Fcosta/lists"}