{"id":27011656,"url":"https://github.com/eth-cscs/tiled-mm","last_synced_at":"2025-07-19T21:33:01.229Z","repository":{"id":42176916,"uuid":"181881173","full_name":"eth-cscs/Tiled-MM","owner":"eth-cscs","description":"Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs. ","archived":false,"fork":false,"pushed_at":"2025-04-02T07:19:08.000Z","size":776,"stargazers_count":33,"open_issues_count":2,"forks_count":10,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-06-20T22:05:02.015Z","etag":null,"topics":["amd","cublas","cublasxt","cuda","gpu","matmul","matrix-multiplication","nvidia","rocblas","rocblasxt","rocm"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/eth-cscs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-04-17T11:53:14.000Z","updated_at":"2025-06-15T14:24:39.000Z","dependencies_parsed_at":"2023-12-17T22:42:23.268Z","dependency_job_id":"909c9bae-37ff-47e6-959e-876239fce3db","html_url":"https://github.com/eth-cscs/Tiled-MM","commit_stats":{"total_commits":93,"total_committers":10,"mean_commits":9.3,"dds":0.4301075268817204,"last_synced_commit":"cde8084a06042f8ed27ce30b59903ae759312ac0"},"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/eth-cscs/Tiled-MM","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eth-cscs%2FTiled-MM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eth
-cscs%2FTiled-MM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eth-cscs%2FTiled-MM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eth-cscs%2FTiled-MM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/eth-cscs","download_url":"https://codeload.github.com/eth-cscs/Tiled-MM/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eth-cscs%2FTiled-MM/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266019657,"owners_count":23864916,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["amd","cublas","cublasxt","cuda","gpu","matmul","matrix-multiplication","nvidia","rocblas","rocblasxt","rocm"],"created_at":"2025-04-04T11:36:23.705Z","updated_at":"2025-07-19T21:33:01.206Z","avatar_url":"https://github.com/eth-cscs.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Table of Contents\n- [Overview](#overview)\n- [Performance](#performance)\n- [Features](#features)\n- [Building and Installing](#building-and-installing)\n- [Minimal Working Example](#minimal-working-example)\n- [Running the Benchmarks](#running-the-benchmarks)\n- [Testing](#testing)\n- [Author](#author)\n\n\n## Overview\n\nTiled-MM is a very fast and easy-to-use library for multiplying matrices on GPUs. Unlike NVIDIA's `cublas`, this library takes pointers to host (CPU) memory, splits the matrices into tiles, pipelines them efficiently to the GPU and copies the result back to the CPU. 
It can serve as almost a drop-in replacement for `cublasXt`, and is ported to both NVIDIA and AMD GPUs.\n\nIt offers more features than the standard cublas API. For example, the user can specify the number of GPU streams to be used, as well as the tile size for each dimension separately, which is not possible with the standard cublas API.\n\nTiled-MM is used in production as a backend of the [COSMA](https://github.com/eth-cscs/COSMA) algorithm and is thus well-tested.\n\n## Performance\n\nThe benchmarks were performed on a single node of the Piz Daint supercomputer (Cray XC50), equipped with an NVIDIA `P100` GPU. We compared the performance of `Tiled-MM` with the vanilla version of `cublasXt` and with a manually tuned version of `cublasXt`, in which we set the tile size to `4000` and enabled the pinned memory mode. `Tiled-MM` was substantially faster than the vanilla version of `cublasXt` and achieved performance similar to the manually tuned version, as can be seen from the results below.\n\u003cp align=\"center\"\u003e\u003cimg src=\"https://github.com/eth-cscs/Tiled-MM/blob/master/docs/performance.svg\" width=\"90%\"\u003e\u003c/p\u003e\n\nIn the benchmark, we used double precision, square matrices stored in column-major order, and `alpha = beta = 1.0`.\n\n## Features\n\n- The user can specify the tile size of each dimension separately.\n- The user can specify the number of streams to be used.\n- The user can reuse the same context (and thus the same device memory) for many multiplications, which can lead to significant performance improvements.\n- Fully templatized, supporting arbitrary data types.\n- Ported to both `NVIDIA` and `AMD` GPUs.\n\n## Building and Installing\n\nAssuming that you want to use the `gcc 8` compiler, you can build the project as follows:\n```bash\n# clone the repo\ngit clone https://github.com/eth-cscs/Tiled-MM\ncd Tiled-MM\nmkdir build\ncd build\n\n# configure\nCC=gcc-8 CXX=g++-8 cmake 
-DTILEDMM_GPU_BACKEND=CUDA ..\n\n# compile\nmake -j 4\n```\n\nBuilding the examples requires [cxxopts](https://github.com/jarro2783/cxxopts), which is available in most package managers, e.g. `apt-get install libcxxopts-dev` (Ubuntu) or `brew install cxxopts` (macOS).\n\nThe option `-DTILEDMM_GPU_BACKEND` can have the following values:\n- `CUDA`: for NVIDIA GPUs\n- `ROCM`: for AMD GPUs\n\n## Minimal Working Example\n\nUsing the library is very simple: add `#include \u003ctiled_mm.hpp\u003e` and use it as follows:\n```cpp\n// m, n, k, trans_a, trans_b, ld_a, ld_b and ld_c are assumed to be defined by the caller\n// A dimensions: m x k\nauto a_host = gpu::malloc_pinned\u003cdouble\u003e(m * k, 1);\n// B dimensions: k x n\nauto b_host = gpu::malloc_pinned\u003cdouble\u003e(k * n, 1);\n// C dimensions: m x n\nauto c_host = gpu::malloc_pinned\u003cdouble\u003e(m * n, 0);\n\ndouble alpha = 1.0;\ndouble beta = 0.0;\n\n// preallocates device buffers and other GPU resources\n// the context does not have to be created explicitly\n// so the user can omit this part\nauto ctx = gpu::make_context();\n\n// compute c = alpha * a * b + beta * c\n// There is also a version without ctx, in case the user\n// does not want to create the context explicitly\ngpu::gemm(*ctx,\n          trans_a, trans_b,\n          m, n, k,\n          alpha,\n          a_host, ld_a,\n          b_host, ld_b,\n          beta,\n          c_host, ld_c);\n\n// optionally, we can set the following two boolean flags\nbool pin_buffers = false; // since a_host, b_host and c_host are already pinned, gpu::gemm should not pin them\nbool copy_c_back = true;  // whether to copy the result back to the host or leave it on the gpu\ngpu::gemm(*ctx,\n          trans_a, trans_b,\n          m, n, k,\n          alpha,\n          a_host, ld_a,\n          b_host, ld_b,\n          beta,\n          c_host, ld_c,\n          pin_buffers, copy_c_back);\n\n// if copy_c_back == false, the result is stored on the device with the following pointer:\ndouble* c_device = 
ctx-\u003eget_full_device_buffer_c().data();\n```\nWhen creating the context, the user can specify the tile dimensions and the number of streams to be used:\n```cpp\nint tile_size_m = 5000;\nint tile_size_n = 5000;\nint tile_size_k = 5000;\nint n_streams = 2;\n\nauto ctx = gpu::make_context(n_streams, tile_size_m, tile_size_n, tile_size_k);\n```\n## Running the Benchmarks\n\nFor detailed benchmarking, there is a miniapp that takes the host pointers for A, B and C, computes `C = beta * C + alpha * A * B`, and outputs the time-to-solution as well as the throughput.\n\nThe miniapp consists of the executable `./build/examples/multiply`, which can be run with the following command line (assuming we are in the root folder of the project):\n```bash\n./build/examples/multiply -m 10000 -n 10000 -k 10000 -r 1\n```\nAn overview of the supported options is given below:\n\n| Option flags | Possible values | Description |\n| :------------------- | :------------------- | :------------------- |\n| `-m (--m_dim)` | positive integer | number of rows of `C` |\n| `-n (--n_dim)` | positive integer | number of columns of `C` |\n| `-k (--k_dim)` | positive integer | size of the shared dimension between matrices `A` and `B` |\n| `--tile_m` | positive integer | tile size for dimension `m` |\n| `--tile_n` | positive integer | tile size for dimension `n` |\n| `--tile_k` | positive integer | tile size for dimension `k` |\n| `--ld_a` | positive integer | leading dimension of matrix `A` |\n| `--ld_b` | positive integer | leading dimension of matrix `B` |\n| `--ld_c` | positive integer | leading dimension of matrix `C` |\n| `-t (--transpose)` | a string XY, where X, Y can be one of {N, T, C} | transpose flags for matrices `A` and `B` |\n| `--alpha` | real value (double) | the `alpha` in `C = beta * C + alpha * A * B` |\n| `--beta` | real value (double) | the `beta` in `C = beta * C + alpha * A * B` |\n\nFor example, running with the following flags:\n```bash\n./build/examples/multiply -m 1000 -n 1000 -k 1000 --transpose=TN -r 1\n```\nshould produce the 
following output:\n```bash\n==================================================\n                Benchmarking Tiled-MM\n==================================================\n         MATRIX SIZES\n=============================\n A = (1000, 1000)\n B = (1000, 1000)\n C = (1000, 1000)\n=============================\n         LEADING DIMS\n=============================\n LD_A = 1000\n LD_B = 1000\n LD_C = 1000\n=============================\n      SCALING CONSTANTS\n=============================\n alpha = 1\n beta  = 1\n=============================\n      TRANSPOSE FLAGS\n=============================\n trans_a = T\n trans_b = N\n=============================\n         TILE SIZES\n=============================\n tile_m = 5000\n tile_n = 5000\n tile_k = 5000\n=============================\n      ADDITIONAL OPTIONS\n=============================\n num. of gpu streams = 2\n num. of repetitions = 1\n=============================\n\n==================================================\n         Results of benchmarking Tiled-MM\n==================================================\n 1) The version with copying C to back to host:\n    -\u003e Avg Time [ms] = 11\n    -\u003e Throughput [Gflops] = 181.818\n==================================================\n 2) The version without copying C to back to host:\n    -\u003e Avg Time [ms] = 10\n    -\u003e Throughput [Gflops] = 200\n==================================================\n```\n\n## Testing\n\nFor testing purposes, there is a testing miniapp that generates random matrices A, B and C, computes `C = beta * C + alpha * A * B` with Tiled-MM as well as with BLAS, and reports whether the result is correct.\n\nThe miniapp consists of the executable `./build/tests/test-multiply`, which **supports the same parameters** as the benchmarking miniapp (see above). It can be run e.g. 
with the following command line (assuming we are in the root folder of the project):\n```bash\n./build/tests/test-multiply -m 1000 -n 1000 -k 1000 --transpose=TN\n```\nwhich should produce the following output:\n```bash\n==================================================\n                Benchmarking Tiled-MM\n==================================================\n         MATRIX SIZES\n=============================\n A = (1000, 1000)\n B = (1000, 1000)\n C = (1000, 1000)\n=============================\n         LEADING DIMS\n=============================\n LD_A = 1000\n LD_B = 1000\n LD_C = 1000\n=============================\n      SCALING CONSTANTS\n=============================\n alpha = 1\n beta  = 1\n=============================\n      TRANSPOSE FLAGS\n=============================\n trans_a = T\n trans_b = N\n=============================\n         TILE SIZES\n=============================\n tile_m = 5000\n tile_n = 5000\n tile_k = 5000\n=============================\n      ADDITIONAL OPTIONS\n=============================\n num. of gpu streams = 2\n num. of repetitions = 1\n=============================\nTime [ms] with copying C back: 11\nTime [ms] without copying C back: 10\nThe result is CORRECT\n```\nRunning `make test` will run a few default tests.\n\n## Author\nMarko Kabic (marko.kabic@inf.ethz.ch)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feth-cscs%2Ftiled-mm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feth-cscs%2Ftiled-mm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feth-cscs%2Ftiled-mm/lists"}