{"id":13809612,"url":"https://github.com/jundaf2/CUDA-INT8-GEMM","last_synced_at":"2025-05-14T08:33:02.552Z","repository":{"id":175600751,"uuid":"654151345","full_name":"jundaf2/CUDA-INT8-GEMM","owner":"jundaf2","description":"CUDA 8-bit Tensor Core Matrix Multiplication based on m16n16k16 WMMA API","archived":false,"fork":false,"pushed_at":"2023-09-15T18:38:30.000Z","size":4494,"stargazers_count":30,"open_issues_count":1,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-13T09:11:51.179Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Cuda","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jundaf2.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-06-15T13:46:24.000Z","updated_at":"2025-04-12T14:42:33.000Z","dependencies_parsed_at":null,"dependency_job_id":"eaf17a76-5d11-41a8-bdf1-34393a5a6479","html_url":"https://github.com/jundaf2/CUDA-INT8-GEMM","commit_stats":null,"previous_names":["jundaf2/cuda-int8-gemm"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jundaf2%2FCUDA-INT8-GEMM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jundaf2%2FCUDA-INT8-GEMM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jundaf2%2FCUDA-INT8-GEMM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jundaf2%2FCUDA-INT8-GEMM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jundaf2","download_url":"https://codeload.github.com/ju
ndaf2/CUDA-INT8-GEMM/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254104887,"owners_count":22015558,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-04T02:00:32.482Z","updated_at":"2025-05-14T08:32:57.543Z","avatar_url":"https://github.com/jundaf2.png","language":"Cuda","funding_links":[],"categories":["Learning Resources","Example Implementations 💡"],"sub_categories":["Blogs 🖋️"],"readme":"# CUDA-INT8-GEMM\n## Introduction\nThe 8-bit GEMM takes two 8-bit input matrices and produces an output matrix that is also 8-bit.  \n\nC = A*B^T\n\nWe adopt the same convention as the cuBLAS library, where the matrices are stored in column-major order. `GEMM_OP_T` means the matrix is transposed in column-major representation, which is equivalent to the non-transposed matrix in row-major representation. `GEMM_OP_N` means the matrix is not transposed in column-major representation, which is equivalent to the transposed matrix in row-major representation. The same convention applies to matrix C.\n\nYou may understand the `T` and `N` in these flags as either the `transpose` / `non-transpose` operation for col-major BLAS (Fortran) matrices or `true` / `not true` for row-major C/C++ matrices.\n\n## The 8-bit WMMA Tensor Core API with Shape m16n16k16\nSince there is no single PTX instruction that performs an m16n16k16 8-bit matrix multiplication, we believe the built-in intrinsic `__imma_m16n16k16_mma_s8` is composed of four `mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32` instructions. 
The following figure shows how the four 8-bit m8n8k16 instructions combine into one m16n16k16 built-in intrinsic. For simplicity, and without much regard for performance, this example uses `cp.async.ca.shared.global` to load data from global memory to shared memory asynchronously. `wmma::load_matrix_sync` loads the data from shared memory into registers, and `wmma::mma_sync` performs the matrix multiplication.\n\nFor the detailed register data layout of the WMMA 8-bit m16n16k16 API, see the following figure:\n\n\u003ccenter\u003e\u003cimg src=\"./in8_tensor_core_wmma.png\" ...\u003e\u003c/center\u003e\n\n## Current feature\n\nThe output is also of type `int8`. For example, when you use GEMM in an 8-bit framework, you may want to use the `int8` output as the input to the next layer's operation, even though the tensor core itself uses an `int32` accumulator.\n\nPerformance is currently poor due to:\n* unresolved bank conflicts when loading data from shared memory into registers\n* unoptimized global memory writes\n\nYou can try different matrix sizes with the following command (you may need to tune the block size and grid size in the code):\n``` \n    ./test_gemm_i8 1024 1024 1024 1 0 1 1\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjundaf2%2FCUDA-INT8-GEMM","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjundaf2%2FCUDA-INT8-GEMM","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjundaf2%2FCUDA-INT8-GEMM/lists"}