{"id":18422015,"url":"https://github.com/spcl/gemm_hls","last_synced_at":"2025-05-16T12:11:02.559Z","repository":{"id":47075941,"uuid":"135713410","full_name":"spcl/gemm_hls","owner":"spcl","description":"Scalable systolic array-based matrix-matrix multiplication implemented in Vivado HLS for Xilinx FPGAs.","archived":false,"fork":false,"pushed_at":"2025-01-20T20:53:22.000Z","size":17540,"stargazers_count":332,"open_issues_count":6,"forks_count":57,"subscribers_count":16,"default_branch":"master","last_synced_at":"2025-04-03T18:11:54.713Z","etag":null,"topics":["fpga","high-level-synthesis","hls","vivado-hls"],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/spcl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-06-01T12:11:31.000Z","updated_at":"2025-03-29T01:18:32.000Z","dependencies_parsed_at":"2025-02-27T13:25:12.940Z","dependency_job_id":"c5169c7d-2009-4110-bf55-a4b26277fab2","html_url":"https://github.com/spcl/gemm_hls","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spcl%2Fgemm_hls","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spcl%2Fgemm_hls/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spcl%2Fgemm_hls/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spcl%2Fgemm_hls/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/owners/spcl","download_url":"https://codeload.github.com/spcl/gemm_hls/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248525566,"owners_count":21118699,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fpga","high-level-synthesis","hls","vivado-hls"],"created_at":"2024-11-06T04:27:45.223Z","updated_at":"2025-04-12T06:20:19.293Z","avatar_url":"https://github.com/spcl.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"Scalable matrix matrix multiplication on FPGA\n=============================================\n\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3952084.svg)](https://doi.org/10.5281/zenodo.3952084)\n\nThis repository includes a pure Vitis HLS implementation of matrix-matrix multiplication (A\\*B=C) for Xilinx FPGAs, using Xilinx Vitis to instantiate memory and PCIe controllers and interface with the host. \n\nExperiments run on a [VCU1525](https://www.xilinx.com/products/boards-and-kits/vcu1525-a.html) achieved 462 GFLOP/s, 301 GFLOP/s and 132 GFLOP/s for half, single, and double precision, respectively, with routing across the three SLRs being the primary bottleneck preventing further scaling. The code is not device-specific, and can be configured for any Xilinx FPGA supported by the Xilinx OpenCL runtime.  
Kernels have also been verified to execute on TUL KU115, Alveo U250, and Alveo U280 boards with similar results.\n\nThe implementation uses a systolic array approach, where linearly connected processing elements compute distinct contributions to the outer product of tiles of the output matrix. \n\nThe approach used to implement this kernel was presented at [FPGA'20](https://spcl.inf.ethz.ch/Publications/.pdf/gemm-fpga.pdf) [1].  For a general description of the optimization techniques that we apply, we refer to our article on [HLS transformations](https://spcl.inf.ethz.ch/Publications/.pdf/hls-transformations.pdf) [2].  We also gave [a tutorial on HLS](https://spcl.inf.ethz.ch/Teaching/hls-tutorial/) for HPC at SC'21, ISC'21, SC'20, HiPEAC'20, SC'19, SC'18, and PPoPP'18.\n\nDownloading the code\n--------------------\n\nThis project uses the open source Vivado HLS extension library [hlslib](https://github.com/definelicht/hlslib) [3] for simulation, vectorization, finding Xilinx tools, host-side integration and more.\n\nSince hlslib is included as a submodule, make sure you clone with `--recursive` or grab it after cloning with:\n\n```\ngit submodule update --init \n```\n\nPrerequisites\n-------------\n\nTo build and run kernels in hardware, Xilinx [Vitis](https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/vitis.html) must be installed and available on the PATH (tested on Alveo U250 and Alveo U280 with version 2021.1).\n\nConfiguration and running\n-------------------------\n\nThis project is configured and built using CMake. 
Most parameters must be set at configuration-time, as they are used to specialize the hardware.\n\nAn example of configuring and building the kernel and executing it in hardware is shown below (starting from the source directory):\n\n```bash\nmkdir build\ncd build\ncmake ../ -DMM_DATA_TYPE=float -DMM_PARALLELISM_N=32 -DMM_PARALLELISM_M=8 -DMM_MEMORY_TILE_SIZE_N=512 -DMM_MEMORY_TILE_SIZE_M=512\nmake\nmake hw\n./RunHardware.exe 1024 1024 1024 hw\n```\n\nMatrix sizes use the convention that `A: NxK`, `B: KxM`, and `C: NxM`.\n\nBy default the build targets the Alveo U250 acceleration board, but this can be configured using the `MM_PLATFORM` CMake parameter.\n\nThe implementation is not restricted to using multiplication and addition as operators. To use other operators, for example addition and minimum to implement the [distance product](https://en.wikipedia.org/wiki/Min-plus_matrix_multiplication), specify them using the `MM_MAP_OP` and `MM_REDUCE_OP` CMake parameters, respectively. To see which operators are pre-implemented, and examples of how to implement new operators, see `hlslib/include/hlslib/xilinx/Operators.h`.\n\nSelecting tile sizes\n--------------------\n\nSee our [publication at FPGA'20](https://spcl.inf.ethz.ch/Publications/.pdf/gemm-fpga.pdf) [1] on how to choose tile sizes for optimal fast memory and compute utilization.\n\nParallel performance\n--------------------\n\nThe amount of parallelism in the code is determined by the `MM_PARALLELISM_N` and `MM_PARALLELISM_M` configuration variables. The former determines the number of processing elements instantiated, and the latter regulates the vector width/granularity of each processing element.  `MM_PARALLELISM_M` should be set to a maximum of 64 bytes / `sizeof(\u003cyour operand\u003e)` (i.e., 8 for `float` or `int`, 4 for `double` or `long`, 16 for 16-bit `int`, etc.) 
to avoid performance and routing issues.\n\nThe expected performance in Op/s (FLOP/s in the case of floating point types) of a given configuration can be computed as:\n\n`2 * MM_PARALLELISM_N * MM_PARALLELISM_M * Frequency`\n\nIn practice, `MM_PARALLELISM_N` buffered values of A are applied to `MM_PARALLELISM_M` values of B. \n\nBugs\n----\n\nIf you experience bugs, or have suggestions for improvements, please use the issue tracker to report them.\n\nPublication\n-----------\n\nIf this code has been useful to your research, please consider citing us:\n\n**BibTeX:**\n```\n@inproceedings{mmm_hls,\n  title={Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis},\n  author={de~Fine~Licht, Johannes and Kwasniewski, Grzegorz and Hoefler, Torsten},\n  booktitle={The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA'20)},\n  year={2020}\n}\n```\n\n**Plain text:**\n```\nJohannes de Fine Licht, Grzegorz Kwasniewski, and Torsten Hoefler. \"Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis.\" In Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA'20).\n```\n\nReferences\n----------\n\n[1] Johannes de Fine Licht, Grzegorz Kwasniewski, and Torsten Hoefler, _\"Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis\"_, in Proceedings of 28th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA'20), 2020.\n\n[2] Johannes de Fine Licht, Maciej Besta, Simon Meierhans, and Torsten Hoefler. _\"Transformations of High-Level Synthesis Codes for High-Performance Computing.\"_ IEEE Transactions on Parallel and Distributed Systems (TPDS), Vol. 32, Issue 5, 2021.\n\n[3] Johannes de Fine Licht, and Torsten Hoefler. 
_\"hlslib: Software Engineering for Hardware Design.\"_, presented at the Fifth International Workshop on\nHeterogeneous High-performance Reconfigurable Computing (H2RC'19).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspcl%2Fgemm_hls","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fspcl%2Fgemm_hls","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspcl%2Fgemm_hls/lists"}