{"id":22832427,"url":"https://github.com/microsoft/tilefusion","last_synced_at":"2025-04-10T02:19:40.475Z","repository":{"id":267837469,"uuid":"870511298","full_name":"microsoft/TileFusion","owner":"microsoft","description":"TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ","archived":false,"fork":false,"pushed_at":"2025-04-08T09:54:12.000Z","size":486,"stargazers_count":78,"open_issues_count":10,"forks_count":5,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-08T10:12:22.872Z","etag":null,"topics":["cpp","cuda-kernels"],"latest_commit_sha":null,"homepage":"https://tiledtensor.github.io/tilefusion-docs/","language":"Cuda","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":"SUPPORT.md","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-10T07:17:00.000Z","updated_at":"2025-04-08T09:54:16.000Z","dependencies_parsed_at":null,"dependency_job_id":"7a230b08-318f-4c42-bc71-02fec2953a80","html_url":"https://github.com/microsoft/TileFusion","commit_stats":null,"previous_names":["microsoft/tilefusion"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FTileFusion","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FTileFusion/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FTileFusion/releases","manifests_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/microsoft%2FTileFusion/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"https://codeload.github.com/microsoft/TileFusion/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248142916,"owners_count":21054672,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cpp","cuda-kernels"],"created_at":"2024-12-12T21:07:27.855Z","updated_at":"2025-04-10T02:19:40.466Z","avatar_url":"https://github.com/microsoft.png","language":"Cuda","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"assets/TileFusion-logo.png\" width=\"120\"/\u003e\n  \u003ch1\u003eTileFusion: A High-Level, Modular Tile Processing Library\u003c/h1\u003e\n  \u003cp\u003e\n    \u003ca href=\"https://tiledtensor.github.io/tilefusion-docs/docs/installation.html\"\u003e\u003cb\u003eInstallation\u003c/b\u003e\u003c/a\u003e |\n    \u003ca href=\"https://tiledtensor.github.io/tilefusion-docs/docs/examples/\"\u003e \u003cb\u003eGetting Started\u003c/b\u003e\u003c/a\u003e |\n    \u003ca href=\"https://github.com/microsoft/TileFusion/tree/master/examples\"\u003e\u003cb\u003eExamples\u003c/b\u003e\u003c/a\u003e |\n    \u003ca href=\"https://tiledtensor.github.io/tilefusion-docs/\"\u003e\u003cb\u003eDocumentation\u003c/b\u003e\u003c/a\u003e\n  \u003c/p\u003e\n\u003c/div\u003e\n\n## Overview\n\n**TileFusion**, derived from the research presented in this [paper](https://dl.acm.org/doi/pdf/10.1145/3694715.3695961), is an 
efficient C++ macro kernel library designed to elevate the level of abstraction in CUDA C for tile processing. The library offers:\n\n- **Higher-Level Programming Constructs**: TileFusion supports tiles across the three-level GPU memory hierarchy, providing device kernels for transferring tiles between CUDA memory hierarchies and for tile computation.\n- **Modularity**: TileFusion enables applications to process larger tiles built out of BaseTiles in both time and space, abstracting away low-level hardware details.\n- **Efficiency**: The library's BaseTiles are designed to match TensorCore instruction shapes and encapsulate hardware-specific performance parameters, ensuring optimal utilization of TensorCore capabilities.\n\nA core design goal of **TileFusion** is to allow users to understand and utilize provided primitives using logical concepts, without delving into low-level hardware complexities. The library rigorously separates data flow across the memory hierarchy from the configuration of individual macro kernels. This design choice enables performance enhancements through tuning, which operates in three possible ways:\n\n- **Structural Tuning**: Designs various data flows while keeping kernel configurations unchanged.\n- **Parameterized Tuning**: Adjusts kernel configurations while maintaining the same data flow.\n- **Combined Tuning**: Integrates both structural and parameterized tuning approaches simultaneously.\n\nIn summary, **TileFusion** encourages algorithm developers to focus on designing the data flow of their algorithms using efficient tile primitives. It can be utilized as:\n\n1. A lightweight C++ library with header-only usage, offering superior readability, modifiability, and debuggability.\n2. A Python library with pre-existing kernels bound to PyTorch.\n\n## The Basic GEMM Example\n\nTileFusion approaches the efficient implementation of a kernel by:\n\n1. Managing dataflow over memory hierarchies.\n2. 
Configuring tile primitives, such as tile shapes, layouts, and other parameters.\n\nThis is an example of a simple GEMM (General Matrix Multiplication) kernel written using TileFusion. For the complete example, please refer to [this file](examples/cpp/01_gemm/01_gemm_global_reg/gemm.hpp).\n\n### Configuration of the Tile Primitives\n\nThe core programming constructs in TileFusion are `Tile`, `TileLayout`, `TileIterator`, `Loader`, and `Storer`.\n\n1. **Declare the `Tile`**: [GlobalTile](https://github.com/microsoft/TileFusion/blob/master/include/types/global.hpp) and [RegTile](https://github.com/microsoft/TileFusion/blob/master/include/types/register.hpp) are used to customize the shape and layout of a 1D (vector) or 2D (matrix) array, known as a *Tile*, at a given level of the GPU's three-level memory hierarchy.\n\n2. **Declare the `TileIterator`**: Partition the `GlobalTile` into smaller, manageable sub-tiles for efficient processing.\n\n3. **Declare Loader and Storer**: Loaders and Storers use cooperative threads to transfer a tile from the source to the target location. 
They operate at the CTA level and accept the following inputs:\n\n   - **Warp Layout**\n   - **Target Tile**\n   - **Source Tile**\n\n   Based on these parameters, they automatically infer a copy plan that partitions the data transfer work among the threads.\n\n```cpp\n1  using WarpLayout = RowMajor\u003c2, 2\u003e;\n2\n3  // operand A\n4  using GlobalA = GlobalTile\u003cInType, RowMajor\u003c128, 256\u003e\u003e;\n5  using IteratorA = TileIterator\u003cGlobalA, TileShape\u003c128, 32\u003e\u003e;\n6  using RegA = RegTile\u003cBaseTileRowMajor\u003c__half\u003e, RowMajor\u003c8, 8\u003e\u003e;\n7  using ALoader = GlobalToRegLoader\u003cRegA, WarpLayout, kRowReuseCont\u003e;\n8\n9  // operand B\n10 using GlobalB = GlobalTile\u003cInType, ColMajor\u003c256, 64\u003e\u003e;\n11 using IteratorB = TileIterator\u003cGlobalB, TileShape\u003c32, 64\u003e\u003e;\n12 using RegB = RegTile\u003cBaseTileColMajor\u003c__half\u003e, ColMajor\u003c8, 4\u003e\u003e;\n13 using BLoader = GlobalToRegLoader\u003cRegB, WarpLayout, kColReuseCont\u003e;\n14\n15 // output C\n16 using GlobalC = GlobalTile\u003cAccType, RowMajor\u003c128, 64\u003e\u003e;\n17 using RegC = RegTile\u003cBaseTileRowMajor\u003cfloat\u003e, RowMajor\u003c8, 8\u003e\u003e;\n18 using CStorer = RegToGlobalStorer\u003cGlobalC, RegC, WarpLayout\u003e;\n```\n\n\u003e **Note**: To simplify the demonstration, this example involves only two memory levels: global memory and registers. 
TileFusion also applies similar concepts to [SharedTile](https://github.com/microsoft/TileFusion/blob/master/include/types/shared.hpp).\n\n### Dataflow Over Memory Hierarchies\n\nThe kernel is defined as implementing the following dataflow over memory hierarchies:\n\n```cpp\n1  template \u003ctypename InType, typename AccType,\n2            typename IteratorA, typename RegA, typename LoaderA,\n3            typename IteratorB, typename RegB, typename LoaderB,\n4            typename GlobalC, typename RegC, typename CStorer\u003e\n5  __global__ void simple_gemm(const InType* dA, const InType* dB, AccType* dC) {\n6      IteratorA gAs(dA);\n7      RegA rA;\n8      LoaderA loader_a;\n9\n10     IteratorB gBs(dB);\n11     RegB rB;\n12     LoaderB loader_b;\n13\n14     RegC acc;\n15\n16     for (int k = 0; k \u003c IteratorA::sc1; ++k) {\n17         loader_a(gAs(k), rA);\n18         loader_b(gBs(k), rB);\n19         __syncthreads();\n20\n21         gemm(rA, rB, acc);\n22     }\n23     __syncthreads();\n24\n25     GlobalC gC(dC);\n26     CStorer storer_c;\n27     storer_c(acc, gC);\n28 }\n```\n\nThe `TileIterator` (`IteratorA`, `IteratorB` in lines 6 and 10) serves as a syntactic interface for defining tile partitions. It is used to divide the `GlobalTile` into smaller sub-tiles and iterate over them.\n\n`Loader` and `Storer` (declared in lines 8, 12, and 26) are efficient methods for loading and storing data, transferring data between memory hierarchies using specialized hardware-accelerated instructions (lines 17, 18, and 27). Tiles of data are cooperatively loaded into the `RegTile`, which is stored in each thread's local register file.\n\nOnce the data is loaded into a thread's local register file, `gemm` (in line 21) performs matrix multiplication using TensorCore's warp-level matrix multiply-and-accumulate (WMMA) instruction on the `BaseTile`s. 
The specialized data distribution required by TensorCore is automatically maintained by TileFusion's `RegTile` layout.\n\nAfter the `gemm` operation is completed, the data in the `RegTile` is cooperatively stored back from registers to global memory using the `RegToGlobalStorer`.\n\n## Installation\n\nTileFusion can be used as a lightweight C++ library with header-only usage, or it can be built as a Python library. You can choose to build either one.\n\n### Prerequisites\n\nTileFusion requires:\n\n- C++20 host compiler\n- CUDA 12.0 or later\n- GCC version 10.0 or higher to support C++20 features\n\nDownload the repository:\n\n```bash\ngit clone git@github.com:microsoft/TileFusion.git\ncd TileFusion \u0026\u0026 git submodule update --init --recursive\n```\n\n### Building the C++ Library\n\nTo build the project using the provided `Makefile`, simply run:\n\n```bash\nmake\n```\n\nTo run a single C++ unit test:\n\n```bash\nmake unit_test_cpp CPP_UT=test_gemm\n```\n\n### Building the Python Package\n\n1. Build the wheel:\n\n   ```bash\n   python setup.py build bdist_wheel\n   ```\n\n2. Clean the build:\n\n   ```bash\n   python setup.py clean\n   ```\n\n3. Install the Python package in editable mode (recommended for development):\n\n   ```bash\n   python setup.py develop\n   ```\n\n   This allows you to edit the source code directly without needing to reinstall it repeatedly.\n\n### Running Unit Tests\n\nBefore running the Python unit tests, you need to build and install the Python package (see the [Building the Python Package](#building-the-python-package) section).\n\n- **Run a single Python unit test**:\n\n  ```bash\n  pytest tests/python/test_scatter_nd.py\n  ```\n\n- **Run all Python unit tests**:\n\n  ```bash\n  python setup.py pytests\n  ```\n\n- **Run all C++ unit tests**:\n\n  ```bash\n  python setup.py ctests\n  ```\n\n## Contributing\n\nThis project welcomes contributions and suggestions. 
Most contributions require you to agree to a\nContributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us\nthe rights to use your contribution. For details, visit \u003chttps://cla.opensource.microsoft.com\u003e.\n\nWhen you submit a pull request, a CLA bot will automatically determine whether you need to provide\na CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions\nprovided by the bot. You will only need to do this once across all repos using our CLA.\n\nThis project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).\nFor more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or\ncontact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.\n\n## Trademarks\n\nThis project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft\ntrademarks or logos is subject to and must follow\n[Microsoft's Trademark \u0026 Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).\nUse of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.\nAny use of third-party trademarks or logos are subject to those third-party's policies.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Ftilefusion","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrosoft%2Ftilefusion","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Ftilefusion/lists"}