{"id":13830855,"url":"https://github.com/LuisaGroup/LuisaCompute","last_synced_at":"2025-07-09T12:34:13.387Z","repository":{"id":61534259,"uuid":"314634841","full_name":"LuisaGroup/LuisaCompute","owner":"LuisaGroup","description":"High-Performance Rendering Framework on Stream Architectures","archived":false,"fork":false,"pushed_at":"2024-10-29T08:32:36.000Z","size":163088,"stargazers_count":718,"open_issues_count":13,"forks_count":64,"subscribers_count":26,"default_branch":"stable","last_synced_at":"2024-10-29T09:35:11.330Z","etag":null,"topics":["cpu","cross-platform","cuda","directx","dsl","dxr","gpu","graphics","high-performance","ispc","llvm","metal","optix","raytracing","rendering","rtx","siggraph-asia-2022"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LuisaGroup.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-11-20T18:16:01.000Z","updated_at":"2024-10-28T05:16:41.000Z","dependencies_parsed_at":"2024-04-19T05:26:03.759Z","dependency_job_id":"01f86a44-6df4-4e70-a60f-a52e43a32e2d","html_url":"https://github.com/LuisaGroup/LuisaCompute","commit_stats":{"total_commits":6240,"total_committers":38,"mean_commits":"164.21052631578948","dds":0.5096153846153846,"last_synced_commit":"6d849cb813ccd2ec2fa01c21c752af690c4b1487"},"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LuisaGroup%2FLuisaCompute","tags_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/repositories/LuisaGroup%2FLuisaCompute/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LuisaGroup%2FLuisaCompute/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LuisaGroup%2FLuisaCompute/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LuisaGroup","download_url":"https://codeload.github.com/LuisaGroup/LuisaCompute/tar.gz/refs/heads/stable","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224944553,"owners_count":17396257,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cpu","cross-platform","cuda","directx","dsl","dxr","gpu","graphics","high-performance","ispc","llvm","metal","optix","raytracing","rendering","rtx","siggraph-asia-2022"],"created_at":"2024-08-04T10:01:10.535Z","updated_at":"2025-07-09T12:34:13.373Z","avatar_url":"https://github.com/LuisaGroup.png","language":"C++","funding_links":[],"categories":["C++"],"sub_categories":[],"readme":"# LuisaCompute [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/LuisaGroup/LuisaCompute)\n\n![teaser](https://user-images.githubusercontent.com/7614925/195987646-fe932ebe-ca6e-4d6e-ab2a-203bcfd3d559.jpg)\n\nLuisaCompute is a high-performance cross-platform computing framework for graphics and beyond.\n\nLuisaCompute is also the *rendering framework* described in the **SIGGRAPH Asia 2022** paper\n\u003e ***LuisaRender: A High-Performance Rendering Framework with Layered and Unified Interfaces on Stream Architectures***.\n\nSee also 
[LuisaRender](https://github.com/LuisaGroup/LuisaRender) for the *rendering application* as described in the paper; and please visit the [project page](https://luisa-render.com) for other information about the paper and the project.\n\nYou are welcome to join the [discussion channel on Discord](https://discord.com/invite/ymYEBkUa7F)!\n\nUsers in mainland China are also welcome to join our QQ group: 1050189593.\n\n## Table of Contents\n\n- [LuisaCompute](#luisacompute)\n  - [Table of Contents](#table-of-contents)\n  - [Overview](#overview)\n    - [Embedded Domain-Specific Language](#embedded-domain-specific-language)\n    - [Unified Runtime with Resource Wrappers](#unified-runtime-with-resource-wrappers)\n    - [Multiple Backends](#multiple-backends)\n    - [Python Frontend](#python-frontend)\n    - [C API and Frontends in Other Languages](#c-api-and-frontends-in-other-languages)\n  - [Building](#building)\n  - [Usage](#usage)\n    - [A Minimal Example](#a-minimal-example)\n    - [Basic Types](#basic-types)\n    - [Structures](#structures)\n    - [Built-in Functions](#built-in-functions)\n    - [Control Flows](#control-flows)\n    - [Callable and Kernels](#callable-and-kernels)\n    - [Backends, Context, Devices and Resources](#backends-context-devices-and-resources)\n    - [Command Submission and Synchronization](#command-submission-and-synchronization)\n    - [Automatic Differentiation](#automatic-differentiation)\n  - [Applications](#applications)\n  - [Documentation and Tutorials](#documentation-and-tutorials)\n  - [Roadmap](#roadmap)\n  - [Citation](#citation)\n\n\n## Overview\n\nLuisaCompute seeks to balance the seemingly ever-conflicting pursuits of ***unification***, ***programmability***, and ***performance***. 
To achieve this goal, we design three major components:\n- A domain-specific language (DSL) embedded inside modern C++ for kernel programming exploiting JIT code generation and compilation;\n- A unified runtime with resource wrappers for cross-platform resource management and command scheduling; and\n- Multiple optimized backends, including CUDA, DirectX, Metal, and CPU.\n\nTo demonstrate the practicality of the system, we also build a Monte Carlo renderer, [LuisaRender](https://github.com/LuisaGroup/LuisaRender), atop the framework, which is faster than the state-of-the-art rendering frameworks on modern GPUs.\n\n### Embedded Domain-Specific Language\n\nThe DSL in our system provides a unified approach to authoring kernels, i.e., programmable computation tasks on the device. Distinct from typical graphics APIs that use standalone shading languages for device code, our system unifies the authoring of both the host-side logic and device-side kernels into the same language, i.e., modern C++.\n\nThe implementation purely relies on the C++ language itself, without any custom preprocessing pass or compiler extension. We exploit meta-programming techniques to simulate the syntax, and function/operator overloading to dynamically trace the user-defined kernels. 
ASTs are constructed during the tracing as an intermediate representation and later handed over to the backends for generating concrete, platform-dependent shader source code.\n\nExample program in the embedded DSL:\n```cpp\nCallable to_srgb = [](Float3 x) {\n    $if (x \u003c= 0.00031308f) {\n        x = 12.92f * x;\n    } $else {\n        x = 1.055f * pow(x, 1.f / 2.4f) - .055f;\n    };\n    return x;\n};\nKernel2D fill = [\u0026](ImageFloat image) {\n    auto coord = dispatch_id().xy();\n    auto size = make_float2(dispatch_size().xy());\n    auto rg = make_float2(coord) / size;\n    // invoke the callable\n    auto srgb = to_srgb(make_float3(rg, 1.f));\n    image.write(coord, make_float4(srgb, 1.f));\n};\n```\n\n### Unified Runtime with Resource Wrappers\n\nLike the RHIs in game engines, we introduce an abstract runtime layer to re-unify the fragmented graphics APIs across platforms. It extracts the common concepts and constructs shared by the backend APIs and plays the bridging role between the high-level frontend interfaces and the low-level backend implementations.\n\nOn the programming interfaces for users, we provide high-level resource wrappers to ease programming and eliminate boilerplate code. They are strongly and statically typed modern C++ objects, which not only simplify the generation of commands via convenient member methods but also support close interaction with the DSL. Moreover, with the resource usage information in kernels and commands, the runtime automatically probes the dependencies between commands and re-schedules them to improve hardware utilization.\n\n### Multiple Backends\n\nThe backends are the final realizers of computation. They generate concrete shader sources from the ASTs and compile them into native shaders. 
They implement the virtual device interfaces with low-level platform-dependent API calls and translate the intermediate command representations into native kernel launches and command dispatches.\n\nCurrently, we have 3 working GPU backends for the C++ and Python frontends, based on CUDA, Metal, and DirectX, respectively, and a CPU backend (re-)implemented in Rust for debugging purposes and as a fallback.\n\n### Python Frontend\n\nBesides the native C++ DSL and runtime interfaces, we are also working on a Python frontend and have published early-access packages to PyPI. You may install the pre-built wheels with pip (Python \u003e= 3.10 required):\n```bash\npython -m pip install luisa-python\n```\n\nYou may also build your own wheels with pip:\n```bash\npython -m pip wheel \u003cpath-to-project\u003e -w \u003coutput-dir\u003e\n```\n\nExamples using the Python frontend can be found under `src/tests/python`.\n\n\u003e Note: Due to the different syntax and idioms between Python and C++, the Python frontend does not reflect the C++ DSL and APIs 1:1. For instance, Python does not have a dedicated reference type qualifier, so we follow the Python idiom that structures and arrays are passed as references to `@luisa.func` and built-in types (scalar, vector, matrix, etc.) as values by default.\n\n### C API and Frontends in Other Languages\n\nWe are also making a C API for creating other language bindings and frontends (e.g., in [Rust](https://github.com/LuisaGroup/luisa-compute-rs) and C#).\n\n## Building\n\n\u003e Note: LuisaCompute is a *rendering framework* rather than a *renderer* itself. It is designed to provide general computation functionalities on modern stream-processing hardware, on which high-performance, cross-platform graphics applications can be easily built. 
If you would like to just try a Monte Carlo renderer out of the box rather than building one from scratch, please see [LuisaRender](https://github.com/LuisaGroup/LuisaRender).\n\n### Preparation\n- Check your hardware and platform. Currently, we support CUDA on Linux and Windows; DirectX on Windows; Metal on macOS; and CPU on all the major platforms. For CUDA, an RTX-enabled graphics card, e.g., NVIDIA RTX 20 and 30 series, is required. For DirectX, a DirectX-12.1 \u0026 Shader Model 6.5 compatible graphics card is required.\n\n- Prepare the environment and dependencies. We recommend using the latest IDEs, Compilers, XMake/CMake, CUDA drivers, etc. Since we aggressively use new technologies like C++20 and OptiX 8, you may need to, for example, upgrade your VS to 2019 or 2022 and install CUDA 11.7+ and NVIDIA driver R535+.\n\n- Clone the repo with the `--recursive` option:\n    ```bash\n    git clone -b next https://github.com/LuisaGroup/LuisaCompute.git/ --recursive\n    ```\n  Since we use Git submodules to manage third-party dependencies, a `--recursive` clone is required.\n\n- Detailed requirements for each platform are listed in [BUILD.md](BUILD.md).\n\n### Build via the Bootstrap Script\nThe easiest way to build LuisaCompute is to use the bootstrap script. It can even download and install the required dependencies and build the project.\n```bash\npython bootstrap.py cmake -f cuda -b # build with CUDA backend using CMake\npython bootstrap.py cmake -f cuda -b -- -DCMAKE_BUILD_TYPE=RelWithDebInfo # everything after -- will be passed to CMake\n```\n\nYou may specify `-f all` to enable all available features on your platform.\n\nTo install certain dependencies, you can use the `--install` or `-i` option. For example, to install Rust, you can use:\n```bash\npython bootstrap.py -i rust\n```\n\nAlternatively, the bootstrap script can output a configuration file for build system without actually building the project. 
This is useful when you want to use the project inside an IDE.\n```bash\npython bootstrap.py cmake -f cuda -c -o cmake-build-release # generate CMake configuration in ./cmake-build-release\n```\n\nPlease use `python bootstrap.py --help` for more details.\n\n### Build from Source with XMake/CMake\nLuisaCompute follows the standard [XMake](https://xmake.io/) and [CMake](https://cmake.org/) build process. Please see also [BUILD.md](BUILD.md) for details on platform requirements, configuration options, and other precautions.\n\n## Usage\n\n### A Minimal Example\n\nCurrently, we suggest using LuisaCompute as a submodule. For a quick start with CMake, you can find the project template [here](https://github.com/LuisaGroup/CMakeStarterTemplate).\n\nIn general, building a graphics application with LuisaCompute involves the following steps:\n\n1. Create a `Context` and load a `Device` plug-in;\n2. Create a `Stream` for command submission and other device resources (e.g., `Buffer\u003cT\u003e`s for linear storage, `Image\u003cT\u003e`s for 2D readable/writable textures, and `Mesh`es and `Accel`s for ray-scene intersection testing structures) via `Device`'s `create_*` interfaces;\n3. Author `Kernel`s to describe the on-device computation tasks, and compile them into `Shader`s via `Device`'s `compile` interface;\n4. Generate `Command`s via each resource's interface (e.g., `Buffer\u003cT\u003e::copy_to`), or `Shader`'s `operator()` and `dispatch`, and submit them to the stream;\n5. 
Wait for the results by inserting a `synchronize` phony command into the `Stream`.\n\nPutting the above together, a minimal example program that writes a color gradient to an image would look like\n```cpp\n\n#include \u003cluisa/luisa-compute.h\u003e\n\n// For the DSL sugar macros like $if.\n// We exclude this header from \u003cluisa-compute.h\u003e to avoid pollution.\n// So you have to include it explicitly to use the sugar macros.\n#include \u003cluisa/dsl/sugar.h\u003e\n\nusing namespace luisa;\nusing namespace luisa::compute;\n\nint main(int argc, char *argv[]) {\n\n    // Step 1.1: Create a context\n    Context context{argv[0]};\n    \n    // Step 1.2: Load the CUDA backend plug-in and create a device\n    Device device = context.create_device(\"cuda\");\n    \n    // Step 2.1: Create a stream for command submission\n    Stream stream = device.create_stream();\n    \n    // Step 2.2: Create a 1024x1024 image with 4-channel 8-bit storage for each pixel; the template \n    //           argument `float` indicates that pixel values read from or written to the image\n    //           are converted from `byte4` to `float4` or `float4` to `byte4` automatically\n    Image\u003cfloat\u003e device_image = device.create_image\u003cfloat\u003e(PixelStorage::BYTE4, 1024u, 1024u, 0u);\n    \n    // Step 3.1: Define kernels to describe the device-side computation\n    // \n    //           A `Callable` is a function *entity* (not directly inlined during \n    //           the AST recording) that is invocable from kernels or other callables\n    Callable linear_to_srgb = [](Float4 /* alias for Var\u003cfloat4\u003e */ linear) noexcept {\n        // The DSL syntax is much like the original C++\n        auto x = linear.xyz();\n        return make_float4(\n            select(1.055f * pow(x, 1.0f / 2.4f) - 0.055f,\n                   12.92f * x,\n                   x \u003c= 0.00031308f),\n            linear.w);\n    };\n    //           A `Kernel` is an *entry* function to the 
device workload \n    Kernel2D fill_image_kernel = [\u0026linear_to_srgb](ImageFloat /* alias for Var\u003cImage\u003cfloat\u003e\u003e */ image) noexcept {\n        Var coord = dispatch_id().xy();\n        Var rg = make_float2(coord) / make_float2(dispatch_size().xy());\n        image-\u003ewrite(coord, linear_to_srgb(make_float4(rg, 1.0f, 1.0f)));\n    };\n    \n    // Step 3.2: Compile the kernel into a shader (i.e., a runnable object on the device)\n    auto fill_image = device.compile(fill_image_kernel);\n    \n    // Prepare the host memory for holding the image\n    std::vector\u003cstd::byte\u003e download_image(1024u * 1024u * 4u);\n    \n    // Step 4: Generate commands from resources and shaders, and\n    //         submit them to the stream to execute on the device\n    stream \u003c\u003c fill_image(device_image.view(0)).dispatch(1024u, 1024u)\n           \u003c\u003c device_image.copy_to(download_image.data())\n           \u003c\u003c synchronize();// Step 5: Synchronize the stream\n   \n   // Now, you have the device-computed pixels in the host memory!\n   your_image_save_function(\"color.png\", download_image, 1024u, 1024u, 4u);\n}\n```\n\n### Basic Types\n\nIn addition to standard C++ scalar types (e.g., `int`, `uint` --- alias of `uint32_t`, `float`, and `bool`), LuisaCompute provides vector/matrix types for 3D graphics, including the following types:\n```cpp\n// boolean vectors\nusing bool2 = Vector\u003cbool, 2\u003e;   // alignment: 2B\nusing bool3 = Vector\u003cbool, 3\u003e;   // alignment: 4B\nusing bool4 = Vector\u003cbool, 4\u003e;   // alignment: 4B\n// signed and unsigned integer vectors\nusing int2 = Vector\u003cint, 2\u003e;     // alignment: 8B\nusing int3 = Vector\u003cint, 3\u003e;     // alignment: 16B\nusing int4 = Vector\u003cint, 4\u003e;     // alignment: 16B\nusing uint2 = Vector\u003cuint, 2\u003e;   // alignment: 8B\nusing uint3 = Vector\u003cuint, 3\u003e;   // alignment: 16B\nusing uint4 = Vector\u003cuint, 4\u003e;   // 
alignment: 16B\n// floating-point vectors and matrices\nusing float2 = Vector\u003cfloat, 2\u003e; // alignment: 8B\nusing float3 = Vector\u003cfloat, 3\u003e; // alignment: 16B\nusing float4 = Vector\u003cfloat, 4\u003e; // alignment: 16B\nusing float2x2 = Matrix\u003c2\u003e;      // column-major, alignment: 8B\nusing float3x3 = Matrix\u003c3\u003e;      // column-major, alignment: 16B\nusing float4x4 = Matrix\u003c4\u003e;      // column-major, alignment: 16B\n```\n\n\u003e ⚠️ Please pay attention to the alignment of 3D vectors and matrices --- they are aligned like 4D ones rather than packed. Also, we do not provide 64-bit integer or floating-point vector/matrix types, as they are less useful and typically unsupported on GPUs.\n\nTo make vectors/matrices, we provide `make_*` and read-only swizzle interfaces, e.g.,\n```cpp\nauto a = make_float2();       // (0.f, 0.f)\nauto b = make_int3(1);        // (1,   1,   1)\nauto c = make_uint3(b);       // (1u,  1u,  1u): converts from a same-dimensional but (possibly) differently typed vector\nauto d = make_float3(a, 1.f); // (0.f, 0.f, 1.f): construct float3 from float2 and a float scalar\nauto e = d.zzxy();            // (1.f, 1.f, 0.f, 0.f): swizzle\nauto m = make_float2x2(1.f);  // ((1.f, 0.f), (0.f, 1.f)): diagonal matrix from a scalar\n...\n```\n\nOperators are also overloaded for scalar-vector, vector-vector, scalar-matrix, vector-matrix, and matrix-matrix calculations, e.g.,\n```cpp\nauto one = make_float2(1.f); // (1.f, 1.f)\nauto two = 2.f;\nauto three = one + two;      // (3.f, 3.f), scalar broadcast to vector\nauto m2 = make_float2x2(2.f); // ((2.f, 0.f), (0.f, 2.f)), diagonal matrix from a scalar\nauto m3 = 1.5f * m2;         // ((3.f, 0.f), (0.f, 3.f)), scalar-matrix multiplication\nauto v = m3 * one;           // (3.f, 3.f), matrix-vector multiplication, the vector should always\n                             // appear at the right-hand side and is interpreted as a column vector\nauto m6 = m2 * m3;           // ((6.f, 0.f), (0.f, 6.f)), 
matrix-matrix multiplication\n```\n\nThe scalar, vector, matrix, and array types are also supported in the DSL, together with `make_*`, swizzles, and operators. Just wrap them in the `Var\u003cT\u003e` template or use the pre-defined aliases:\n```cpp\n// scalar types; note that 64-bit ones are not supported\nusing Int = Var\u003cint\u003e;\nusing UInt = Var\u003cuint\u003e;\nusing Float = Var\u003cfloat\u003e;\nusing Bool = Var\u003cbool\u003e;\n\n// vector types\nusing Int2 = Var\u003cint2\u003e; // = Var\u003cVector\u003cint, 2\u003e\u003e\nusing Int3 = Var\u003cint3\u003e; // = Var\u003cVector\u003cint, 3\u003e\u003e\n/* ... */\n\n// matrix types\nusing Float2x2 = Var\u003cfloat2x2\u003e; // = Var\u003cMatrix\u003c2\u003e\u003e\nusing Float3x3 = Var\u003cfloat3x3\u003e; // = Var\u003cMatrix\u003c3\u003e\u003e\nusing Float4x4 = Var\u003cfloat4x4\u003e; // = Var\u003cMatrix\u003c4\u003e\u003e\n\n// array types\ntemplate\u003ctypename T, size_t N\u003e\nusing ArrayVar = Var\u003cstd::array\u003cT, N\u003e\u003e;\n\n// make_*\nauto a = make_float2(one);    // Float2(1.f, 1.f), suppose one = Float(1.f)\nauto m = make_float2x2(a, a); // Float2x2((1.f, 1.f), (1.f, 1.f))\nauto c = make_int2(a);        // Int2(1, 1)\nauto d = c.xxx();             // Int3(1, 1, 1)\nauto e = d[0];                // 1\n/* ... */\n\n// operators\nauto v2 = a * 2.f;  // Float2(2.f, 2.f)\nauto eq = v2 == v2; // Bool2(true, true)\n/* ... */\n```\n\n\u003e ⚠️ The only exception is that we disable `operator\u0026\u0026` and `operator||` in the DSL for scalars. This is because the DSL does not support the *short-circuit* semantics. We disable them to avoid ambiguity. Please use `operator\u0026` and `operator|` instead, which have the consistent non-short-circuit semantics on both the host and device sides.\n\nBesides the `Var\u003cT\u003e` template, there's also an `Expr\u003cT\u003e`, which is to `Var\u003cT\u003e` what `const T \u0026` is to `T` on the host side. 
In other words, `Expr\u003cT\u003e` stands for a const DSL variable reference, which does not create variable copies when passed around. However, note that the parameters of `Callable`/`Kernel` definition functions may only be `Var\u003cT\u003e`. This restriction might be removed in the future.\n\nTo conveniently convert a C++ variable to the DSL, we provide a helper template function `def\u003cT\u003e`:\n```cpp\nauto a = def(1.f);              // equivalent to auto a = def\u003cfloat\u003e(1.f);\nauto b_host = make_float2(1.f); // host C++ variable float2(1.f, 1.f)\nauto b_device = def(b_host);    // device DSL variable Float2(1.f, 1.f)\n/* ... */\n```\n\n### Structures\n\nTo export a C++ data struct to the DSL, we provide a helper macro `LUISA_STRUCT`, which (semi-)automatically reflects the member layouts of the input structure:\n```cpp\n// A C++ data structure\nnamespace foo {\nstruct alignas(8) S {\n    float a;\n    int   b;\n};\n}\n\n// A reflected DSL structure\nLUISA_STRUCT(foo::S, a, b) {\n/* device-side member functions, e.g., */\n    [[nodiscard]] auto twice_a() const noexcept { return 2.f * a; }\n};\n```\n\n\u003e ⚠️ The `LUISA_STRUCT` macro may only be used in the global namespace. The C++ structure to be exported may only contain scalar, vector, matrix, array, and other already exported structure types. 
The alignment of the *whole* structure specified with `alignas` will be reflected but must be under 16B; member alignments specified with `alignas` are not supported.\n\n### Built-in Functions\n\nFor the DSL, we provide a rich set of built-in functions, in the following categories:\n- Thread coordinate and launch configuration queries, including `block_id`, `thread_id`, `dispatch_size`, and `dispatch_id`;\n- Mathematical routines, such as `max`, `abs`, `sin`, `pow`, and `sqrt`;\n- Resource accessing and modification methods, such as texture sampling, buffer read/write, and ray intersection;\n- Variable construction and type conversion, e.g., the aforementioned `make_*`, `cast\u003cT\u003e` for static type casting, and `as\u003cT\u003e` for bitwise type casting; and\n- Optimization hints for backend compilers, which currently consist of `assume` and `unreachable`.\n\nThe mathematical functions largely mirror [GLSL](https://www.khronos.org/opengl/wiki/Core_Language_(GLSL)). We are working on documentation that will describe them in more detail.\n\n### Control Flows\n\nThe DSL in LuisaCompute supports device-side control flows. They are provided as special macros prefixed with `$`:\n```cpp\n$if (cond) { /*...*/ };\n$if (cond) { /*...*/ } $else { /*...*/ };\n$if (cond) { /*...*/ } $elif (cond2) { /*...*/ };\n$if (cond) { /*...*/ } $elif (cond2) { /*...*/ } $else { /*...*/ };\n\n$while (cond) { /*...*/ };\n$for (variable, n) { /*...*/ };\n$for (variable, begin, end) { /*...*/ };\n$for (variable, begin, end, step) { /*...*/ };\n$loop { /*...*/ }; // infinite loop, unless $break'ed\n\n$switch (variable) {\n    $case (value) { /*...*/ }; // no $break needed inside, as we automatically add one\n    $default { /*...*/ };      // no $break needed inside, as we automatically add one\n};\n\n$break;\n$continue;\n```\n\nNote that users are still able to use the *native* C++ control flows, i.e., `if`, `while`, etc. *without* the `$` prefix. 
In that case the *native* control flows act like a *meta-stage* to the DSL that directly controls the generation of the callables/kernels. This can be a powerful means to achieve *multi-stage programming* patterns. Such usages can be found throughout [LuisaRender](https://github.com/LuisaGroup/LuisaRender). We will cover such usage in the tutorials in the future.\n\n### Callable and Kernels\n\nLuisaCompute supports two categories of device functions: `Kernel`s (`Kernel1D`, `Kernel2D`, or `Kernel3D`) and `Callable`s. Kernels are entries to the parallelized computation tasks on the device (equivalent to CUDA's `__global__` functions). Callables are function objects invocable from kernels or other callables (i.e., like CUDA's `__device__` functions). Both kinds are template classes that are constructible from C++ functions or function objects including lambda expressions:\n\n```cpp\n// Define a callable from a lambda expression\nCallable add_one = [](Float x) { return x + 1.f; };\n\n// A callable may invoke another callable\nCallable add_two = [\u0026add_one](Float x) {\n    return add_one(add_one(x));\n};\n\n// A callable may use captured device resources or resources in the argument list\nauto buffer = device.create_buffer\u003cfloat\u003e(...);\nCallable copy = [\u0026buffer](BufferFloat buffer2, UInt index) {\n    auto x = buffer.read(index); // use captured resource\n    buffer2.write(index, x);     // use declared resource in the argument list\n};\n\n// Define a 1D kernel from a lambda expression\nKernel1D add_one_and_some = [\u0026buffer, \u0026add_one](Float some, BufferFloat out) {\n    auto index = dispatch_id().x;    // query thread index in the whole grid with built-in dispatch_id()\n    auto x = buffer.read(index);     // use resource through capturing\n    auto result = add_one(x) + some; // invoke a callable\n    out.write(index, result);        // use resource in the argument list\n};\n```\n\n\u003e ⚠️ Note that parameters of the definition functions for 
callables and kernels must be `Var\u003cT\u003e` or `Var\u003cT\u003e \u0026` (or their aliases).\n\nKernels can be compiled into shaders by the device:\n```cpp\nauto some_shader = device.compile(some_kernel);\n```\n\n\u003e ⚠️ Note that the compilation blocks the calling thread. For large kernels this might take a considerable amount of time. You may accelerate the process by compiling multiple kernels concurrently, e.g., with thread pools.\n\nMost backends support caching the compiled shaders to accelerate future compilations of the same shader. The cache files are at `\u003cbuild-folder\u003e/bin/.cache`.\n\n### Backends, Context, Devices and Resources\u003ca name=\"devices-and-resources\"/\u003e\n\nLuisaCompute currently supports these backends:\n- CUDA\n- DirectX\n- Metal\n- CPU (Clang + LLVM)\n\nMore backends might be added in the future. A device backend is implemented as a plug-in, which follows the `lc-backend-\u003cname\u003e` naming convention and is placed under `\u003cbuild-folder\u003e/bin`.\n\nThe `Context` object is responsible for loading and managing these plug-ins and creating/destroying devices. Users have to pass the executable path (typically, `argv[0]`) or the runtime directory to a context's constructor (so that it's able to locate the plug-ins), and pass the backend name to create the corresponding device object.\n```cpp\nint main(int argc, char *argv[]) {\n    Context context{argv[0]};\n    Device device = context.create_device(\"cuda\");\n    /* ... */\n}\n```\n\n\u003e ⚠️ Creating multiple devices inside the same application is allowed. However, the resources are not shared across devices. Visiting one device's resources from another device's commands/shaders would lead to undefined behavior.\n\n\nThe device object provides methods for backend-specific operations, typically, creating resources. 
LuisaCompute supports the following resource types:\n\n- `Buffer\u003cT\u003e`s, which are linear memory ranges on the device for structured data storage;\n- `Image\u003cT\u003e`s and `Volume\u003cT\u003e`s, which are 2D/3D textures of scalars or vectors readable and writable from the shader, possibly with hardware-accelerated caching and format conversion;\n- `BindlessArray`s, which provide slots for references to buffers and textures (`Image`s or `Volume`s bound with texture samplers, read-only in the shader), helpful for reducing the overhead and bypassing the limitations of binding shader parameters;\n- `Mesh`es and `Accel`s (short for acceleration structures) for high-performance ray intersection tests, with hardware acceleration if available (e.g., on graphics cards that feature RT-Cores).\n\n\u003cimg alt=\"hardware_resources\" src=\"https://user-images.githubusercontent.com/7614925/196001295-a5407f09-77a0-461a-ab23-ab768ddc08e9.jpg\" align=\"center\" width=\"65%\"/\u003e\n\n\nDevices are also responsible for\n- Creating `Stream`s and `Event`s (the former are for command submission and the latter are for host-stream and stream-stream synchronization); and\n- Compiling kernels into shaders, as introduced before.\n\n\nAll resources, shaders, streams, and events are C++ objects with *move* constructors/assignments that follow the *RAII* idiom, i.e., they automatically call the `Device::destroy_*` interfaces when destructed.\n\n\u003e ⚠️ Users need to take care not to leave a resource dangling, e.g., by accidentally releasing it before the dependent commands finish.\n\n### Command Submission and Synchronization\n\nLuisaCompute adopts the explicit command-based execution model. 
Conceptually, commands are description units of atomic computation tasks, such as transferring data between the device and host, or from one resource to another; building meshes and acceleration structures; populating or updating bindless arrays; and most importantly, launching shaders.\n\nCommands are organized into command buffers and then submitted to streams which are essentially queues forwarding commands to the backend devices in a logically first-in-first-out (FIFO) manner.\n\nThe resource wrappers provide convenient methods for creating commands, e.g.,\n```cpp\nauto buffer_upload_command   = buffer.copy_from(host_data);\nauto accel_build_command     = accel.build();\nauto shader_dispatch_command = shader(args...).dispatch(n);\n```\nCommand buffers group commands that are submitted together:\n```cpp\nauto command_buffer = stream.command_buffer();\ncommand_buffer\n    \u003c\u003c raytrace_shader(framebuffer, accel, resolution)\n        .dispatch(resolution)\n    \u003c\u003c accumulate_shader(accum_image, framebuffer)\n        .dispatch(resolution)\n    \u003c\u003c hdr2ldr_shader(accum_image, ldr_image)\n        .dispatch(resolution)\n    \u003c\u003c ldr_image.copy_to(host_image.data())\n    \u003c\u003c commit(); // the commands are submitted to the stream together on commit()\n```\n\nFor convenience, a stream implicitly creates a proxy object, which submits the commands in its internal command buffer at the end of the statement:\n```cpp\nstream \u003c\u003c buffer.copy_from(host_data) // a stream proxy is created on Stream::operator\u003c\u003c()\n       \u003c\u003c accel.build()               // consecutive commands are stored in the implicit command buffer in the proxy object\n       \u003c\u003c raytracing(image, accel, i)\n           .dispatch(width, height);  // the proxy object automatically submits the commands at the end of the statement\n```\n\n\u003e ⚠️ Since commands are asynchronously executed, users should pay attention to resource and host data 
lifetimes.\n\nThe backends in LuisaCompute can automatically determine the dependencies between the commands in a command buffer, and re-schedule them into an optimized order to improve hardware utilization. Therefore, larger command buffers might be preferred for better computation throughput.\n\n\u003cimg alt=\"command scheduling\" src=\"https://user-images.githubusercontent.com/7614925/196001465-2dace78b-5e3b-4b4b-b2c3-f2cd61adc6ff.jpg\" align=\"center\" width=\"60%\"/\u003e\n\nMultiple streams run concurrently. Therefore, users may need to synchronize them with each other or with the host via `Event`s, which are similar to condition variables that ensure ordering across threads:\n```cpp\nauto event = device.create_event();\nstream_a \u003c\u003c command_a\n         \u003c\u003c event.signal(); // signals an event\nstream_b \u003c\u003c event.wait()    // waits until the event signals\n         \u003c\u003c command_b       // will be executed after the event signals\n         \u003c\u003c event.signal(); // signals again\nevent.synchronize();        // blocks until the event signals\n```\n### Automatic Differentiation\nWe implemented reverse-mode autodiff using source-to-source transformation. The autodiff supports control flow such as if-else and switch, as well as callables. The following example shows how to use autodiff to compute the gradient of a function `f(t, x, y) = t \u003c 1 ? x * y : x + y` with respect to `x` and `y`:\n```cpp\nVar\u003cfloat\u003e x = ...;\nVar\u003cfloat\u003e y = ...;\nVar\u003cfloat\u003e t = ...;\n$autodiff {\n    requires_grad(x, y);\n    Var\u003cfloat\u003e z;\n    $if(t \u003c 1.0) {\n        auto no_grad = some_non_differentiable_function(x, y);\n        z = x * y;\n    }$else {\n        z = callable(x, y);\n    };\n    backward(z);\n    dx-\u003ewrite(tid, grad(x));\n    dy-\u003ewrite(tid, grad(y));\n};\n```\n\nLimitation (might be removed in the future):\n- we don't support loops with dynamic iteration counts. 
To differentiate a loop, users have to unroll it by using a native C++ loop such as `for (auto i = 0; i \u003c count; i++) { dsl_body(i); }`.  \n\n## Applications\n\nWe implement several proof-of-concept examples in-tree under `src/tests` (sorry for the misleading naming; they also serve as the test programs we used during development). Besides, you may also find the following applications interesting:\n\n- [LuisaRender](https://github.com/LuisaGroup/LuisaRender.git), a high-performance cross-platform Monte Carlo renderer.\n- [LuisaShaderToy](https://github.com/LuisaGroup/LuisaShaderToy.git), a collection of amazing shaders ported from [Shadertoy](https://www.shadertoy.com).\n\n## Documentation and Tutorials\n\nSorry that we are still working on them. Currently, we would recommend reading the original [paper](https://luisa-render.com) and learning through the examples and applications.\n\nIf you have any problems or suggestions, please feel free to open an [issue](https://github.com/LuisaGroup/LuisaCompute/issues) or start a [discussion](https://github.com/LuisaGroup/LuisaCompute/discussions). We are very happy to hear from you!\n\n## Roadmap\n\nSee [ROADMAP.md](ROADMAP.md).\n\n\n## Citation\n\n```bibtex\n@article{Zheng2022LuisaRender,\n    author = {Zheng, Shaokun and Zhou, Zhiqian and Chen, Xin and Yan, Difei and Zhang, Chuyan and Geng, Yuefeng and Gu, Yan and Xu, Kun},\n    title = {LuisaRender: A High-Performance Rendering Framework with Layered and Unified Interfaces on Stream Architectures},\n    year = {2022},\n    issue_date = {December 2022},\n    publisher = {Association for Computing Machinery},\n    address = {New York, NY, USA},\n    volume = {41},\n    number = {6},\n    issn = {0730-0301},\n    url = {https://doi.org/10.1145/3550454.3555463},\n    doi = {10.1145/3550454.3555463},\n    journal = {ACM Trans. 
Graph.},\n    month = {nov},\n    articleno = {232},\n    numpages = {19},\n    keywords = {stream architecture, rendering framework, cross-platform renderer}\n}\n```\n\nThe [publisher](https://doi.org/10.1145/3550454.3555463) version of the paper is open-access. You may download it for free.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLuisaGroup%2FLuisaCompute","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FLuisaGroup%2FLuisaCompute","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLuisaGroup%2FLuisaCompute/lists"}