{"id":20956755,"url":"https://github.com/mitsuba-renderer/drjit-core","last_synced_at":"2025-04-05T22:04:34.312Z","repository":{"id":46748850,"uuid":"245897378","full_name":"mitsuba-renderer/drjit-core","owner":"mitsuba-renderer","description":"Dr.Jit — A Just-In-Time-Compiler for Differentiable Rendering (core library)","archived":false,"fork":false,"pushed_at":"2025-03-31T14:58:31.000Z","size":4436,"stargazers_count":91,"open_issues_count":4,"forks_count":19,"subscribers_count":13,"default_branch":"master","last_synced_at":"2025-04-02T04:50:50.738Z","etag":null,"topics":["cuda","jit","llvm"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mitsuba-renderer.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-03-08T22:21:54.000Z","updated_at":"2025-03-31T14:32:09.000Z","dependencies_parsed_at":"2023-10-11T10:46:24.638Z","dependency_job_id":"c458912c-3756-46c9-a804-c53e7595b24e","html_url":"https://github.com/mitsuba-renderer/drjit-core","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mitsuba-renderer%2Fdrjit-core","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mitsuba-renderer%2Fdrjit-core/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mitsuba-renderer%2Fdrjit-core/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mitsuba-renderer%2Fdrjit-core/manifes
ts","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mitsuba-renderer","download_url":"https://codeload.github.com/mitsuba-renderer/drjit-core/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247406085,"owners_count":20933803,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","jit","llvm"],"created_at":"2024-11-19T01:27:47.128Z","updated_at":"2025-04-05T22:04:34.296Z","avatar_url":"https://github.com/mitsuba-renderer.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://github.com/mitsuba-renderer/drjit-core/raw/master/resources/drjit-logo-dark.svg#gh-light-mode-only\" alt=\"Dr.Jit logo\" width=\"500\"/\u003e\n\u003cimg src=\"https://github.com/mitsuba-renderer/drjit-core/raw/master/resources/drjit-logo-light.svg#gh-dark-mode-only\" alt=\"Dr.Jit logo\" width=\"500\"/\u003e\n\u003c/p\u003e\n\n# Dr.Jit-Core — A tracing GPU/CPU Just-In-Time-Compiler\n\n| Continuous Integration |\n|         :---:          |\n|   [![rgl-ci][1]][2]    |\n\n[1]: https://rgl-ci.epfl.ch/app/rest/builds/aggregated/strob:(buildType:(project:(id:DrJitCore)))/statusIcon.svg\n[2]: https://rgl-ci.epfl.ch/project/DrJitCore?mode=trends\u0026guest=1\n\n\n## Introduction\n\nThis repository contains an efficient and self-contained *just-in-time* (JIT)\ncompiler that can vectorize and parallelize computation. 
It was designed to\naccelerate differentiable Monte Carlo rendering that requires dynamic\ncompilation of large amounts of derivative code, though other types of\nembarrassingly parallel computation are likely to benefit as well.\n\nThis library exposes a C and C++ interface that can be used to *trace*\ncomputation, which means that the system internally builds a graph\nrepresentation of all steps while postponing their evaluation for as long as\npossible. When the traced computation is finally evaluated, the system fuses\nall operations into an efficient kernel containing queued computation that is\nasynchronously evaluated on a desired device. On the CPU, this involves\ncompilation of vectorized [LLVM IR](https://llvm.org/docs/LangRef.html) and\nparallel execution using a thread pool, while GPU compilation involves [NVIDIA\nPTX](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html), and\neither [CUDA](https://docs.nvidia.com/cuda) or\n[OptiX](https://developer.nvidia.com/optix) depending on whether or not ray\ntracing operations are used.\n\nThis project can be used independently or as part of the larger\n[Dr.Jit](https://github.com/mitsuba-renderer/drjit) project, which furthermore\nprovides support for automatic differentiation, multidimensional\narrays/tensors, and a large library of mathematical functions.\n\nThe Dr.Jit-Core library has almost no dependencies: it can be compiled without\nCUDA, OptiX, or LLVM actually being present on the system (it will attempt to\nfind them at runtime as needed). The library is implemented in C++14 but\nexposes all functionality through a C99-compatible interface.\n\n## Features\n\nDr.Jit has the following features:\n\n- Runs on Linux (`x86_64`), macOS (`x86_64` \u0026 `aarch64`), and Windows\n  (`x86_64`). Other platforms may work as well but have not been tested.\n\n- Targets\n\n    1. NVIDIA GPUs via CUDA (compute capability 5.0 or newer), and\n\n    2. 
CPUs via LLVM leveraging available vector instruction set extensions\n       (e.g. Neon or AVX/AVX2/AVX512).\n\n- Captures and compiles pure arithmetic, side effects, and higher-level\n  operations (loops and dynamic method dispatch) that are preserved 1:1 in\n  generated kernels.\n\n- Performs several basic optimizations to reduce the amount of LLVM/PTX IR\n  passed to the next compiler stage.\n\n  - Dead code elimination\n  - Constant propagation\n  - Common subexpression elimination via local value numbering\n\n- Supports parallel kernel execution on multiple devices (JITing from several\n  CPU threads, or running kernels on multiple GPUs).\n\n- Provides a fast caching memory allocator that operates in the execution stream of\n  an asynchronous computation device. This addresses a common performance bottleneck.\n\n- Caches and reuses kernels when the same computation is encountered again.\n  Caching is done both in memory and on disk (``~/.drjit`` on Linux and macOS,\n  ``~/AppData/Local/Temp/drjit`` on Windows).\n\n- Provides a variety of parallel reductions for convenience.\n\n## An example (C++)\n\nThe header file\n[drjit-core/array.h](https://github.com/mitsuba-renderer/drjit-core/blob/master/include/drjit-core/array.h)\nprovides a convenient C++ wrapper with operator overloading building\non the C-level API\n([drjit-core/jit.h](https://github.com/mitsuba-renderer/drjit-core/blob/master/include/drjit-core/jit.h)).\nHere is a brief example of how it can be used:\n\n```cpp\n#include \u003cdrjit-core/array.h\u003e\n\nusing Bool   = CUDAArray\u003cbool\u003e;\nusing Float  = CUDAArray\u003cfloat\u003e;\nusing UInt32 = CUDAArray\u003cuint32_t\u003e;\n```\n\nThe above snippet sets up a group of \"capitalized\" types that invoke the JIT\ncompiler. Any arithmetic involving instances of such types, e.g.,\n\n```cpp\nUInt32 c, a = /* .. */, b = /* .. 
*/;\n\nc = (a + b) * 5;\n```\n\nwill conceptually expand to a parallel loop that processes the individual array\nelements, e.g.,\n\n```cpp\nfor (int i = 0; i \u003c array_size; ++i) /* in parallel */ {\n    uint32_t tmp0 = a[i] + b[i];\n    c[i] = tmp0 * 5;\n}\n```\n\nThe evaluation of this loop is decoupled from the original program—in effect,\nrunning the program _decides_ what the contents of this loop should be.\n\nLet's look at a concrete example, using the previously defined types\n\n```cpp\n// Create a floating point array with 101 linearly spaced entries\n// [0, 0.01, 0.02, ..., 1]\nFloat x = linspace\u003cFloat\u003e(0, 1, 101);\n\n// [0, 2, 4, 6, .., 98]\nUInt32 index = arange\u003cUInt32\u003e(50) * 2;\n\n// Equivalent to \"y = x[index]\"\nFloat y = gather(x, index);\n\n// Comparisons produce mask arrays\nBool mask = x \u003c .5f;\n\n// Ternary operator\nFloat z = select(mask, sqrt(x), 1.f / x);\n\nprintf(\"Value is = %s\\n\", z.str());\n```\n\nRunning this program will trigger two kernel launches. The first generates the\n``x`` array (size 101) when it is accessed by the ``gather()`` operation, and\nthe second generates ``z`` (size 101, since it is computed from ``x``) when it is printed in the last line. Both\ncorrespond to points during the execution where evaluation could no longer be\npostponed, e.g., because of the cross-lane memory dependency in the former case.\n\nSimply changing the first lines to\n\n```cpp\n#include \u003cdrjit-core/llvm.h\u003e\n\nusing Bool   = LLVMArray\u003cbool\u003e;\nusing Float  = LLVMArray\u003cfloat\u003e;\nusing UInt32 = LLVMArray\u003cuint32_t\u003e;\n```\n\nswitches to the functionally equivalent LLVM backend. 
By default, the LLVM\nbackend parallelizes execution via a built-in thread pool, enabling usage that\nis very similar to the CUDA variant: a single thread issues computation that is\nthen processed in parallel by all cores of the system.\n\n## How it works\n\nTo understand a bit better how all of this works, we can pop one level down to\nthe C-level interface. The first operation ``jit_init`` initializes Dr.Jit\nand searches for LLVM and/or CUDA as instructed by the user. Note that users\ndon't need to install the CUDA SDK—just having an NVIDIA graphics driver is\nenough.\n\n```cpp\njit_init(JitBackendCUDA);\n```\n\nLet's calculate something: we will start by creating a single-precision\nfloating point variable that is initialized with the value ``0.5``. This\ninvolves the function ``jit_var_f32``, which creates a literal constant\nvariable that depends on no other variables.\n\n```cpp\nuint32_t v0 = jit_var_f32(JitBackendCUDA, .5f);\n```\nThis is a *scalar* variable, which means that it will produce a\nsingle element if evaluated alone, but it can also occur in any computation\ninvolving larger arrays and will expand to the needed size.\n\nPrograms using Dr.Jit will normally create and destroy *vast* numbers of\nvariables, and this operation is therefore highly optimized. The operation\ncreates an entry in a [very efficient hash\ntable](https://github.com/Tessil/robin-map) mapping the resulting variable\nindex ``v0`` to a record ``(backend, type, \u003coperands\u003e)``. 
Over time, this hash\ntable will expand to the size that is needed to support the active computation,\nand from this point onward ``jit_var_..()`` operations will not involve any\nfurther dynamic memory allocation.\n\nLet's do some computation with this variable: we can create a \"counter\", which\nis a Dr.Jit array containing an increasing sequence of integer elements ``[0,\n1, 2, .., 9]`` in this case.\n\n```cpp\nuint32_t v1 = jit_var_counter(/* backend = */ JitBackendCUDA,\n                              /* size    = */ 10);\n```\nCounters always have the variable type ``VarTypeUInt32`` that we next\nconvert into a single precision floating point variable.\n\n```cpp\nuint32_t v2 = jit_var_cast(/* index       = */ v1,\n                           /* target_type = */ VarTypeFloat32,\n                           /* reinterpret = */ 0);\n```\n\nFinally, let's create a more interesting variable that references two of the\nprevious results, ``v0`` and ``v2``, as its operands.\n\n```cpp\nuint32_t v3 = jit_var_add(v0, v2);\n```\nSuppose that we don't plan to perform any\nfurther computation / accesses involving ``v0``, ``v1``, and ``v2``. This must\nbe indicated to Dr.Jit by reducing their reference count.\n\n```cpp\njit_var_dec_ref(v0);\njit_var_dec_ref(v1);\njit_var_dec_ref(v2);\n```\n\nThey still have a nonzero *internal* reference count (i.e. by Dr.Jit itself)\nsince the variable ``v3`` depends on them, and this keeps them from being\ngarbage-collected.\n\nNote that no real computation has happened yet—so far, we were simply\nmanipulating hash table entries. 
Let's finally observe the result of this\ncalculation by printing the array contents:\n\n```cpp\nprintf(\"Result: %s\\n\", jit_var_str(v3));\n```\n\nThis step internally invokes ``jit_var_eval(v3)`` to evaluate the variable,\nwhich creates a CUDA kernel containing all steps that are needed to compute\nthe contents of ``v3`` and write them into device-resident memory.\n\nDuring this compilation step, the following happens: Dr.Jit first traverses\nthe relevant parts of the variable hash table and concatenates all string\ntemplates (with appropriate substitutions) into a complete PTX representation.\nThis step is highly optimized and takes on the order of a few microseconds.\n\nOnce the final PTX string is available, two things can happen: potentially\nwe've never seen this particular sequence of steps before, and in that case the\nPTX code must be further compiled to machine code (\"SASS\", or *streaming\nassembly*). This step involves a full optimizing compiler embedded in the GPU\ndriver, which tends to be very slow: usually it's a factor of 1000-10000×\nslower than the preceding steps within Dr.Jit.\n\nHowever, once a kernel has been compiled, Dr.Jit will *remember* it using\nboth an in-memory and an on-disk cache. In programs that perform the same\nsequence of steps over and over again (e.g. optimization), the slow PTX→SASS\ncompilation step will only occur in the first iteration. Evaluation of ``v3``\nwill turn the variable from a symbolic representation into a GPU-backed array,\nand further queued computation accessing it will simply index into that array\ninstead of repeating the original computation.\n\nAt the end of the program, we must not forget to decrease the reference count\nassociated with ``v3``, which will release the array from memory. 
Finally,\n``jit_shutdown()`` releases any remaining resources held by Dr.Jit.\n\n```cpp\njit_var_dec_ref(v3);\njit_shutdown(0);\n```\n\nRunning this program on a Linux machine provides the following output:\n\n```\njit_init(): creating directory \"/home/wjakob/.drjit\" ..\njit_init(): detecting devices ..\njit_cuda_init(): enabling CUDA backend (version 11.1)\n - Found CUDA device 0: \"GeForce RTX 3090\" (PCI ID 65:00.0, compute cap. 8.6, 82 SMs w/99 KiB shared mem., 23.7 GiB global mem.)\njit_eval(): launching 1 kernel.\n  -\u003e launching e93e70f12fcaea9c (n=10, in=0, out=1, ops=8, jit=2.4 us):\n     cache miss, build: 33.417 ms, 2.98 KiB.\njit_eval(): done.\nResult: [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5]\njit_shutdown(): releasing 1 kernel ..\njit_shutdown(): releasing 1 thread state ..\njit_shutdown(): done\njit_cuda_shutdown()\n```\n\nThese log messages show that Dr.Jit generated a single kernel within 2.4 μs.\nHowever, this kernel was never observed before, necessitating a compilation\nstep by the CUDA driver, which took 33 ms.\n\nNote the ``Result: [...]`` line, which is the expected output of the\ncalculation ``0.5 + [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]``. The extra lines are debug\nstatements that can be controlled by setting the log level to a higher or lower\nlevel. A log callback can also be provided to e.g. route such messages to a\nfile.\n\nLet's actually increase the log level to see some more detail of what is\nhappening under the hood. This can be done by adding the following two lines at\nthe beginning of the program\n\n```cpp\njit_set_log_level_stderr(LogLevelDebug);\njit_set_flag(JitFlagPrintIR, 1);\n```\n\nThis produces the following detailed output (there is also ``LogLevelTrace``\nfor the truly adventurous):\n\n```\njit_init(): detecting devices ..\njit_cuda_init(): enabling CUDA backend (version 11.1)\n - Found CUDA device 0: \"GeForce RTX 3090\" (PCI ID 65:00.0, compute cap. 
8.6, 82 SMs w/99 KiB shared mem., 23.7 GiB global mem.)\njit_var_new(float32 r1): mov.$t0 $r0, 0.5\njit_var_new(uint32 r2[10]): mov.u32 $r0, %r0\njit_var_new(float32 r3[10] \u003c- r2): cvt.rn.$t0.$t1 $r0, $r1\njit_var_cast(float32 r3 \u003c- uint32 r2)\njit_var_new(float32 r4[10] \u003c- r1, r3): add.$t0 $r0, $r1, $r2\njit_eval(): launching 1 kernel.\n  -\u003e launching e93e70f12fcaea9c (n=10, in=0, out=1, ops=8, jit=2.9 us):\njit_eval(): launching 1 kernel.\n.version 6.0\n.target sm_60\n.address_size 64\n\n.entry drjit_e93e70f12fcaea9cecd06e2b4b9ab180(.param .align 8 .b8 params[16]) {\n    .reg.b8   %b \u003c8\u003e; .reg.b16 %w\u003c8\u003e; .reg.b32 %r\u003c8\u003e;\n    .reg.b64  %rd\u003c8\u003e; .reg.f32 %f\u003c8\u003e; .reg.f64 %d\u003c8\u003e;\n    .reg.pred %p \u003c8\u003e;\n\n    mov.u32 %r0, %ctaid.x;\n    mov.u32 %r1, %ntid.x;\n    mov.u32 %r2, %tid.x;\n    mad.lo.u32 %r0, %r0, %r1, %r2;\n    ld.param.u32 %r2, [params];\n    setp.ge.u32 %p0, %r0, %r2;\n    @%p0 bra done;\n\n    mov.u32 %r3, %nctaid.x;\n    mul.lo.u32 %r1, %r3, %r1;\n\nbody: // sm_75\n    mov.f32 %f4, 0.5;\n    mov.u32 %r5, %r0;\n    cvt.rn.f32.u32 %f6, %r5;\n    add.f32 %f7, %f4, %f6;\n    ld.param.u64 %rd0, [params+8];\n    mad.wide.u32 %rd0, %r0, 4, %rd0;\n    st.global.cs.f32 [%rd0], %f7;\n\n    add.u32 %r0, %r0, %r1;\n    setp.ge.u32 %p0, %r0, %r2;\n    @!%p0 bra body;\n\ndone:\n    ret;\n}\n     cache hit, load: 69.195 us, 2.98 KiB.\njit_eval(): cleaning up..\njit_eval(): done.\njit_shutdown(): releasing 1 kernel ..\njit_shutdown(): releasing 1 thread state ..\njit_flush_malloc_cache(): freed\n - device memory: 64 B in 1 allocation\njit_shutdown(): done\njit_cuda_shutdown()\nResult: [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5]\n```\n\nNote in particular the PTX fragment that includes the lines\n\n```\nbody: // sm_75\n    mov.f32 %f4, 0.5;\n    mov.u32 %r5, %r0;\n    cvt.rn.f32.u32 %f6, %r5;\n    add.f32 %f7, %f4, %f6;\n```\n\nThese lines correspond exactly to the 
variables ``v0`` to ``v3``\nthat we had previously defined. The surrounding code establishes a [grid-stride\nloop](https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/)\nthat processes all array elements. This time around, the kernel compilation was\nskipped, and Dr.Jit loaded the kernel from the on-disk cache file\n``~/.drjit/e93e70f12fcaea9cecd06e2b4b9ab180.cuda.bin`` containing an\n[LZ4](https://github.com/lz4/lz4)-compressed version of code and compilation\noutput. The odd hexadecimal value is simply the\n[XXH3](https://cyan4973.github.io/xxHash/) hash of the kernel source code.\n\nWhen a kernel includes OptiX function calls (ray tracing operations), kernels\nare automatically launched through OptiX instead of the CUDA driver API.\n\n### LLVM backend\n\nThe preceding section provided a basic example of Dr.Jit in combination with CUDA.\nLLVM works essentially the same way. Now, the ``backend=`` flag must be set to\n``JitBackendLLVM``.\n\nThe LLVM backend operates on vectors matching the SIMD instruction set of the\nhost processor such as AVX/AVX2/AVX512 or ARM NEON.\n\nA kernel transforming fewer than a few thousand elements will be\nJIT-compiled and executed immediately on the current thread. For large arrays,\nDr.Jit will automatically parallelize evaluation via a thread pool. The\nrepository includes\n[nanothread](https://github.com/mitsuba-renderer/nanothread) as a git\nsubmodule, which is a minimal implementation of the components that are\nnecessary to realize this. The size of this thread pool can also be set to\nzero, in which case all computation will occur on the current thread. 
In this\ncase, another type of parallelism is available by using Dr.Jit from multiple\nthreads at once.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmitsuba-renderer%2Fdrjit-core","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmitsuba-renderer%2Fdrjit-core","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmitsuba-renderer%2Fdrjit-core/lists"}