{"id":19817160,"url":"https://github.com/nvpro-samples/vk_timeline_semaphore","last_synced_at":"2026-01-28T05:23:40.208Z","repository":{"id":86083490,"uuid":"430967485","full_name":"nvpro-samples/vk_timeline_semaphore","owner":"nvpro-samples","description":"Vulkan timeline semaphore + async compute performance sample","archived":false,"fork":false,"pushed_at":"2024-06-28T09:56:08.000Z","size":9764,"stargazers_count":29,"open_issues_count":0,"forks_count":2,"subscribers_count":11,"default_branch":"main","last_synced_at":"2025-06-04T05:13:10.924Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"GLSL","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nvpro-samples.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-11-23T05:11:51.000Z","updated_at":"2025-04-04T02:17:29.000Z","dependencies_parsed_at":"2024-06-28T11:21:10.342Z","dependency_job_id":null,"html_url":"https://github.com/nvpro-samples/vk_timeline_semaphore","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/nvpro-samples/vk_timeline_semaphore","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nvpro-samples%2Fvk_timeline_semaphore","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nvpro-samples%2Fvk_timeline_semaphore/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nvpro-samples%2Fvk_timeline_semaphore/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nvpro
-samples%2Fvk_timeline_semaphore/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nvpro-samples","download_url":"https://codeload.github.com/nvpro-samples/vk_timeline_semaphore/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nvpro-samples%2Fvk_timeline_semaphore/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28840089,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-28T02:10:51.810Z","status":"ssl_error","status_checked_at":"2026-01-28T02:10:50.806Z","response_time":57,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-12T10:11:51.613Z","updated_at":"2026-01-28T05:23:40.186Z","avatar_url":"https://github.com/nvpro-samples.png","language":"GLSL","funding_links":[],"categories":[],"sub_categories":[],"readme":"# vk_timeline_semaphore\n\nThis sample provides a concrete example of how timeline semaphores and\nasynchronous compute-only queues can be used to speed up a\nheterogeneous compute/graphics Vulkan application.\n\n# Build and Run\n\nClone https://github.com/nvpro-samples/nvpro_core.git next to this\nrepository (or pull latest `master` if you already have it)\n\n`mkdir build \u0026\u0026 cd build \u0026\u0026 cmake .. # Or use CMake GUI`\n\nIf there are missing dependencies (e.g. 
glfw), run `git submodule\nupdate --init --recursive --checkout --force` in the `nvpro_core`\nrepository.\n\nThen start the generated `.sln` in VS or run `make -j`.\n\nRun `vk_timeline_semaphore` or\n`../../bin_x64/Release/vk_timeline_semaphore.exe`\n\n# Timeline Semaphore Summary\n\n*Please skip this section if you are already familiar with timeline\n semaphores and their benefits over binary semaphores.*\n\n[`VK_KHR_timeline_semaphore`](https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_KHR_timeline_semaphore.html)\nintroduces a new type of semaphore that has more functionality than\nthe default \"binary\" semaphore introduced in the original Vulkan. This\nfeature is core in Vulkan 1.2, although it requires the\n[`timelineSemaphore`\nfeature](https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VkPhysicalDeviceVulkan12Features.html).\n\nThe original binary semaphore has only two states, signaled and\nunsignaled, and two possible operations: signalling it upon the\ncompletion of a queue submit, and waiting on it (device-side) before\nstarting a pending queue submit (implicitly resetting the semaphore to\nthe unsignaled state). This imposes some limitations:\n\n* If the user recycles a binary semaphore, they should take care that\n  the semaphore strictly alternates between being signalled and being\n  waited on (i.e. that the producer does not \"overshoot\" and signal\n  the semaphore again before the consumer waits and resets the\n  semaphore).\n\n* The implicit unsignal operation makes it impossible to use a single\n  binary semaphore to unblock two queues at once (e.g. graphics and\n  compute both waiting for the same transfer command to finish).\n\nFurthermore, the spec requires that before submitting any work that\nwaits on a binary semaphore, the work that signals said semaphore must\nbe submitted first.  
These limitations generally have the consequence\nof making it cumbersome to implement fine-grained dependencies with\nbinary semaphores.\n\nInstead of only having 1 bit of state, timeline semaphores have an\ninternal 64-bit unsigned integer counter. Each signal and wait\noperation requires an additional 64-bit unsigned parameter, which\n\n* for signals, indicates the new value to set the counter to, subject\n  to the restriction that this value must be strictly greater than\n  the semaphore's value at the time the signal operation executes.\n\n* for waits, indicates the *minimum value* of the timeline semaphore\n  such that the waiting operation may proceed.\n\nThis makes timeline semaphores a natural choice for expressing\nfine-grained producer/consumer dependencies: because the wait only\nrequires a minimum value to proceed, and the value strictly increases,\nthere is no \"overshoot\" risk of the kind that binary semaphores have.\n\nFurther features of timeline semaphores include:\n\n* Timeline semaphores may also be\n  [signalled](https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/vkSignalSemaphoreKHR.html)\n  and [waited\n  on](https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/vkWaitSemaphoresKHR.html)\n  by the host, taking the place of fences and events.\n\n* Timeline semaphores do not have any restrictions on the order that\n  dependent work is submitted. Of course, you still risk resetting the\n  GPU if you introduce circular dependencies, or introduce an extreme\n  delay between submitting dependent work, and submitting its\n  dependencies.\n\nUnfortunately, at the time of writing, timeline semaphores cannot be\nused to synchronize with the swapchain. 
Thus most Vulkan programs\ncannot standardize completely on timeline semaphores.\n\n# Motivation\n\nAs a stand-in for the sort of heterogeneous work a Vulkan application\nmay need to synchronize, the sample implements an approximate\n[implicit surface\nrenderer](https://en.wikipedia.org/wiki/Implicit_surface) using the\n[marching cubes\nalgorithm](https://en.wikipedia.org/wiki/Marching_cubes). Understanding\nthe details of the algorithm is not necessary to understand the\nsynchronization this sample demonstrates; in summary, the steps are:\n\n1. Generating a 3D sample grid of values by evaluating a scalar\n   function `F(x,y,z)` at each coordinate. This is done by a compute\n   shader, with the results stored to a `VkImage` with\n   `VK_IMAGE_TYPE_3D`.\n\n2. Generating a vertex buffer by having a compute shader analyze each\n   cubical \"cell\" of 8 neighboring samples. The compute shader outputs\n   triangles for each cell where samples transition from positive to\n   negative, thus approximating the implicit `F(x,y,z)=0` boundary\n   between the positive and negative regions.\n\n3. Using a graphics pipeline to draw the generated triangles.\n\nSynchronization is needed between consecutive steps. Furthermore, this\nalgorithm is very memory intensive – both the 3D image and vertex\nbuffer take up huge amounts of VRAM – so to render at a good\nresolution, we will have to split the model into chunks, and perform\nthese steps separately for each chunk. This requires an additional\nsynchronization from step 3 to step 1, so that the computed results\naren't overwritten before they are fully drawn (i.e. 
to resolve the\n[WAR\nhazard](https://en.wikipedia.org/wiki/Hazard_(computer_architecture)#Write_after_read_(WAR))).\n\n*Visualization of model split into 4×4×4 grid of chunks*\n![Visualization of model split into 4×4×4 grid of\n chunks](./docs/chunk_bounds.png)\n\nOne way to implement this would be to submit commands for all three\nsteps to a single GCT queue (graphics/compute/transfer), inserting\npipeline barriers for synchronization. The downsides to this approach\nare:\n\n* Since there is only one queue being used, the device work completely\n  drains at each barrier, with no alternative work to do to alleviate\n  this waste.\n\n* We will see later on that the graphics work in this sample is very\n  light in terms of SM usage; the device is seriously underutilized\n  when working solely on step 3.\n\nA better alternative would be to move steps 1 and 2 to a dedicated compute\nqueue, and use timeline semaphores to handle the RAW hazard (steps 2 to 3)\nand WAR hazard (steps 3 to 1). The theoretical benefits of this are\n\n* We still have a pipeline barrier from step 1 to step 2; the dip in\n  compute utilization will not completely bring the device to 0\n  utilization since the separate graphics queue's work can \"fill in\n  the gap\" (although the lightness of this work somewhat limits this\n  effect).\n\n* The compute work can proceed in parallel with the graphics work and\n  take advantage of the SMs that the graphics work underutilizes.\n\nLater we will see how these theoretical benefits translate to actual\nbetter performance.\n\n## Implementation\n\nIn the code, the data structure that holds the 3D image and vertex\nbuffer used for communicating between the steps of the marching cubes\nalgorithm is called `McubesChunk`, implemented in `mcubes_chunk.cpp`.\nWe allocate on startup a pool of `MCUBES_CHUNK_COUNT`-many such\nstructures, held in `g_mcubesChunkArray`. 
We treat this array as a\n[ring buffer](https://en.wikipedia.org/wiki/Circular_buffer), picking\nthe next in the buffer each time we record a new command to fill or\ndraw an `McubesChunk` instance.\n\nThe central function to look at for this sample is\n`computeDrawCommandsTwoQueues` in `timeline_semaphore_main.cpp`, which\ndoes the task of recording and submitting the compute and graphics\ncommands for a frame. To keep the focus on timeline semaphores, the\ndetails for these commands are moved to `compute.cpp` and\n`graphics.cpp`.\n\nWe've allocated two timeline semaphores:\n`s_computeDoneTimelineSemaphore`, for resolving the RAW hazard\n(compute queue produced vertex buffer → graphics queue draw) and\n`s_graphicsDoneTimelineSemaphore` for resolving the WAR hazard\n(graphics queue done drawing → compute queue refills). This is done\nwith the same `vkCreateSemaphore` API, but with an additional\n[`VkSemaphoreTypeCreateInfo`](https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VkSemaphoreTypeCreateInfo.html)\non the `pNext` chain.\n\n    // Allocate the timeline semaphores; initial value 0. 
Need extension struct for this.\n    VkSemaphoreTypeCreateInfo timelineSemaphoreInfo = {VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO, nullptr,\n                                                       VK_SEMAPHORE_TYPE_TIMELINE, 0};\n    VkSemaphoreCreateInfo     semaphoreInfo         = {VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO, \u0026timelineSemaphoreInfo};\n    NVVK_CHECK(vkCreateSemaphore(g_ctx, \u0026semaphoreInfo, nullptr, \u0026s_computeDoneTimelineSemaphore));\n    NVVK_CHECK(vkCreateSemaphore(g_ctx, \u0026semaphoreInfo, nullptr, \u0026s_graphicsDoneTimelineSemaphore));\n\n**NOTE:** [Read this file alone to eliminate horizontal scrolling.](https://github.com/nvpro-samples/vk_timeline_semaphore/blob/main/README.md#Implementation)\n\nThese semaphores are passed to\n[`vkQueueSubmit`](https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/vkQueueSubmit.html)\nin the same way as binary semaphores, except that there is an\nadditional\n[`VkTimelineSemaphoreSubmitInfo`](https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VkTimelineSemaphoreSubmitInfo.html)\nextension struct to provide the array of 64-bit counter value parameters.\n\n    uint64_t                      computeWaitTimelineValue = 0, computeSignalTimelineValue = 0;\n    uint64_t                      graphicsWaitTimelineValue = 0, graphicsSignalTimelineValue = 0;\n    VkTimelineSemaphoreSubmitInfo computeTimelineInfo = {\n        VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO,\n        nullptr,\n        1,\n        \u0026computeWaitTimelineValue,  // Compute queue waits for /at least/ this timeline semaphore value of\n        1,                          // s_graphicsDoneTimelineSemaphore (semaphore set below).\n        \u0026computeSignalTimelineValue};\n    // ...\n    // analogous for graphicsTimelineInfo\n    VkSubmitInfo computeSubmitInfo  = {VK_STRUCTURE_TYPE_SUBMIT_INFO,\n                                      \u0026computeTimelineInfo,  // Extension 
struct\n                                      1,\n                                      \u0026s_graphicsDoneTimelineSemaphore,  // Compute waits for graphics queue\n                                      \u0026computeStage,                     // Waits for semaphore before starting compute\n                                      1,\n                                      \u0026batchComputeCmdBuf,\n                                      1,\n                                      \u0026s_computeDoneTimelineSemaphore};\n\nOnce the model is subdivided into chunks, the list of chunks to draw\nis divided into batches of up to `pGui-\u003em_batchSize` chunks each. This\nis the amount of work that one compute or one graphics command buffer\nwill do. Each batch compute/graphics submit increments the values of\nthe corresponding timeline semaphore (orchestrated by the\n`s_upcomingTimelineValue` variable), so, the batch size in some sense\nis the granularity of synchronization.\n\nThe pattern for the RAW synchronization (compute → graphics) is\nstraightforward: the graphics command buffer submitted for a batch\nwaits on the `s_computeDoneTimelineSemaphore` value set by the compute\ncommand buffer for the same batch. So, the graphics work for a batch\ncan proceed immediately after the compute work for the batch is\ncomplete.\n\n    computeSignalTimelineValue = s_upcomingTimelineValue;\n    NVVK_CHECK(vkQueueSubmit(g_computeQueue, 1, \u0026computeSubmitInfo, VkFence{}));\n    // Recall computeSignalTimelineValue is pointed-to by VkTimelineSemaphoreSubmitInfo, above.\n    // ...\n    graphicsWaitTimelineValue = s_upcomingTimelineValue;\n    // ...\n    NVVK_CHECK(vkQueueSubmit(g_gctQueue, 1, \u0026graphicsSubmitInfo, VkFence{}));\n\nThe pattern for the WAR synchronization (graphics → compute) is a bit\nmore interesting. We need to record, for each `McubesChunk`, when it\nwill be fully drawn and ready for recycling. 
So, each time an\n`McubesChunk` is used in a graphics batch, we record the value that\n`s_graphicsDoneTimelineSemaphore` will be set to by the signal\noperation of that batch. This is stored in\n`McubesChunk::timelineValue`.\n\nThen, when recording the newer compute batch, for each `McubesChunk`\nselected for use, we have to wait for\n`s_graphicsDoneTimelineSemaphore` to reach the value recorded in\n`McubesChunk::timelineValue`; more optimally, we just have to wait for\nthe maximum `McubesChunk::timelineValue` of all `McubesChunk`\ninstances selected.\n\n*Setting `McubesChunk::timelineValue`*\n\n    // Graphics commands.\n    for(uint32_t localIndex = 0, paramIndex = batchStart; paramIndex \u003c batchEnd; ++paramIndex, ++localIndex)\n    {\n      // Record the s_graphicsDoneTimelineSemaphore value for this McubesChunk that indicates readiness for recycling.\n      chunkPointerArray[localIndex]-\u003etimelineValue = s_upcomingTimelineValue;\n    }\n    // ...\n    graphicsCmdDrawMcubesGeometryBatch(batchGraphicsCmdBuf, batchEnd - batchStart, chunkPointerArray, /* ... 
*/);\n    // ...\n    graphicsSignalTimelineValue = s_upcomingTimelineValue;\n    NVVK_CHECK(vkQueueSubmit(g_gctQueue, 1, \u0026graphicsSubmitInfo, VkFence{}));\n\n*Waiting on maximum `McubesChunk::timelineValue` value*\n\n    // Record compute commands.\n    // We also keep track of the s_graphicsDoneTimelineSemaphore value that these compute commands\n    // need to wait on (to safely recycle the McubesChunk).\n    for(uint32_t localIndex = 0, paramIndex = batchStart; paramIndex \u003c batchEnd; ++paramIndex, ++localIndex)\n    {\n      computeWaitTimelineValue = std::max(computeWaitTimelineValue, chunkPointerArray[localIndex]-\u003etimelineValue);\n    }\n    computeCmdFillChunkBatch(batchComputeCmdBuf, batchEnd - batchStart, chunkPointerArray, \u0026paramsList[batchStart]);\n    // recall computeWaitTimelineValue is pointed-to by computeTimelineInfo (VkTimelineSemaphoreSubmitInfo).\n\nNOTE: We use `timelineValue = 0` as a safe value for `McubesChunk`\ninstances that have never been drawn as `x \u003e= 0` for all possible\nunsigned timeline semaphore values `x`.\n\nThe `s_computeDoneTimelineSemaphore` only handles the execution order\ndependency. We still need a pipeline barrier to handle the memory\ndependency (in hardware terms, flushing caches, etc.). 
This is the\ncommand executed in the *graphics* command buffer for each batch,\nbefore any `McubesChunk`-drawing commands.\n\n    // Ensure memory dependency resolved between upcoming compute command and upcoming graphics commands.\n    // This is separate from (and an additional requirement on top of) the execution dependency\n    // handled by the timeline semaphore.\n    // No queue ownership transfer -- using VK_SHARING_MODE_CONCURRENT.\n    VkMemoryBarrier computeToGraphicsBarrier = {VK_STRUCTURE_TYPE_MEMORY_BARRIER, nullptr, VK_ACCESS_SHADER_WRITE_BIT,\n                                                VK_ACCESS_INDIRECT_COMMAND_READ_BIT | VK_ACCESS_SHADER_READ_BIT};\n    vkCmdPipelineBarrier(batchGraphicsCmdBuf, computeStage, readGeometryArrayStage, 0, 1, \u0026computeToGraphicsBarrier, 0,\n                         0, 0, 0);\n\n*Note that this is not just a theoretical concern!* The author failed\n to include this barrier at first, and experienced sporadic\n corruption.\n\nWe do not need a memory barrier for the other direction (graphics to\ncompute WAR hazard): a WAR hazard needs only an execution dependency,\nsince we are going to overwrite the `McubesChunk` contents anyway.\n\n---\n\nFor comparison, an implementation using only one queue and pipeline\nbarriers is in the `computeDrawCommandsGctOnly` function in\n`timeline_semaphore_main.cpp`.\n\nNote that for both the `computeDrawCommandsTwoQueues` and\n`computeDrawCommandsGctOnly` code paths, we are using a separate\n`submitFrame` function to do the task of acquiring/submitting swap\nchain images, and copying the drawn image to the swap chain. As\nmentioned before, this must still use binary semaphores, and hence,\nthe programmer must follow their in-order submission requirements and\nensure that all command buffers for the frame have already been\nsubmitted. 
This is simple for this sample, but more care would need to\nbe taken if some command recording were moved to another thread.\n\n---\n\nThere are some debug views for visualizing the chunking and batching\nalgorithm.\n\nFirst, we can visualize how chunks are packed into batches; all chunks\nin a given batch use the same color: ![chunks colored by\nbatch](docs/batch_size_6.png)\n\nContrast this with the view when the batch size is reduced to 2 chunks:\n![chunks colored by batch](docs/batch_size_2.png)\n\nSecond, we can color based on the physical `McubesChunk` instance used\nto draw each chunk. Each instance is assigned a unique color: ![chunks\ncolored by McubesChunk instance used](docs/color_by_McubesChunk.png)\n\nObserve how multiple different chunks can share the same color. This\nindicates that we've successfully recycled `McubesChunk` instances in\na single frame, and the lack of corruption suggests the data hazards\nwere properly resolved.\n\n## Results\n\nOn a release build running on Ubuntu 18.04 and an RTX 3090, we get\naround 230 FPS with the default settings, compared to around 200 FPS with\nthe async compute queue disabled (toggle with the `c` key).\n\nWe can take a look at what's really going on using [Nsight Graphics\nGPU\nTrace](https://docs.nvidia.com/nsight-graphics/UserGuide/#gpu_trace_0).\n\nWith the async compute queue **disabled**, we see:\n![GCT-queue only GPU Trace](./docs/gpu_trace_gct_only.png)\n\nPaying attention to the \"SM Occupancy\" row, we see that there are\nperiodic dips in utilization, corresponding to the graphics work of\neach batch.\n\n*Color Key for SM Occupancy*\n![Color key for SM Occupancy](./docs/gpu_trace_key.png)\n\nWhen we **enable** the async compute queue, we see:\n![Two-queue GPU Trace](./docs/gpu_trace_two_queues.png)\n\nThe SM occupancy in this case is not perfect either, but is a serious\nimprovement over using only one queue: we can see that the (orange)\ncompute work is proceeding more smoothly, and has 
successfully been\noverlapped with the (blue) graphics shader work.\n\n*Note:* On some older versions of Nsight Graphics, timeline semaphore\n synchronization is mislabelled as fence synchronization.\n\n## Acknowledgement\n\nThank you to Christoph Kubisch for spotting the missing memory\nbarrier, and Nia Bickford for edits and review.\n\n## License\n\nCopyright 2021 NVIDIA CORPORATION. Released under Apache License,\nVersion 2.0. See \"LICENSE\" file for details.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnvpro-samples%2Fvk_timeline_semaphore","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnvpro-samples%2Fvk_timeline_semaphore","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnvpro-samples%2Fvk_timeline_semaphore/lists"}