{"id":13732073,"url":"https://github.com/GameTechDev/MaskedOcclusionCulling","last_synced_at":"2025-05-08T06:31:13.573Z","repository":{"id":85198061,"uuid":"61160195","full_name":"GameTechDev/MaskedOcclusionCulling","owner":"GameTechDev","description":"Example code for the research paper \"Masked Software Occlusion Culling\"; implements an efficient alternative to the hierarchical depth buffer algorithm.","archived":false,"fork":false,"pushed_at":"2023-12-11T18:59:17.000Z","size":617,"stargazers_count":600,"open_issues_count":8,"forks_count":77,"subscribers_count":54,"default_branch":"master","last_synced_at":"2024-08-04T02:10:45.709Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://software.intel.com/en-us/articles/masked-software-occlusion-culling","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GameTechDev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"license.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2016-06-14T22:33:47.000Z","updated_at":"2024-07-17T18:29:54.000Z","dependencies_parsed_at":"2023-03-03T22:45:42.550Z","dependency_job_id":null,"html_url":"https://github.com/GameTechDev/MaskedOcclusionCulling","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GameTechDev%2FMaskedOcclusionCulling","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GameTechDev%2FMaskedOcclusionCulling/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GameTechDev%2FMaskedOcclusionCulling/releases","manifests_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/repositories/GameTechDev%2FMaskedOcclusionCulling/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GameTechDev","download_url":"https://codeload.github.com/GameTechDev/MaskedOcclusionCulling/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224707671,"owners_count":17356376,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T02:01:45.521Z","updated_at":"2024-11-14T23:30:59.484Z","avatar_url":"https://github.com/GameTechDev.png","language":"C++","funding_links":[],"categories":["Graphics"],"sub_categories":[],"readme":"# MaskedOcclusionCulling\n\nThis code accompanies the research paper [\"Masked Software Occlusion Culling\"](https://software.intel.com/en-us/articles/masked-software-occlusion-culling),\nand implements an efficient alternative to the hierarchical depth buffer algorithm. Our algorithm decouples depth values and coverage, and operates directly\non the hierarchical depth buffer. It lets us efficiently parallelize both coverage computations and hierarchical depth buffer updates.\n\n## Update May 2018\n\nAdded the ability to merge two depth buffers. This allows both an alternative method for parallelizing buffer creation and a way to reduce silhouette bleed when input data cannot be roughly sorted from front to back, for example when rendering large terrain patches with foreground occluders in an open world game engine.\n\n## Requirements\n\nThis code is mainly optimized for AVX capable CPUs. 
However, we also provide SSE 4.1 and SSE 2 implementations for backwards compatibility. The appropriate \nimplementation will be chosen during run-time based on the CPU's capabilities.\n\n## Notes on build time\n\nThe code is optimized for runtime performance and may require a long time to compile due to heavy code inlining. This can be worked around by compiling \na library file. An alternative solution is to disable *whole program optimizations* for the `MaskedOcclusionCulling.cpp`, \n`MaskedOcclusionCullingAVX2.cpp` and `MaskedOcclusionCullingAVX512.cpp` files. It does not impact runtime performance, but greatly reduces the time of program linking. \n\n## \u003ca name=\"cs\"\u003e\u003c/a\u003eNotes on coordinate systems and winding\n\nMost inputs are given as clip space (x,y,w) coordinates assuming the same right handed coordinate system as used by DirectX and OpenGL (x positive right, y\npositive up and w positive in the view direction). Note that we use the clip space w coordinate for depth and disregard the z coordinate. Internally our\nmasked hierarchical depth buffer stores *depth = 1 / w*. \n\nThe `TestRect()` function is an exception and instead accepts normalized device coordinates (NDC), *(x' = x/w, y' = y/w)*, where the visible screen region\nmaps to the range [-1,1] for *x'* and *y'* (x positive right and y positive up). Again, this is consistent with both DirectX and OpenGL behavior.\n\nBy default, the screen space coordinate system used internally to access our hierarchical depth buffer follows DirectX conventions (y positive down), which is\n**not** consistent with OpenGL (y positive up). This can be configured by changing the `USE_D3D` define. 
The screen space coordinate system affects the layout\nof the buffer returned by the `ComputePixelDepthBuffer()` function, scissor rectangles (which are specified in screen space coordinates), and rasterization\ntie-breaker rules if `PRECISE_COVERAGE` is enabled.\n\n## API / Tutorial\n\nWe have made an effort to keep the API as simple and minimal as possible. The rendering functions are quite similar to submitting DirectX or OpenGL draw calls\nand we hope they will feel natural to anyone with graphics programming experience. In the following we will use the example project as a tutorial to showcase\nthe API. Please refer to the documentation in the header file for further details.\n\n### Setup\n\nWe begin by creating a new instance of the occlusion culling object. The object is created using the static `Create()` function rather than a standard\nconstructor, and can be destroyed using the `Destroy()` function. The reason for using the factory `Create()`/`Destroy()` design pattern is that we want to\nsupport custom (aligned) memory allocators, and that the library chooses either the AVX-512, AVX or SSE implementation based on the CPU's capabilities.\n\n```C++\nMaskedOcclusionCulling *moc = MaskedOcclusionCulling::Create();\n\n...\n\nMaskedOcclusionCulling::Destroy(moc);\n```\n\nThe created object is empty and has no hierarchical depth buffer attached, so we must first allocate a buffer using the `SetResolution()` function. This function can\nalso be used later to resize the hierarchical depth buffer, causing it to be re-allocated. Note that the resolution width must be a multiple of 8, and the height\na multiple of 4. This is a limitation of the occlusion culling algorithm.\n\n```C++\nint width = 1920;\nint height = 1080;\nmoc-\u003eSetResolution(width, height);   // Set full HD resolution\n```\n\nAfter setting the resolution we can start rendering occluders and performing occlusion queries. 
We must first clear the hierarchical depth buffer:\n\n```C++\n// Clear hierarchical depth buffer to far depth\nmoc-\u003eClearDepthBuffer();\n```\n\n**Optional** The `SetNearClipPlane()` function can be used to configure the distance to the near clipping plane to make the occlusion culling renderer match your DX/GL\nrenderer. The default value for the near plane is 0 which should work as expected unless your application relies on having onscreen geometry clipped by\nthe near plane.\n\n```C++\nfloat nearClipDist = 1.0f;\nmoc-\u003eSetNearClipPlane(nearClipDist); // Set near clipping dist (optional)\n```\n\n### Occluder rendering\n\nThe `RenderTriangles()` function renders triangle meshes to the hierarchical depth buffer. Similar to DirectX/OpenGL, meshes are constructed from a vertex array\nand a triangle index array. By default, the vertices are given as *(x,y,z,w)* floating point clip space coordinates, but the *z*-coordinate is ignored and\ninstead we use *depth = 1 / w*. We expose a `TransformVertices()` utility function to transform vertices from *(x,y,z,1)* model/world space to *(x,y,z,w)* clip\nspace, but you can use your own transform code as well. For more information on the `TransformVertices()` function, please refer to the documentation in the\nheader file.\n\nThe triangle index array is identical to a DirectX or OpenGL triangle list and connects vertices to form triangles. Every three indices in the array form a new\ntriangle, so the size of the array must be a multiple of 3. Note that we only support triangle lists, and we currently have no plans to support other primitives\nsuch as strips or fans.\n\n```C++\nstruct ClipSpaceVertex { float x, y, z, w; };\n\n// Create an example triangle. The z component of each vertex is not used by the\n// occlusion culling system. 
\nClipSpaceVertex triVerts[] = { { 5, 0, 0, 10 }, { 30, 0, 0, 20 }, { 10, 50, 0, 40 } };\nunsigned int triIndices[] = { 0, 1, 2 };\nunsigned int nTris = 1;\n\n// Render an example triangle\nmoc-\u003eRenderTriangles((float*)triVerts, triIndices, nTris);\n```\n\n**Transform** It is possible to include a transform when calling `RenderTriangles()`, by passing the modelToClipSpace parameter. This is equivalent to calling `TransformVertices()`, followed\nby `RenderTriangles()`, but performing the transform as shown in the example below typically\nleads to better performance.\n\n```C++\n// Example matrix swapping the x and y coordinates\nfloat swapxyMatrix[4][4] = {\n\t{0,1,0,0},\n\t{1,0,0,0},\n\t{0,0,1,0},\n\t{0,0,0,1}};\n\n// Render triangle with transform.\nmoc-\u003eRenderTriangles((float*)triVerts, triIndices, nTris, (float*)swapxyMatrix);\n```\n\n**Backface Culling** By default, clockwise wound triangles are considered backfacing and are culled when rasterizing occluders. However, you can \nconfigure the `RenderTriangles()` function to backface cull either clockwise or counter-clockwise wound triangles, or to disable backface culling\nfor two-sided rendering.\n\n```C++\n// A clockwise wound (normally backfacing) triangle\nClipSpaceVertex cwTriVerts[] = { { 7, -7, 0, 20 },{ 7.5, -7, 0, 20 },{ 7, -7.5, 0, 20 } };\nunsigned int cwTriIndices[] = { 0, 1, 2 };\n\n// Render with counter-clockwise backface culling, the triangle is drawn\nmoc-\u003eRenderTriangles((float*)cwTriVerts, cwTriIndices, 1, nullptr, BACKFACE_CCW);\n```\n\nThe rasterization code only handles counter-clockwise wound triangles, so configurable backface culling is implemented by re-winding clockwise wound triangles \non the fly. Therefore, other culling modes than `BACKFACE_CW` may decrease performance slightly.\n\n**Clip Flags** `RenderTriangles()` accepts an additional parameter to optimize polygon clipping. 
The calling application may disable any clipping plane if it can\nguarantee that the mesh does not intersect said clipping plane. In the example below we have a quad which is entirely on screen, and we can disable\nall clipping planes. **Warning:** incorrectly disabling clipping planes is unsafe and may cause the program to crash or perform out of bounds\nmemory accesses. Consider this a power user feature (use `CLIP_PLANE_ALL` to clip against the full frustum when in doubt).\n\n```C++\n// Create a quad completely within the view frustum\nClipSpaceVertex quadVerts[]\n\t= { { -150, -150, 0, 200 },{ -10, -65, 0, 75 },{ 0, 0, 0, 20 },{ -40, 10, 0, 50 } };\nunsigned int quadIndices[] = { 0, 1, 2, 0, 2, 3 };\nunsigned int nTris = 2;\n\n// Render the quad. As an optimization, indicate that clipping is not required\nmoc-\u003eRenderTriangles((float*)quadVerts, quadIndices, nTris, nullptr, BACKFACE_CW, CLIP_PLANE_NONE);\n```\n\n**Vertex Storage Layout** Finally, `RenderTriangles()` supports a configurable vertex storage layout. The code so far has used an array of structs (AoS) layout based \non the `ClipSpaceVertex` struct, and this is the default behaviour. You may use the `VertexLayout` struct to configure the memory layout of the vertex data. 
Note that \nthe vertex pointer passed to `RenderTriangles()` should point at the *x* coordinate of the first vertex, so there is no x coordinate offset specified in the struct.\n\n```C++\nstruct VertexLayout\n{\n\tint mStride;  // Stride between vertices\n\tint mOffsetY; // Offset to vertex y coordinate\n\tint mOffsetW; // Offset to vertex w coordinate\n};\n```\n\nFor example, you can configure a struct of arrays (SoA) layout as follows:\n\n```C++\n// A triangle specified in struct of arrays (SoA) form\nfloat SoAVerts[] = {\n\t 10, 10,   7, // x-coordinates\n\t-10, -7, -10, // y-coordinates\n\t 10, 10,  10  // w-coordinates\n};\n\n// Set vertex layout (stride, y offset, w offset)\nVertexLayout SoAVertexLayout(sizeof(float), 3 * sizeof(float), 6 * sizeof(float));\n\n// Render triangle with SoA layout\nmoc-\u003eRenderTriangles((float*)SoAVerts, triIndices, 1, nullptr, BACKFACE_CW, CLIP_PLANE_ALL, SoAVertexLayout);\n```\n\nVertex layout may affect performance. We have seen no large performance impact when using either SoA or AoS layout, but generally speaking the\nvertex position data should be packed as compactly into memory as possible to minimize the number of cache misses. It is, for example, not advisable to bundle vertex\nposition data together with normals, texture coordinates, etc. and use a large stride.\n\n### Occlusion queries\n\nAfter rendering a few occluder meshes you can begin to perform occlusion queries. There are two functions for occlusion queries, called `TestTriangles()` and\n`TestRect()`. The `TestTriangles()` function is identical to `RenderTriangles()` with the exception that it performs an occlusion query and does not\nupdate the hierarchical depth buffer. The result of the occlusion query is returned as an enum, which indicates if the triangles are `VISIBLE`, `OCCLUDED`, or were\n`VIEW_CULLED`. 
Here, `VIEW_CULLED` means that all triangles were either frustum or back face culled, so no occlusion culling test had to be performed.\n\n```C++\n// A triangle that is partly, but not completely, overlapped by the quad rendered before\nClipSpaceVertex oqTriVerts[] = { { 0, 50, 0, 200 },{ -60, -60, 0, 200 },{ 20, -40, 0, 200 } };\nunsigned int oqTriIndices[] = { 0, 1, 2 };\nunsigned int nTris = 1;\n\n// Perform an occlusion query. The triangle is visible and the query should return VISIBLE\nCullingResult result = moc-\u003eTestTriangles((float*)oqTriVerts, oqTriIndices, nTris);\n```\n\nThe `TestRect()` function performs an occlusion query for a rectangular screen space region with a given depth. It can be used to, for example, quickly test\nthe projected bounding box of an object to determine if the entire object is visible or not. The function is considerably faster than `TestTriangles()` because\nit does not require input assembly, clipping, or triangle rasterization. The queries are typically less accurate as screen space bounding rectangles tend to\ngrow quite large, but we've personally seen the best overall performance using this type of culling.\n\n```C++\n// Perform an occlusion query testing if a rectangle is visible. The rectangle is completely\n// behind the previously drawn quad, so the query should indicate that it's occluded\nresult = moc-\u003eTestRect(-0.6f, -0.6f, -0.4f, -0.4f, 100);\n```\n\nUnlike the other functions the input to `TestRect()` is normalized device coordinates (NDC). Normalized device coordinates are projected clip space coordinates\n*(x' = x/w, y' = y/w)* and the visible screen maps to the range [-1,1] for both the *x'* and *y'* coordinate. The w coordinate is still given in clip space,\nhowever. 
It is up to the application to compute projected bounding rectangles from the object's bounding shapes.\n\n### Debugging and visualization\n\nWe expose a utility function, `ComputePixelDepthBuffer()`, that can be used to visualize the hierarchical depth buffer used internally by the occlusion culling\nsystem. The function fills in a complete per-pixel depth buffer, but the internal representation is hierarchical with just two depth values and a mask stored per\ntile. It is not reasonable to expect the image to completely match the exact depth buffer, and you may notice some areas where background objects leak through\nthe foreground. Leakage is part of the algorithm (and one reason for the high performance), and we have\nnot found it to be problematic. However, if you experience issues due to leakage you may want to disable the `QUICK_MASK` define, described in more detail in the\nsection on [hierarchical depth buffer updates](#update).\n\n```C++\n// Compute a per pixel depth buffer from the hierarchical depth buffer, used for visualization.\nfloat *perPixelZBuffer = new float[width * height];\nmoc-\u003eComputePixelDepthBuffer(perPixelZBuffer);\n```\n\nWe also support basic instrumentation to help with profiling and debugging. By defining `ENABLE_STATS` in the header file, the occlusion culling code will\ngather statistics about the number of occluders rendered and occlusion queries performed. For more details about the statistics, see the\n`OcclusionCullingStatistics` struct. The statistics can be queried using the `GetStatistics()` function, which will simply return a zeroed struct if `ENABLE_STATS`\nis not defined. 
Note that instrumentation reduces performance somewhat and should generally be disabled in release builds.\n\n```C++\nOcclusionCullingStatistics stats = moc-\u003eGetStatistics();\n```\n\n### Memory management\n\nAs shown in the example below, you may optionally provide callback functions for allocating and freeing memory when creating a\n`MaskedOcclusionCulling` object. The functions must support aligned allocations.\n\n```C++\nvoid *alignedAllocCallback(size_t alignment, size_t size)\n{\n\t...\n}\n\nvoid alignedFreeCallback(void *ptr)\n{\n\t...\n}\n\nMaskedOcclusionCulling *moc = MaskedOcclusionCulling::Create(alignedAllocCallback, alignedFreeCallback);\n```\n\n## \u003ca name=\"update\"\u003e\u003c/a\u003eHierarchical depth buffer update algorithm and render order\n\nThe library contains two update algorithms / heuristics for the hierarchical depth buffer, one focused on speed and one focused on accuracy. The\nactive algorithm can be configured using the `QUICK_MASK` define. Setting the define (default) enables the algorithm described in the research paper\n[\"Masked Software Occlusion Culling\"](https://software.intel.com/en-us/articles/masked-software-occlusion-culling), which has a good balance between low\nleakage and good performance. Not defining `QUICK_MASK` enables the merging heuristic used in the paper\n[\"Masked Depth Culling for Graphics Hardware\"](http://dl.acm.org/citation.cfm?id=2818138). It is more accurate, with less leakage, but also has lower performance.\n\nIf you experience problems due to leakage you may want to use the more accurate update algorithm. However, rendering order can also affect the quality\nof the hierarchical depth buffer, with the best order being rendering objects front-to-back. We perform early depth culling tests during occluder\nrendering, so rendering in front-to-back order will not only improve quality, but also greatly improve performance of occluder rendering. 
If your scene\nis stored in a hierarchical data structure, it is often possible to modify the traversal algorithm to traverse nodes in approximate front-to-back order;\nsee the research paper [\"Masked Software Occlusion Culling\"](https://software.intel.com/en-us/articles/masked-software-occlusion-culling) for an example.\n\n## \u003ca name=\"interleaved\"\u003e\u003c/a\u003eInterleaving occluder rendering and occlusion queries\n\nThe library supports *lightweight* switching between occluder rendering and occlusion queries. While it is still possible to do occlusion culling\nas a standard two pass algorithm (first render all occluders, then perform all queries) it is typically beneficial to interleave occluder rendering with\nqueries.\n\nThis is especially powerful when rendering objects in front-to-back order. After drawing the first few occluder triangles, you can start performing\nocclusion queries, and if the occlusion query indicates that an object is occluded there is no need to draw the occluder mesh for that object. This\ncan greatly improve the performance of the occlusion culling pass in itself. As described in further detail in the research paper\n[\"Masked Software Occlusion Culling\"](https://software.intel.com/en-us/articles/masked-software-occlusion-culling), this may be used to perform early exits in\nBVH traversal code.\n\n## Rasterization precision\n\nThe library supports high precision rasterization through Bresenham interpolation, and this may be enabled by changing the `PRECISE_COVERAGE` define in \nthe header file. The high precision rasterizer is somewhat slower (5-15%) than the default rasterizer, but is compliant with DirectX 11 and OpenGL \nrasterization rules. We have empirically verified it on a large set of randomly generated on-screen triangles. 
While there still may be differences from GPU \nrasterization due to clipping or vertex transform precision differences, we have not noticed any differences in rasterized coverage in our test scenes. Note \nthat tie breaker rules and vertex rounding behave differently between DirectX and OpenGL due to the direction of the screen space Y axis. The `USE_D3D` define\n(enabled by default) can be used to toggle between DirectX and OpenGL behaviour.\n\n## Multi-threading and binned rendering\n\nMulti-threading is supported through a binning rasterizer. The `MaskedOcclusionCulling` class exposes two functions, `BinTriangles()` and `RenderTrilist()`,\nthat may be used to perform binning, and render all triangles assigned to a bin. Using binned rasterization makes it simple to guarantee that no two threads are\naccessing the same part of the framebuffer, as rendering is limited to a particular bin, or region of the screen.\n\nBinned rendering starts by performing geometry processing (primitive assembly, vertex transform, clipping, and projection) followed by a binning step, where\ntriangles are written to all bins they overlap. This is performed using the `BinTriangles()` function, which is very similar to the `RenderTriangles()`\nfunction, but provides some additional parameters for specifying the number of bins the screen is split into. The calling application also needs to pass a\npointer to an array of `TriList` objects, with one instance per bin. 
Each `TriList` object points to a \"scratchpad\" memory buffer, and all triangles overlapping\nthat bin will be written to the buffer.\n\n```C++\nconst int binsW = 4;\nconst int binsH = 4;\n\nfloat *dataBuffer = new float[binsW*binsH*1024*3*3]; // Allocate storage for 1k triangles in each trilist\nTriList *triLists  = new TriList[binsW*binsH];       // Allocate trilists for 4x4 = 16 bins\nfor (int i = 0; i \u003c binsW*binsH; ++i)\n{\n\ttriLists[i].mNumTriangles = 1024; // triangle list capacity\n\ttriLists[i].mTriIdx = 0; // Set triangle write pointer to first element\n\ttriLists[i].mData = dataBuffer + i*1024*3*3; // Each bin owns 1024*3*3 floats of the buffer\n}\n\n// Perform geometry processing and write triangles to the triLists of all bins they overlap.\nmoc-\u003eBinTriangles((float*)triVerts, triIndices, nTris, triLists, binsW, binsH);\n```\n\nAfter generating the triangle lists for each bin, the triangles may be rendered using the `RenderTrilist()` function and the rendering region should be\nlimited using a scissor rectangle. It should be noted that the `BinTriangles()` function makes assumptions on the size of the bins, and the calling\napplication must therefore always compute the scissor region of each bin, relying on the `ComputeBinWidthHeight()` utility function as shown in the \nexample below. Note that the scissor rectangle is specified in screen space coordinates which depend on the `USE_D3D` define.\n\n```C++\nunsigned int binWidth, binHeight;\nmoc-\u003eComputeBinWidthHeight(binsW, binsH, binWidth, binHeight);\n\nfor (int by = 0; by \u003c binsH; ++by)\n{\n\tfor (int bx = 0; bx \u003c binsW; ++bx)\n\t{\n\t\t// Compute scissor rectangle that matches the one assumed by BinTriangles()\n\t\t// note that the ScissorRect is specified in pixel coordinates, with (0,0)\n\t\t// being the bottom left corner\n\t\tScissorRect binRect;\n\t\tbinRect.minX = bx*binWidth;\n\t\tbinRect.maxX = bx + 1 == binsW ? screenWidth : (bx + 1) * binWidth;\n\t\tbinRect.minY = by*binHeight;\n\t\tbinRect.maxY = by + 1 == binsH ? 
screenHeight : (by + 1) * binHeight;\n\n\t\t// Render all triangles overlapping the current bin.\n\t\tmoc-\u003eRenderTrilist(triLists[bx + by*binsW], \u0026binRect);\n\t}\n}\n```\n\n### Multi-threading example\n\nThis library includes a multi-threading example in the `CullingThreadpool` class. The class interface is similar to that of `MaskedOcclusionCulling`, but occluder\nrendering is performed asynchronously. Calling the `CullingThreadpool::RenderTriangles()` function adds a render job to a command queue and immediately returns\nto the calling thread, rather than immediately performing the rendering work. Internally, the class uses the `BinTriangles()` and `RenderTrilist()` functions to\nbin all triangles of the `CullingThreadpool::RenderTriangles()` call, and distribute the tiles. At any time, there may be a number of binning jobs and tile\nrendering jobs unprocessed, and the scheduler picks the most urgent job and processes it first. If a thread runs out of available jobs, task stealing is used as a\nmeans of improving load-balancing.\n\nThe occlusion query functions `CullingThreadpool::TestTriangles()` and `CullingThreadpool::TestRect()` immediately return the result of the query. However, since the\nquery depends on the contents of the hierarchical depth buffer, you may need to wait for the worker threads to finish to make sure the query is performed on the\nmost up to date version of the buffer. This can be accomplished by calling `CullingThreadpool::Flush()`. It is not always necessary to work with the most up to\ndate version of the hierarchical depth buffer for a query. While the result may be incorrect, it is still always conservative in that occluded objects may be\nclassified as visible, but not the other way around. Since `CullingThreadpool::Flush()` causes a wait it may be beneficial to work against a slightly out of\ndate version of the hierarchical depth buffer if your application will cause a lot of flushes. 
We found this particularly true when implementing threading in\nour interleaved BVH traversal algorithm (see the [\"Masked Software Occlusion Culling\"](https://software.intel.com/en-us/articles/masked-software-occlusion-culling)\npaper) where each BVH traversal step is based on the outcome of an occlusion query interleaved with occluder rendering for the BVH-leaves.\n\nThe `CullingThreadpool` class was written as an example and not as the de facto threading approach. In some cases we believe it is possible to improve performance\nfurther by threading occlusion queries, or by threading the entire occlusion culling system, including scene graph traversal. However, it does provide a simple means\nof enabling multi-threading in a traditional single threaded application as the API is very similar to the `MaskedOcclusionCulling` class, and may be called from\na single threaded application. As previously mentioned we integrated this implementation in our interleaved BVH traversal algorithm (see the [\"Masked Software Occlusion Culling\"](https://software.intel.com/en-us/articles/masked-software-occlusion-culling)\npaper) and noted a speedup of roughly *3x*, running on four threads, compared to our previous single threaded implementation.\n\n## Compiling\n\nThe code has been reworked to support more platforms and compilers, such as [Intel C++ Compiler](https://software.intel.com/en-us/intel-compilers), [G++](https://gcc.gnu.org/) \nand [LLVM/Clang](http://releases.llvm.org/download.html). The original Visual Studio 2015 projects remain and work with both ICC and Microsoft's compilers. Other compilers \nare supported through [CMake](https://cmake.org/). See the `CMakeLists.txt` files in the `Example` and `FillrateTest` folders. 
You can use CMake to generate a \nVisual Studio project for Clang on Windows:\n\n```\nmd \u003cpath to library\u003e\\Example\\Clang\ncd \u003cpath to library\u003e\\Example\\Clang\ncmake -G\"Visual Studio 14 2015 Win64\" -T\"LLVM-vs2014\" ..\n```\n\nor build the library with G++/Clang on Linux systems (the `D3DValidate` sample only works on Windows as it relies on Direct3D):\n\n```\nmkdir \u003cpath to library\u003e/Example/Release\ncd \u003cpath to library\u003e/Example/Release\ncmake -DCMAKE_BUILD_TYPE=Release ..\nmake\n```\n\nNote that AVX-512 support is only experimental at the moment, and has only been verified through [Intel SDE](https://software.intel.com/en-us/articles/pre-release-license-agreement-for-intel-software-development-emulator-accept-end-user-license-agreement-and-download).\nIf using the original Visual Studio project, you need to \"opt in\" for AVX-512 support by setting `#define USE_AVX512 1`. When building with CMake you can\nenable AVX-512 support using the `-DUSE_AVX512=ON` option:\n\n```\ncmake -DUSE_AVX512=ON -G\"Visual Studio 14 2015 Win64\" -T\"LLVM-vs2014\" ..\n```\n\n## Version History\n\n* Version 1.4: \n  * Added support for merging 2 depth buffers as detailed in the GDC 2018 presentation.\n  * Fixed profiling counters to be thread safe, removing a race condition when running the `CullingThreadpool` class.\n* Version 1.3: \n  * **Experimental**: Added support for AVX-512 capable CPUs. Currently only verified through an [emulator](https://software.intel.com/en-us/articles/intel-software-development-emulator).\n  * Added multiplatform support. Code now compiles on Visual C++ Compiler, Intel C++ Compiler, GCC, and Clang.\n  * Added configurable backface culling, to support two-sided occluder rendering.\n* Version 1.2: \n  * Added support for threading, through a binning rasterizer. 
The `CullingThreadpool` class implements an example multi-threaded task system with a very similar \n    API to the `MaskedOcclusionCulling` class.\n  * Added support for higher precision rasterization, with DirectX and OpenGL compliant rasterization rules.\n  * **Note:** The default screen space coordinate system has been changed from OpenGL to DirectX conventions. If you upgrade from an older version of the library\n    this will flip the y coordinate of scissor boxes and the images returned by `ComputePixelDepthBuffer()`. Disabling the `USE_D3D` define changes back to OpenGL conventions.\n* Version 1.1: \n  * Added support for SSE4.1 and SSE2 capable CPUs for backwards compatibility. The SSE versions must emulate some operations using\n    simpler instructions, and are therefore less efficient, with the SSE2 version having the lowest performance.\n* Version 1.0: \n  * Initial revision, only support for AVX2 capable CPUs.\n\n## Differences to the research paper\n\nThis code does not exactly match the implementation used in\n[\"Masked Software Occlusion Culling\"](https://software.intel.com/en-us/articles/masked-software-occlusion-culling), and performance may vary slightly\nfrom what is presented in the research paper. We aimed to make the API as simple as possible and have removed many limitations, in particular\nrequirements on input data being aligned to SIMD boundaries. This affects performance slightly in both directions. Unaligned loads and\ngathers are more costly, but unaligned data may be packed more efficiently in memory, leading to fewer cache misses.\n\n## License agreement\n\nCopyright Intel(R) Corporation 2016-2024.\n\nSee the Apache 2.0 license.txt for full license agreement details.\n\nDisclaimer:\n\nThis software is subject to the U.S. 
Export Administration Regulations and other U.S.\nlaw, and may not be exported or re-exported to certain countries (Cuba, Iran, North\nKorea, Sudan, and Syria) or to persons or entities prohibited from receiving U.S.\nexports (including Denied Parties, Specially Designated Nationals, and entities on the\nBureau of Export Administration Entity List or involved with missile technology or\nnuclear, chemical or biological weapons).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FGameTechDev%2FMaskedOcclusionCulling","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FGameTechDev%2FMaskedOcclusionCulling","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FGameTechDev%2FMaskedOcclusionCulling/lists"}