{"id":13730772,"url":"https://github.com/sebbbi/perftest","last_synced_at":"2025-05-08T03:31:44.385Z","repository":{"id":49844652,"uuid":"75556029","full_name":"sebbbi/perftest","owner":"sebbbi","description":"GPU texture/buffer performance tester","archived":false,"fork":false,"pushed_at":"2020-11-19T12:59:46.000Z","size":197,"stargazers_count":523,"open_issues_count":5,"forks_count":26,"subscribers_count":28,"default_branch":"master","last_synced_at":"2024-08-04T02:09:49.949Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sebbbi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-12-04T18:32:02.000Z","updated_at":"2024-08-03T03:42:27.000Z","dependencies_parsed_at":"2022-09-18T11:03:09.955Z","dependency_job_id":null,"html_url":"https://github.com/sebbbi/perftest","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sebbbi%2Fperftest","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sebbbi%2Fperftest/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sebbbi%2Fperftest/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sebbbi%2Fperftest/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sebbbi","download_url":"https://codeload.github.com/sebbbi/perftest/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224695702,"owners_count":17354460,"icon_url"
:"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T02:01:19.208Z","updated_at":"2024-11-14T21:31:34.901Z","avatar_url":"https://github.com/sebbbi.png","language":"C++","readme":"# PerfTest\n\nA simple GPU shader memory operation performance test tool. Current implementation is DirectX 11.0 based.\n\nThe purpose of this application is not to benchmark different brand GPUs against each other. Its purpose is to help rendering programmers to choose right types of resources when optimizing their compute shader performance.\n\nThis application is designed to measure peak data load performance from L1 caches. I tried to avoid known hardware bottlenecks. **If you notice something wrong or suspicious in the shader workload, please inform me immediately and I will fix it.** If my shaders are affected by some hardware bottlenecks, I am glad to hear about it and write more test cases to show the best performance. The goal is that developers gain better understanding of various GPU hardware on the market and gain insight to optimize code for them.\n\n## Features\n\nDesigned to measure performance of various types of buffer and image loads. This application is not a GPU memory bandwidth measurement tool. All tests operate inside GPUs L1 caches (no larger than 16 KB working sets). 
\n\n- Coalesced loads (100% L1 cache hit)\n- Random loads (100% L1 cache hit)\n- Uniform address loads (same address for all threads)\n- Typed Buffer SRVs: 1/2/4 channels, 8/16/32 bits per channel\n- ByteAddressBuffer SRVs: load, load2, load3, load4 - aligned and unaligned\n- Structured Buffer SRVs: float/float2/float4\n- Constant Buffer float4 array indexed loads\n- Texture2D loads: 1/2/4 channels, 8/16/32 bits per channel\n- Texture2D nearest sampling: 1/2/4 channels, 8/16/32 bits per channel\n- Texture2D bilinear sampling: 1/2/4 channels, 8/16/32 bits per channel\n\n## Explanations\n\n**Coalesced loads:**\nGPUs optimize linear address patterns. Coalescing occurs when all threads in a warp/wave (32/64 threads) load from contiguous addresses. In my \"linear\" test case, memory loads access contiguous addresses in the whole thread group (256 threads). This should coalesce perfectly on all GPUs, independent of warp/wave width.\n\n**Random loads:**\nI add a random start offset of 0-15 elements for each thread (still aligned). This prevents GPU coalescing and provides a more realistic view of performance for the common case of non-linear memory access. This benchmark is as cache efficient as the previous one. All data still comes from the L1 cache.\n\n**Uniform loads:**\nAll threads in a group simultaneously load from the same address. This triggers the coalesced path on some GPUs and additional optimizations on others, such as scalar loads (SGPR storage) on AMD GCN. I have noticed that recent Intel and Nvidia drivers also implement a software optimization for the uniform load loop case (which is employed by this benchmark).\n\n**Notes:**\n**Compiler optimizations** can ruin the results. We want to measure only load (read) performance, but a write (store) is also needed, otherwise the compiler will just optimize the whole shader away. To avoid this, each thread first does 256 loads followed by a single linear groupshared memory write (no bank conflicts). 
The cbuffer contains a write mask (not known at compile time). It controls which elements are written from the groupshared memory to the output buffer. The mask is always zero at runtime. Compilers can also combine multiple narrow raw buffer loads together (as bigger 4d loads) if it can be proven at compile time that loads from the same thread access contiguous offsets. This is prevented by applying an address mask from the cbuffer (not known at compile time). \n\n## Uniform Load Investigation\nWhen I first implemented this benchmark, I noticed that Intel uniform address loads were surprisingly fast. Intel ISA documents don't mention anything about a scalar unit or other hardware feature to make uniform address loads fast. This optimization affected every single resource type, unlike AMD's hardware scalar unit (which only works for raw data loads). I didn't investigate this further at that point, however. When Nvidia released Volta GPUs, they shipped a new driver that implemented a similar compiler optimization. Later drivers introduced the same optimization to Maxwell and Pascal too. And now Turing also has it. It's certainly not hardware based, since 20x+ gains apply to all their existing GPUs too.\n\nOn the weekend of Nov 10-11, 2018, I was toying around with Vulkan/DX12 wave intrinsics, and came up with a crazy idea to use a single wave wide load and then use wave intrinsics to broadcast the scalar result (single lane) to each loop iteration. This reduces the number of loads by up to the wave width. \n\nSee the gist and Shader Playground links here:\nhttps://gist.github.com/sebbbi/ba4415339b535d22fb18e2d824564ec4\n\nIn Nvidia's uniform load optimization case, their wave width = 32, and their uniform load optimization performance boost is up to 28x. This finding really made me curious. Could Nvidia implement a similar warp shuffle based optimization for this use case? 
The funny thing is that my tweets escalated the situation, and made Intel reveal their hand:\n\nhttps://twitter.com/JoshuaBarczak/status/1062060067334189056\n\nIntel has now officially revealed that their driver does a wave shuffle optimization for uniform address loads. They have been doing it for years already. This explains Intel GPU benchmark results perfectly. Now that we have confirmation of Intel's (original) optimization, I suspect that Nvidia's shader compiler employs a highly similar optimization in this case. Both optimizations are great, because Nvidia/Intel do not have a dedicated scalar unit. They need to lean more on vector loads, and this trick allows sharing one vector load with multiple uniform address load loop iterations.\n\n## Results\nAll results are compared to ```Buffer\u003cRGBA8\u003e.Load random``` result (=1.0x) on the same GPU.\n\n### AMD GCN2 (R9 390X)\n```markdown\nBuffer\u003cR8\u003e.Load uniform: 11.302ms 3.907x\nBuffer\u003cR8\u003e.Load linear: 11.327ms 3.899x\nBuffer\u003cR8\u003e.Load random: 44.150ms 1.000x\nBuffer\u003cRG8\u003e.Load uniform: 49.611ms 0.890x\nBuffer\u003cRG8\u003e.Load linear: 49.835ms 0.886x\nBuffer\u003cRG8\u003e.Load random: 49.615ms 0.890x\nBuffer\u003cRGBA8\u003e.Load uniform: 44.149ms 1.000x\nBuffer\u003cRGBA8\u003e.Load linear: 44.806ms 0.986x\nBuffer\u003cRGBA8\u003e.Load random: 44.164ms 1.000x\nBuffer\u003cR16f\u003e.Load uniform: 11.131ms 3.968x\nBuffer\u003cR16f\u003e.Load linear: 11.139ms 3.965x\nBuffer\u003cR16f\u003e.Load random: 44.076ms 1.002x\nBuffer\u003cRG16f\u003e.Load uniform: 49.552ms 0.891x\nBuffer\u003cRG16f\u003e.Load linear: 49.560ms 0.891x\nBuffer\u003cRG16f\u003e.Load random: 49.559ms 0.891x\nBuffer\u003cRGBA16f\u003e.Load uniform: 44.066ms 1.002x\nBuffer\u003cRGBA16f\u003e.Load linear: 44.687ms 0.988x\nBuffer\u003cRGBA16f\u003e.Load random: 44.066ms 1.002x\nBuffer\u003cR32f\u003e.Load uniform: 11.132ms 3.967x\nBuffer\u003cR32f\u003e.Load linear: 11.139ms 
3.965x\nBuffer\u003cR32f\u003e.Load random: 44.071ms 1.002x\nBuffer\u003cRG32f\u003e.Load uniform: 49.558ms 0.891x\nBuffer\u003cRG32f\u003e.Load linear: 49.560ms 0.891x\nBuffer\u003cRG32f\u003e.Load random: 49.559ms 0.891x\nBuffer\u003cRGBA32f\u003e.Load uniform: 44.061ms 1.002x\nBuffer\u003cRGBA32f\u003e.Load linear: 44.613ms 0.990x\nBuffer\u003cRGBA32f\u003e.Load random: 49.583ms 0.891x\nByteAddressBuffer.Load uniform: 10.322ms 4.278x\nByteAddressBuffer.Load linear: 11.546ms 3.825x\nByteAddressBuffer.Load random: 44.153ms 1.000x\nByteAddressBuffer.Load2 uniform: 11.499ms 3.841x\nByteAddressBuffer.Load2 linear: 49.628ms 0.890x\nByteAddressBuffer.Load2 random: 49.651ms 0.889x\nByteAddressBuffer.Load3 uniform: 16.985ms 2.600x\nByteAddressBuffer.Load3 linear: 44.142ms 1.000x\nByteAddressBuffer.Load3 random: 88.176ms 0.501x\nByteAddressBuffer.Load4 uniform: 22.472ms 1.965x\nByteAddressBuffer.Load4 linear: 44.212ms 0.999x\nByteAddressBuffer.Load4 random: 49.346ms 0.895x\nByteAddressBuffer.Load2 unaligned uniform: 11.422ms 3.867x\nByteAddressBuffer.Load2 unaligned linear: 49.552ms 0.891x\nByteAddressBuffer.Load2 unaligned random: 49.561ms 0.891x\nByteAddressBuffer.Load4 unaligned uniform: 22.373ms 1.974x\nByteAddressBuffer.Load4 unaligned linear: 44.095ms 1.002x\nByteAddressBuffer.Load4 unaligned random: 54.464ms 0.811x\nStructuredBuffer\u003cfloat\u003e.Load uniform: 12.585ms 3.509x\nStructuredBuffer\u003cfloat\u003e.Load linear: 11.770ms 3.752x\nStructuredBuffer\u003cfloat\u003e.Load random: 44.176ms 1.000x\nStructuredBuffer\u003cfloat2\u003e.Load uniform: 13.210ms 3.343x\nStructuredBuffer\u003cfloat2\u003e.Load linear: 50.217ms 0.879x\nStructuredBuffer\u003cfloat2\u003e.Load random: 49.645ms 0.890x\nStructuredBuffer\u003cfloat4\u003e.Load uniform: 13.818ms 3.196x\nStructuredBuffer\u003cfloat4\u003e.Load random: 49.666ms 0.889x\nStructuredBuffer\u003cfloat4\u003e.Load linear: 44.721ms 0.988x\ncbuffer{float4} load uniform: 16.702ms 2.644x\ncbuffer{float4} load linear: 
44.447ms 0.994x\ncbuffer{float4} load random: 49.656ms 0.889x\nTexture2D\u003cR8\u003e.Load uniform: 44.214ms 0.999x\nTexture2D\u003cR8\u003e.Load linear: 44.795ms 0.986x\nTexture2D\u003cR8\u003e.Load random: 44.808ms 0.986x\nTexture2D\u003cRG8\u003e.Load uniform: 49.706ms 0.888x\nTexture2D\u003cRG8\u003e.Load linear: 50.231ms 0.879x\nTexture2D\u003cRG8\u003e.Load random: 50.200ms 0.880x\nTexture2D\u003cRGBA8\u003e.Load uniform: 44.760ms 0.987x\nTexture2D\u003cRGBA8\u003e.Load linear: 45.339ms 0.974x\nTexture2D\u003cRGBA8\u003e.Load random: 45.405ms 0.973x\nTexture2D\u003cR16F\u003e.Load uniform: 44.175ms 1.000x\nTexture2D\u003cR16F\u003e.Load linear: 44.157ms 1.000x\nTexture2D\u003cR16F\u003e.Load random: 44.096ms 1.002x\nTexture2D\u003cRG16F\u003e.Load uniform: 49.739ms 0.888x\nTexture2D\u003cRG16F\u003e.Load linear: 49.661ms 0.889x\nTexture2D\u003cRG16F\u003e.Load random: 49.622ms 0.890x\nTexture2D\u003cRGBA16F\u003e.Load uniform: 44.257ms 0.998x\nTexture2D\u003cRGBA16F\u003e.Load linear: 44.267ms 0.998x\nTexture2D\u003cRGBA16F\u003e.Load random: 88.126ms 0.501x\nTexture2D\u003cR32F\u003e.Load uniform: 44.259ms 0.998x\nTexture2D\u003cR32F\u003e.Load linear: 44.193ms 0.999x\nTexture2D\u003cR32F\u003e.Load random: 44.099ms 1.001x\nTexture2D\u003cRG32F\u003e.Load uniform: 49.739ms 0.888x\nTexture2D\u003cRG32F\u003e.Load linear: 49.667ms 0.889x\nTexture2D\u003cRG32F\u003e.Load random: 88.110ms 0.501x\nTexture2D\u003cRGBA32F\u003e.Load uniform: 44.288ms 0.997x\nTexture2D\u003cRGBA32F\u003e.Load linear: 66.145ms 0.668x\nTexture2D\u003cRGBA32F\u003e.Load random: 88.124ms 0.501x\n```\n**AMD GCN2** was a very popular architecture. The first card using this architecture was the Radeon 7790. Many Radeon 200 and 300 series cards also use this architecture. Both Xbox and PS4 (base model) GPUs are based on the GCN2 architecture, making it a very important optimization target.\n\n**Typed loads:** GCN coalesces linear typed loads, but only 1d loads (R8, R16F, R32F). 
Coalesced load performance is 4x. Both the linear access pattern (all threads in a wave load consecutive addresses) and uniform access (all threads in a wave load the same address) coalesce perfectly. Typed loads of every dimension (1d/2d/4d) and channel width (8b/16b/32b) perform identically. The best bytes/cycle rate can be achieved either by an R32 coalesced load (when the access pattern suits this) or always with an RGBA32 load.\n\n**Raw (ByteAddressBuffer) loads:** Similar to typed loads. 1d raw loads coalesce perfectly (4x) on linear access. Uniform address raw loads generate scalar unit loads on GCN. Scalar loads use a separate cache and are stored in a separate SGPR register file -\u003e reduced register \u0026 cache pressure, and no stress on the vector load path. Scalar 1d load is 4x faster than normal 1d load. Scalar 2d load is 4x faster than normal 2d load. Scalar 4d load is 2x faster than normal 4d load. Unaligned (alignment=4) loads have equal performance to aligned (alignment=8/16). 3d raw linear loads have equal performance to 4d loads, but random 3d loads are slower (about 0.5x versus about 0.9x for random 4d loads in the table above).\n\n**Texture loads:** Similar performance to typed buffer loads. However, there is no coalescing in 1d linear access and no scalar unit offload of uniform access. Random access of wide formats tends to be slower (but my 2d random produces a different access pattern than 1d).\n\n**Structured buffer loads:** Performance is identical to similar width raw buffer loads.\n\n**Cbuffer loads:** The AMD GCN architecture doesn't have special constant buffer hardware. Constant buffer load performance is identical to raw and structured buffers. Prefer uniform addresses to allow the compiler to generate scalar loads, which are around 4x faster, have much lower latency, and don't waste VGPRs.\n\n**Suggestions:** Prefer wide fat 4d loads instead of multiple narrow loads. If you have a perfectly linear memory access pattern, 1d coalesced loads are also fast. 
ByteAddressBuffers (raw loads) have good performance: full speed 128 bit 4d loads, 4x rate 1d loads (linear access), and the compiler offloads uniform address loads to the scalar unit, reducing VGPR pressure and saving vector memory instructions.\n\nThese results match AMD's wide loads \u0026 coalescing documents, see: http://gpuopen.com/gcn-memory-coalescing/. I would be glad if AMD released a public document describing all scalar load optimization cases supported by their compiler.\n\n### AMD GCN3 (R9 Fury 56 CU)\n```markdown\nBuffer\u003cR8\u003e.Load uniform: 8.963ms 3.911x\nBuffer\u003cR8\u003e.Load linear: 8.917ms 3.931x\nBuffer\u003cR8\u003e.Load random: 35.058ms 1.000x\nBuffer\u003cRG8\u003e.Load uniform: 39.416ms 0.889x\nBuffer\u003cRG8\u003e.Load linear: 39.447ms 0.889x\nBuffer\u003cRG8\u003e.Load random: 39.413ms 0.889x\nBuffer\u003cRGBA8\u003e.Load uniform: 35.051ms 1.000x\nBuffer\u003cRGBA8\u003e.Load linear: 35.048ms 1.000x\nBuffer\u003cRGBA8\u003e.Load random: 35.051ms 1.000x\nBuffer\u003cR16f\u003e.Load uniform: 8.898ms 3.939x\nBuffer\u003cR16f\u003e.Load linear: 8.909ms 3.934x\nBuffer\u003cR16f\u003e.Load random: 35.050ms 1.000x\nBuffer\u003cRG16f\u003e.Load uniform: 39.405ms 0.890x\nBuffer\u003cRG16f\u003e.Load linear: 39.435ms 0.889x\nBuffer\u003cRG16f\u003e.Load random: 39.407ms 0.889x\nBuffer\u003cRGBA16f\u003e.Load uniform: 35.041ms 1.000x\nBuffer\u003cRGBA16f\u003e.Load linear: 35.043ms 1.000x\nBuffer\u003cRGBA16f\u003e.Load random: 35.046ms 1.000x\nBuffer\u003cR32f\u003e.Load uniform: 8.897ms 3.940x\nBuffer\u003cR32f\u003e.Load linear: 8.910ms 3.934x\nBuffer\u003cR32f\u003e.Load random: 35.048ms 1.000x\nBuffer\u003cRG32f\u003e.Load uniform: 39.407ms 0.889x\nBuffer\u003cRG32f\u003e.Load linear: 39.433ms 0.889x\nBuffer\u003cRG32f\u003e.Load random: 39.406ms 0.889x\nBuffer\u003cRGBA32f\u003e.Load uniform: 35.043ms 1.000x\nBuffer\u003cRGBA32f\u003e.Load linear: 35.045ms 1.000x\nBuffer\u003cRGBA32f\u003e.Load random: 39.405ms 
0.890x\nByteAddressBuffer.Load uniform: 10.956ms 3.199x\nByteAddressBuffer.Load linear: 9.100ms 3.852x\nByteAddressBuffer.Load random: 35.038ms 1.000x\nByteAddressBuffer.Load2 uniform: 11.070ms 3.166x\nByteAddressBuffer.Load2 linear: 39.413ms 0.889x\nByteAddressBuffer.Load2 random: 39.411ms 0.889x\nByteAddressBuffer.Load3 uniform: 13.534ms 2.590x\nByteAddressBuffer.Load3 linear: 35.047ms 1.000x\nByteAddressBuffer.Load3 random: 70.033ms 0.500x\nByteAddressBuffer.Load4 uniform: 17.944ms 1.953x\nByteAddressBuffer.Load4 linear: 35.072ms 0.999x\nByteAddressBuffer.Load4 random: 39.149ms 0.895x\nByteAddressBuffer.Load2 unaligned uniform: 11.209ms 3.127x\nByteAddressBuffer.Load2 unaligned linear: 39.408ms 0.889x\nByteAddressBuffer.Load2 unaligned random: 39.406ms 0.890x\nByteAddressBuffer.Load4 unaligned uniform: 17.933ms 1.955x\nByteAddressBuffer.Load4 unaligned linear: 35.066ms 1.000x\nByteAddressBuffer.Load4 unaligned random: 43.241ms 0.811x\nStructuredBuffer\u003cfloat\u003e.Load uniform: 12.653ms 2.770x\nStructuredBuffer\u003cfloat\u003e.Load linear: 8.913ms 3.932x\nStructuredBuffer\u003cfloat\u003e.Load random: 35.059ms 1.000x\nStructuredBuffer\u003cfloat2\u003e.Load uniform: 12.799ms 2.739x\nStructuredBuffer\u003cfloat2\u003e.Load linear: 39.445ms 0.889x\nStructuredBuffer\u003cfloat2\u003e.Load random: 39.413ms 0.889x\nStructuredBuffer\u003cfloat4\u003e.Load uniform: 12.834ms 2.731x\nStructuredBuffer\u003cfloat4\u003e.Load linear: 35.049ms 1.000x\nStructuredBuffer\u003cfloat4\u003e.Load random: 39.411ms 0.889x\ncbuffer{float4} load uniform: 14.861ms 2.359x\ncbuffer{float4} load linear: 35.534ms 0.986x\ncbuffer{float4} load random: 39.412ms 0.889x\nTexture2D\u003cR8\u003e.Load uniform: 35.063ms 1.000x\nTexture2D\u003cR8\u003e.Load linear: 35.038ms 1.000x\nTexture2D\u003cR8\u003e.Load random: 35.040ms 1.000x\nTexture2D\u003cRG8\u003e.Load uniform: 39.430ms 0.889x\nTexture2D\u003cRG8\u003e.Load linear: 39.436ms 0.889x\nTexture2D\u003cRG8\u003e.Load random: 39.436ms 
0.889x\nTexture2D\u003cRGBA8\u003e.Load uniform: 35.059ms 1.000x\nTexture2D\u003cRGBA8\u003e.Load linear: 35.061ms 1.000x\nTexture2D\u003cRGBA8\u003e.Load random: 35.055ms 1.000x\nTexture2D\u003cR16F\u003e.Load uniform: 35.056ms 1.000x\nTexture2D\u003cR16F\u003e.Load linear: 35.038ms 1.000x\nTexture2D\u003cR16F\u003e.Load random: 35.040ms 1.000x\nTexture2D\u003cRG16F\u003e.Load uniform: 39.431ms 0.889x\nTexture2D\u003cRG16F\u003e.Load linear: 39.440ms 0.889x\nTexture2D\u003cRG16F\u003e.Load random: 39.436ms 0.889x\nTexture2D\u003cRGBA16F\u003e.Load uniform: 35.054ms 1.000x\nTexture2D\u003cRGBA16F\u003e.Load linear: 35.061ms 1.000x\nTexture2D\u003cRGBA16F\u003e.Load random: 70.037ms 0.500x\nTexture2D\u003cR32F\u003e.Load uniform: 35.055ms 1.000x\nTexture2D\u003cR32F\u003e.Load linear: 35.041ms 1.000x\nTexture2D\u003cR32F\u003e.Load random: 35.041ms 1.000x\nTexture2D\u003cRG32F\u003e.Load uniform: 39.433ms 0.889x\nTexture2D\u003cRG32F\u003e.Load linear: 39.439ms 0.889x\nTexture2D\u003cRG32F\u003e.Load random: 70.039ms 0.500x\nTexture2D\u003cRGBA32F\u003e.Load uniform: 35.054ms 1.000x\nTexture2D\u003cRGBA32F\u003e.Load linear: 52.549ms 0.667x\nTexture2D\u003cRGBA32F\u003e.Load random: 70.037ms 0.500x\n ```\n  \n**AMD GCN3** results (ratios) are identical to GCN2. See GCN2 for analysis. 
Clock and SM scaling reveal that there's no bandwidth/issue related changes in the texture/L1$ architecture between different GCN revisions.\n\n### AMD GCN4 (RX 480)\n```markdown\nBuffer\u003cR8\u003e.Load uniform: 11.008ms 3.900x\nBuffer\u003cR8\u003e.Load linear: 11.187ms 3.838x\nBuffer\u003cR8\u003e.Load random: 42.906ms 1.001x\nBuffer\u003cRG8\u003e.Load uniform: 48.280ms 0.889x\nBuffer\u003cRG8\u003e.Load linear: 48.685ms 0.882x\nBuffer\u003cRG8\u003e.Load random: 48.246ms 0.890x\nBuffer\u003cRGBA8\u003e.Load uniform: 42.911ms 1.001x\nBuffer\u003cRGBA8\u003e.Load linear: 43.733ms 0.982x\nBuffer\u003cRGBA8\u003e.Load random: 42.934ms 1.000x\nBuffer\u003cR16f\u003e.Load uniform: 10.852ms 3.956x\nBuffer\u003cR16f\u003e.Load linear: 10.840ms 3.961x\nBuffer\u003cR16f\u003e.Load random: 42.820ms 1.003x\nBuffer\u003cRG16f\u003e.Load uniform: 48.153ms 0.892x\nBuffer\u003cRG16f\u003e.Load linear: 48.161ms 0.891x\nBuffer\u003cRG16f\u003e.Load random: 48.161ms 0.891x\nBuffer\u003cRGBA16f\u003e.Load uniform: 42.832ms 1.002x\nBuffer\u003cRGBA16f\u003e.Load linear: 42.900ms 1.001x\nBuffer\u003cRGBA16f\u003e.Load random: 42.844ms 1.002x\nBuffer\u003cR32f\u003e.Load uniform: 10.852ms 3.956x\nBuffer\u003cR32f\u003e.Load linear: 10.841ms 3.960x\nBuffer\u003cR32f\u003e.Load random: 42.816ms 1.003x\nBuffer\u003cRG32f\u003e.Load uniform: 48.158ms 0.892x\nBuffer\u003cRG32f\u003e.Load linear: 48.161ms 0.891x\nBuffer\u003cRG32f\u003e.Load random: 48.161ms 0.891x\nBuffer\u003cRGBA32f\u003e.Load uniform: 42.827ms 1.002x\nBuffer\u003cRGBA32f\u003e.Load linear: 42.913ms 1.000x\nBuffer\u003cRGBA32f\u003e.Load random: 48.176ms 0.891x\nByteAddressBuffer.Load uniform: 13.403ms 3.203x\nByteAddressBuffer.Load linear: 11.118ms 3.862x\nByteAddressBuffer.Load random: 42.911ms 1.001x\nByteAddressBuffer.Load2 uniform: 13.503ms 3.180x\nByteAddressBuffer.Load2 linear: 48.235ms 0.890x\nByteAddressBuffer.Load2 random: 48.242ms 0.890x\nByteAddressBuffer.Load3 uniform: 16.646ms 
2.579x\nByteAddressBuffer.Load3 linear: 42.913ms 1.001x\nByteAddressBuffer.Load3 random: 85.682ms 0.501x\nByteAddressBuffer.Load4 uniform: 21.836ms 1.966x\nByteAddressBuffer.Load4 linear: 42.929ms 1.000x\nByteAddressBuffer.Load4 random: 47.936ms 0.896x\nByteAddressBuffer.Load2 unaligned uniform: 13.454ms 3.191x\nByteAddressBuffer.Load2 unaligned linear: 48.150ms 0.892x\nByteAddressBuffer.Load2 unaligned random: 48.163ms 0.891x\nByteAddressBuffer.Load4 unaligned uniform: 21.765ms 1.973x\nByteAddressBuffer.Load4 unaligned linear: 42.853ms 1.002x\nByteAddressBuffer.Load4 unaligned random: 52.866ms 0.812x\nStructuredBuffer\u003cfloat\u003e.Load uniform: 15.513ms 2.768x\nStructuredBuffer\u003cfloat\u003e.Load linear: 10.895ms 3.941x\nStructuredBuffer\u003cfloat\u003e.Load random: 42.885ms 1.001x\nStructuredBuffer\u003cfloat2\u003e.Load uniform: 15.695ms 2.736x\nStructuredBuffer\u003cfloat2\u003e.Load linear: 48.231ms 0.890x\nStructuredBuffer\u003cfloat2\u003e.Load random: 48.217ms 0.890x\nStructuredBuffer\u003cfloat4\u003e.Load uniform: 15.810ms 2.716x\nStructuredBuffer\u003cfloat4\u003e.Load linear: 42.907ms 1.001x\nStructuredBuffer\u003cfloat4\u003e.Load random: 48.224ms 0.890x\ncbuffer{float4} load uniform: 17.249ms 2.489x\ncbuffer{float4} load linear: 43.054ms 0.997x\ncbuffer{float4} load random: 48.214ms 0.890x\nTexture2D\u003cR8\u003e.Load uniform: 42.889ms 1.001x\nTexture2D\u003cR8\u003e.Load linear: 42.877ms 1.001x\nTexture2D\u003cR8\u003e.Load random: 42.889ms 1.001x\nTexture2D\u003cRG8\u003e.Load uniform: 48.252ms 0.890x\nTexture2D\u003cRG8\u003e.Load linear: 48.254ms 0.890x\nTexture2D\u003cRG8\u003e.Load random: 48.254ms 0.890x\nTexture2D\u003cRGBA8\u003e.Load uniform: 42.939ms 1.000x\nTexture2D\u003cRGBA8\u003e.Load linear: 42.969ms 0.999x\nTexture2D\u003cRGBA8\u003e.Load random: 42.945ms 1.000x\nTexture2D\u003cR16F\u003e.Load uniform: 42.891ms 1.001x\nTexture2D\u003cR16F\u003e.Load linear: 42.915ms 1.000x\nTexture2D\u003cR16F\u003e.Load random: 42.866ms 
1.002x\nTexture2D\u003cRG16F\u003e.Load uniform: 48.234ms 0.890x\nTexture2D\u003cRG16F\u003e.Load linear: 48.365ms 0.888x\nTexture2D\u003cRG16F\u003e.Load random: 48.220ms 0.890x\nTexture2D\u003cRGBA16F\u003e.Load uniform: 42.911ms 1.001x\nTexture2D\u003cRGBA16F\u003e.Load linear: 42.943ms 1.000x\nTexture2D\u003cRGBA16F\u003e.Load random: 85.655ms 0.501x\nTexture2D\u003cR32F\u003e.Load uniform: 42.896ms 1.001x\nTexture2D\u003cR32F\u003e.Load linear: 42.910ms 1.001x\nTexture2D\u003cR32F\u003e.Load random: 42.871ms 1.001x\nTexture2D\u003cRG32F\u003e.Load uniform: 48.239ms 0.890x\nTexture2D\u003cRG32F\u003e.Load linear: 48.367ms 0.888x\nTexture2D\u003cRG32F\u003e.Load random: 85.634ms 0.501x\nTexture2D\u003cRGBA32F\u003e.Load uniform: 42.927ms 1.000x\nTexture2D\u003cRGBA32F\u003e.Load linear: 64.284ms 0.668x\nTexture2D\u003cRGBA32F\u003e.Load random: 85.638ms 0.501x\n```\n**AMD GCN4** results (ratios) are identical to GCN2/3. See GCN2 for analysis. Clock and SM scaling reveal that there's no bandwidth/issue related changes in the texture/L1$ architecture between different GCN revisions.\n\n### AMD GCN5 (Vega Frontier Edition)\n```markdown\nBuffer\u003cR8\u003e.Load uniform: 6.024ms 3.693x\nBuffer\u003cR8\u003e.Load linear: 5.798ms 3.838x\nBuffer\u003cR8\u003e.Load random: 21.411ms 1.039x\nBuffer\u003cRG8\u003e.Load uniform: 21.648ms 1.028x\nBuffer\u003cRG8\u003e.Load linear: 21.108ms 1.054x\nBuffer\u003cRG8\u003e.Load random: 21.721ms 1.024x\nBuffer\u003cRGBA8\u003e.Load uniform: 22.315ms 0.997x\nBuffer\u003cRGBA8\u003e.Load linear: 22.055ms 1.009x\nBuffer\u003cRGBA8\u003e.Load random: 22.251ms 1.000x\nBuffer\u003cR16f\u003e.Load uniform: 6.421ms 3.465x\nBuffer\u003cR16f\u003e.Load linear: 6.119ms 3.636x\nBuffer\u003cR16f\u003e.Load random: 21.534ms 1.033x\nBuffer\u003cRG16f\u003e.Load uniform: 21.010ms 1.059x\nBuffer\u003cRG16f\u003e.Load linear: 20.785ms 1.071x\nBuffer\u003cRG16f\u003e.Load random: 20.903ms 1.064x\nBuffer\u003cRGBA16f\u003e.Load uniform: 21.083ms 
1.055x\nBuffer\u003cRGBA16f\u003e.Load linear: 22.849ms 0.974x\nBuffer\u003cRGBA16f\u003e.Load random: 22.189ms 1.003x\nBuffer\u003cR32f\u003e.Load uniform: 6.374ms 3.491x\nBuffer\u003cR32f\u003e.Load linear: 6.265ms 3.552x\nBuffer\u003cR32f\u003e.Load random: 21.892ms 1.016x\nBuffer\u003cRG32f\u003e.Load uniform: 21.918ms 1.015x\nBuffer\u003cRG32f\u003e.Load linear: 21.081ms 1.056x\nBuffer\u003cRG32f\u003e.Load random: 22.866ms 0.973x\nBuffer\u003cRGBA32f\u003e.Load uniform: 22.022ms 1.010x\nBuffer\u003cRGBA32f\u003e.Load linear: 22.025ms 1.010x\nBuffer\u003cRGBA32f\u003e.Load random: 24.889ms 0.894x\nByteAddressBuffer.Load uniform: 5.187ms 4.289x\nByteAddressBuffer.Load linear: 6.682ms 3.330x\nByteAddressBuffer.Load random: 22.153ms 1.004x\nByteAddressBuffer.Load2 uniform: 5.907ms 3.767x\nByteAddressBuffer.Load2 linear: 21.541ms 1.033x\nByteAddressBuffer.Load2 random: 22.435ms 0.992x\nByteAddressBuffer.Load3 uniform: 8.896ms 2.501x\nByteAddressBuffer.Load3 linear: 22.019ms 1.011x\nByteAddressBuffer.Load3 random: 43.438ms 0.512x\nByteAddressBuffer.Load4 uniform: 10.671ms 2.085x\nByteAddressBuffer.Load4 linear: 20.912ms 1.064x\nByteAddressBuffer.Load4 random: 23.508ms 0.947x\nByteAddressBuffer.Load2 unaligned uniform: 6.080ms 3.660x\nByteAddressBuffer.Load2 unaligned linear: 21.813ms 1.020x\nByteAddressBuffer.Load2 unaligned random: 22.436ms 0.992x\nByteAddressBuffer.Load4 unaligned uniform: 11.457ms 1.942x\nByteAddressBuffer.Load4 unaligned linear: 21.817ms 1.020x\nByteAddressBuffer.Load4 unaligned random: 27.530ms 0.808x\nStructuredBuffer\u003cfloat\u003e.Load uniform: 6.384ms 3.486x\nStructuredBuffer\u003cfloat\u003e.Load linear: 6.314ms 3.524x\nStructuredBuffer\u003cfloat\u003e.Load random: 21.424ms 1.039x\nStructuredBuffer\u003cfloat2\u003e.Load uniform: 6.257ms 3.556x\nStructuredBuffer\u003cfloat2\u003e.Load linear: 20.940ms 1.063x\nStructuredBuffer\u003cfloat2\u003e.Load random: 23.044ms 0.966x\nStructuredBuffer\u003cfloat4\u003e.Load uniform: 6.620ms 
3.361x\nStructuredBuffer\u003cfloat4\u003e.Load linear: 21.771ms 1.022x\nStructuredBuffer\u003cfloat4\u003e.Load random: 25.229ms 0.882x\ncbuffer{float4} load uniform: 8.011ms 2.778x\ncbuffer{float4} load linear: 22.951ms 0.969x\ncbuffer{float4} load random: 24.806ms 0.897x\nTexture2D\u003cR8\u003e.Load uniform: 22.585ms 0.985x\nTexture2D\u003cR8\u003e.Load linear: 21.733ms 1.024x\nTexture2D\u003cR8\u003e.Load random: 21.371ms 1.041x\nTexture2D\u003cRG8\u003e.Load uniform: 20.774ms 1.071x\nTexture2D\u003cRG8\u003e.Load linear: 20.806ms 1.069x\nTexture2D\u003cRG8\u003e.Load random: 22.936ms 0.970x\nTexture2D\u003cRGBA8\u003e.Load uniform: 22.022ms 1.010x\nTexture2D\u003cRGBA8\u003e.Load linear: 21.644ms 1.028x\nTexture2D\u003cRGBA8\u003e.Load random: 22.586ms 0.985x\nTexture2D\u003cR16F\u003e.Load uniform: 22.620ms 0.984x\nTexture2D\u003cR16F\u003e.Load linear: 22.730ms 0.979x\nTexture2D\u003cR16F\u003e.Load random: 21.356ms 1.042x\nTexture2D\u003cRG16F\u003e.Load uniform: 20.722ms 1.074x\nTexture2D\u003cRG16F\u003e.Load linear: 20.723ms 1.074x\nTexture2D\u003cRG16F\u003e.Load random: 21.893ms 1.016x\nTexture2D\u003cRGBA16F\u003e.Load uniform: 22.287ms 0.998x\nTexture2D\u003cRGBA16F\u003e.Load linear: 22.116ms 1.006x\nTexture2D\u003cRGBA16F\u003e.Load random: 42.739ms 0.521x\nTexture2D\u003cR32F\u003e.Load uniform: 21.325ms 1.043x\nTexture2D\u003cR32F\u003e.Load linear: 21.370ms 1.041x\nTexture2D\u003cR32F\u003e.Load random: 21.393ms 1.040x\nTexture2D\u003cRG32F\u003e.Load uniform: 20.747ms 1.072x\nTexture2D\u003cRG32F\u003e.Load linear: 20.754ms 1.072x\nTexture2D\u003cRG32F\u003e.Load random: 41.415ms 0.537x\nTexture2D\u003cRGBA32F\u003e.Load uniform: 20.551ms 1.083x\nTexture2D\u003cRGBA32F\u003e.Load linear: 31.748ms 0.701x\nTexture2D\u003cRGBA32F\u003e.Load random: 42.097ms 0.529x\n```\n**AMD GCN5** results (ratios) are identical to GCN2/3/4. See GCN2 for analysis. 
Clock and SM scaling reveal that there's no bandwidth/issue related changes in the texture/L1$ architecture between different GCN revisions.\n\n### AMD GCN5 7nm (Radeon VII)\n```markdown\nBuffer\u003cR8\u003e.Load uniform: 5.214ms 3.667x\nBuffer\u003cR8\u003e.Load linear: 5.332ms 3.586x\nBuffer\u003cR8\u003e.Load random: 18.861ms 1.014x\nBuffer\u003cRG8\u003e.Load uniform: 18.917ms 1.011x\nBuffer\u003cRG8\u003e.Load linear: 18.904ms 1.011x\nBuffer\u003cRG8\u003e.Load random: 18.885ms 1.012x\nBuffer\u003cRGBA8\u003e.Load uniform: 18.882ms 1.013x\nBuffer\u003cRGBA8\u003e.Load linear: 19.074ms 1.002x\nBuffer\u003cRGBA8\u003e.Load random: 19.119ms 1.000x\nBuffer\u003cR16f\u003e.Load uniform: 5.335ms 3.584x\nBuffer\u003cR16f\u003e.Load linear: 5.547ms 3.447x\nBuffer\u003cR16f\u003e.Load random: 18.872ms 1.013x\nBuffer\u003cRG16f\u003e.Load uniform: 19.080ms 1.002x\nBuffer\u003cRG16f\u003e.Load linear: 18.911ms 1.011x\nBuffer\u003cRG16f\u003e.Load random: 18.996ms 1.007x\nBuffer\u003cRGBA16f\u003e.Load uniform: 18.879ms 1.013x\nBuffer\u003cRGBA16f\u003e.Load linear: 19.340ms 0.989x\nBuffer\u003cRGBA16f\u003e.Load random: 18.985ms 1.007x\nBuffer\u003cR32f\u003e.Load uniform: 5.337ms 3.582x\nBuffer\u003cR32f\u003e.Load linear: 5.548ms 3.446x\nBuffer\u003cR32f\u003e.Load random: 18.873ms 1.013x\nBuffer\u003cRG32f\u003e.Load uniform: 19.130ms 0.999x\nBuffer\u003cRG32f\u003e.Load linear: 18.934ms 1.010x\nBuffer\u003cRG32f\u003e.Load random: 19.100ms 1.001x\nBuffer\u003cRGBA32f\u003e.Load uniform: 18.880ms 1.013x\nBuffer\u003cRGBA32f\u003e.Load linear: 19.383ms 0.986x\nBuffer\u003cRGBA32f\u003e.Load random: 21.310ms 0.897x\nByteAddressBuffer.Load uniform: 4.285ms 4.462x\nByteAddressBuffer.Load linear: 5.542ms 3.450x\nByteAddressBuffer.Load random: 18.869ms 1.013x\nByteAddressBuffer.Load2 uniform: 5.209ms 3.671x\nByteAddressBuffer.Load2 linear: 19.266ms 0.992x\nByteAddressBuffer.Load2 random: 19.005ms 1.006x\nByteAddressBuffer.Load3 uniform: 7.454ms 
2.565x\nByteAddressBuffer.Load3 linear: 19.190ms 0.996x\nByteAddressBuffer.Load3 random: 37.705ms 0.507x\nByteAddressBuffer.Load4 uniform: 9.604ms 1.991x\nByteAddressBuffer.Load4 linear: 19.455ms 0.983x\nByteAddressBuffer.Load4 random: 21.360ms 0.895x\nByteAddressBuffer.Load2 unaligned uniform: 5.083ms 3.761x\nByteAddressBuffer.Load2 unaligned linear: 19.190ms 0.996x\nByteAddressBuffer.Load2 unaligned random: 18.920ms 1.011x\nByteAddressBuffer.Load4 unaligned uniform: 9.600ms 1.992x\nByteAddressBuffer.Load4 unaligned linear: 19.234ms 0.994x\nByteAddressBuffer.Load4 unaligned random: 23.485ms 0.814x\nStructuredBuffer\u003cfloat\u003e.Load uniform: 5.360ms 3.567x\nStructuredBuffer\u003cfloat\u003e.Load linear: 5.335ms 3.584x\nStructuredBuffer\u003cfloat\u003e.Load random: 18.879ms 1.013x\nStructuredBuffer\u003cfloat2\u003e.Load uniform: 5.494ms 3.480x\nStructuredBuffer\u003cfloat2\u003e.Load linear: 18.943ms 1.009x\nStructuredBuffer\u003cfloat2\u003e.Load random: 18.898ms 1.012x\nStructuredBuffer\u003cfloat4\u003e.Load uniform: 5.576ms 3.429x\nStructuredBuffer\u003cfloat4\u003e.Load linear: 19.237ms 0.994x\nStructuredBuffer\u003cfloat4\u003e.Load random: 21.314ms 0.897x\ncbuffer{float4} load uniform: 6.819ms 2.804x\ncbuffer{float4} load linear: 19.731ms 0.969x\ncbuffer{float4} load random: 21.368ms 0.895x\nTexture2D\u003cR8\u003e.Load uniform: 18.917ms 1.011x\nTexture2D\u003cR8\u003e.Load linear: 19.067ms 1.003x\nTexture2D\u003cR8\u003e.Load random: 18.925ms 1.010x\nTexture2D\u003cRG8\u003e.Load uniform: 18.902ms 1.011x\nTexture2D\u003cRG8\u003e.Load linear: 18.952ms 1.009x\nTexture2D\u003cRG8\u003e.Load random: 18.888ms 1.012x\nTexture2D\u003cRGBA8\u003e.Load uniform: 19.000ms 1.006x\nTexture2D\u003cRGBA8\u003e.Load linear: 19.137ms 0.999x\nTexture2D\u003cRGBA8\u003e.Load random: 18.965ms 1.008x\nTexture2D\u003cR16F\u003e.Load uniform: 18.919ms 1.011x\nTexture2D\u003cR16F\u003e.Load linear: 18.936ms 1.010x\nTexture2D\u003cR16F\u003e.Load random: 19.034ms 
1.004x\nTexture2D\u003cRG16F\u003e.Load uniform: 18.910ms 1.011x\nTexture2D\u003cRG16F\u003e.Load linear: 18.971ms 1.008x\nTexture2D\u003cRG16F\u003e.Load random: 18.895ms 1.012x\nTexture2D\u003cRGBA16F\u003e.Load uniform: 18.918ms 1.011x\nTexture2D\u003cRGBA16F\u003e.Load linear: 19.094ms 1.001x\nTexture2D\u003cRGBA16F\u003e.Load random: 37.952ms 0.504x\nTexture2D\u003cR32F\u003e.Load uniform: 18.904ms 1.011x\nTexture2D\u003cR32F\u003e.Load linear: 19.040ms 1.004x\nTexture2D\u003cR32F\u003e.Load random: 19.053ms 1.003x\nTexture2D\u003cRG32F\u003e.Load uniform: 18.942ms 1.009x\nTexture2D\u003cRG32F\u003e.Load linear: 18.999ms 1.006x\nTexture2D\u003cRG32F\u003e.Load random: 37.708ms 0.507x\nTexture2D\u003cRGBA32F\u003e.Load uniform: 18.919ms 1.011x\nTexture2D\u003cRGBA32F\u003e.Load linear: 28.305ms 0.675x\nTexture2D\u003cRGBA32F\u003e.Load random: 37.705ms 0.507x\n```\n**AMD GCN5 7nm** is a 7nm die shrink of Vega. It has higher clocks, but four disabled CUs (60 vs 64). Results (ratios) are identical to GCN2/3/4/5. 
See GCN2 for analysis.\n\n### AMD Navi (RX 5700 XT)\n```markdown\nBuffer\u003cR8\u003e.Load uniform: 5.289ms 2.385x\nBuffer\u003cR8\u003e.Load linear: 4.874ms 2.588x\nBuffer\u003cR8\u003e.Load random: 4.656ms 2.710x\nBuffer\u003cRG8\u003e.Load uniform: 5.986ms 2.108x\nBuffer\u003cRG8\u003e.Load linear: 6.514ms 1.937x\nBuffer\u003cRG8\u003e.Load random: 6.115ms 2.063x\nBuffer\u003cRGBA8\u003e.Load uniform: 12.519ms 1.008x\nBuffer\u003cRGBA8\u003e.Load linear: 12.985ms 0.972x\nBuffer\u003cRGBA8\u003e.Load random: 12.617ms 1.000x\nBuffer\u003cR16f\u003e.Load uniform: 4.769ms 2.645x\nBuffer\u003cR16f\u003e.Load linear: 4.599ms 2.744x\nBuffer\u003cR16f\u003e.Load random: 4.687ms 2.692x\nBuffer\u003cRG16f\u003e.Load uniform: 6.210ms 2.032x\nBuffer\u003cRG16f\u003e.Load linear: 6.164ms 2.047x\nBuffer\u003cRG16f\u003e.Load random: 6.170ms 2.045x\nBuffer\u003cRGBA16f\u003e.Load uniform: 12.838ms 0.983x\nBuffer\u003cRGBA16f\u003e.Load linear: 13.138ms 0.960x\nBuffer\u003cRGBA16f\u003e.Load random: 12.725ms 0.991x\nBuffer\u003cR32f\u003e.Load uniform: 4.818ms 2.619x\nBuffer\u003cR32f\u003e.Load linear: 4.697ms 2.686x\nBuffer\u003cR32f\u003e.Load random: 4.771ms 2.644x\nBuffer\u003cRG32f\u003e.Load uniform: 6.295ms 2.004x\nBuffer\u003cRG32f\u003e.Load linear: 6.223ms 2.027x\nBuffer\u003cRG32f\u003e.Load random: 6.217ms 2.029x\nBuffer\u003cRGBA32f\u003e.Load uniform: 13.099ms 0.963x\nBuffer\u003cRGBA32f\u003e.Load linear: 13.312ms 0.948x\nBuffer\u003cRGBA32f\u003e.Load random: 12.819ms 0.984x\nByteAddressBuffer.Load uniform: 7.299ms 1.728x\nByteAddressBuffer.Load linear: 6.361ms 1.983x\nByteAddressBuffer.Load random: 6.279ms 2.009x\nByteAddressBuffer.Load2 uniform: 6.913ms 1.825x\nByteAddressBuffer.Load2 linear: 9.648ms 1.308x\nByteAddressBuffer.Load2 random: 9.693ms 1.302x\nByteAddressBuffer.Load3 uniform: 9.650ms 1.307x\nByteAddressBuffer.Load3 linear: 13.069ms 0.965x\nByteAddressBuffer.Load3 random: 26.009ms 0.485x\nByteAddressBuffer.Load4 uniform: 12.956ms 
0.974x\nByteAddressBuffer.Load4 linear: 16.076ms 0.785x\nByteAddressBuffer.Load4 random: 16.332ms 0.773x\nByteAddressBuffer.Load2 unaligned uniform: 7.340ms 1.719x\nByteAddressBuffer.Load2 unaligned linear: 12.697ms 0.994x\nByteAddressBuffer.Load2 unaligned random: 12.598ms 1.001x\nByteAddressBuffer.Load4 unaligned uniform: 13.019ms 0.969x\nByteAddressBuffer.Load4 unaligned linear: 19.027ms 0.663x\nByteAddressBuffer.Load4 unaligned random: 25.387ms 0.497x\nStructuredBuffer\u003cfloat\u003e.Load uniform: 9.047ms 1.395x\nStructuredBuffer\u003cfloat\u003e.Load linear: 5.461ms 2.310x\nStructuredBuffer\u003cfloat\u003e.Load random: 4.722ms 2.672x\nStructuredBuffer\u003cfloat2\u003e.Load uniform: 8.770ms 1.439x\nStructuredBuffer\u003cfloat2\u003e.Load linear: 6.795ms 1.857x\nStructuredBuffer\u003cfloat2\u003e.Load random: 6.074ms 2.077x\nStructuredBuffer\u003cfloat4\u003e.Load uniform: 9.013ms 1.400x\nStructuredBuffer\u003cfloat4\u003e.Load linear: 12.948ms 0.974x\nStructuredBuffer\u003cfloat4\u003e.Load random: 12.428ms 1.015x\ncbuffer{float4} load uniform: 9.561ms 1.320x\ncbuffer{float4} load linear: 13.446ms 0.938x\ncbuffer{float4} load random: 12.445ms 1.014x\nTexture2D\u003cR8\u003e.Load uniform: 6.537ms 1.930x\nTexture2D\u003cR8\u003e.Load linear: 6.652ms 1.897x\nTexture2D\u003cR8\u003e.Load random: 6.474ms 1.949x\nTexture2D\u003cRG8\u003e.Load uniform: 6.652ms 1.897x\nTexture2D\u003cRG8\u003e.Load linear: 6.606ms 1.910x\nTexture2D\u003cRG8\u003e.Load random: 6.644ms 1.899x\nTexture2D\u003cRGBA8\u003e.Load uniform: 12.992ms 0.971x\nTexture2D\u003cRGBA8\u003e.Load linear: 13.012ms 0.970x\nTexture2D\u003cRGBA8\u003e.Load random: 12.877ms 0.980x\nTexture2D\u003cR16F\u003e.Load uniform: 6.655ms 1.896x\nTexture2D\u003cR16F\u003e.Load linear: 6.596ms 1.913x\nTexture2D\u003cR16F\u003e.Load random: 6.476ms 1.948x\nTexture2D\u003cRG16F\u003e.Load uniform: 6.612ms 1.908x\nTexture2D\u003cRG16F\u003e.Load linear: 6.697ms 1.884x\nTexture2D\u003cRG16F\u003e.Load random: 6.436ms 
1.960x\nTexture2D\u003cRGBA16F\u003e.Load uniform: 12.956ms 0.974x\nTexture2D\u003cRGBA16F\u003e.Load linear: 12.988ms 0.971x\nTexture2D\u003cRGBA16F\u003e.Load random: 12.856ms 0.981x\nTexture2D\u003cR32F\u003e.Load uniform: 6.651ms 1.897x\nTexture2D\u003cR32F\u003e.Load linear: 6.732ms 1.874x\nTexture2D\u003cR32F\u003e.Load random: 6.469ms 1.950x\nTexture2D\u003cRG32F\u003e.Load uniform: 6.627ms 1.904x\nTexture2D\u003cRG32F\u003e.Load linear: 12.954ms 0.974x\nTexture2D\u003cRG32F\u003e.Load random: 6.450ms 1.956x\nTexture2D\u003cRGBA32F\u003e.Load uniform: 12.949ms 0.974x\nTexture2D\u003cRGBA32F\u003e.Load linear: 12.953ms 0.974x\nTexture2D\u003cRGBA32F\u003e.Load random: 12.804ms 0.985x\n```\n**AMD Navi** TODO.\n\n### NVidia Maxwell (GTX 980 Ti)\n```markdown\nBuffer\u003cR8\u003e.Load uniform: 1.249ms 28.812x\nBuffer\u003cR8\u003e.Load linear: 34.105ms 1.055x\nBuffer\u003cR8\u003e.Load random: 34.187ms 1.053x\nBuffer\u003cRG8\u003e.Load uniform: 1.847ms 19.485x\nBuffer\u003cRG8\u003e.Load linear: 34.106ms 1.055x\nBuffer\u003cRG8\u003e.Load random: 34.477ms 1.044x\nBuffer\u003cRGBA8\u003e.Load uniform: 2.452ms 14.680x\nBuffer\u003cRGBA8\u003e.Load linear: 35.773ms 1.006x\nBuffer\u003cRGBA8\u003e.Load random: 35.996ms 1.000x\nBuffer\u003cR16f\u003e.Load uniform: 1.491ms 24.148x\nBuffer\u003cR16f\u003e.Load linear: 34.077ms 1.056x\nBuffer\u003cR16f\u003e.Load random: 34.463ms 1.044x\nBuffer\u003cRG16f\u003e.Load uniform: 1.916ms 18.785x\nBuffer\u003cRG16f\u003e.Load linear: 34.229ms 1.052x\nBuffer\u003cRG16f\u003e.Load random: 34.597ms 1.040x\nBuffer\u003cRGBA16f\u003e.Load uniform: 2.519ms 14.291x\nBuffer\u003cRGBA16f\u003e.Load linear: 35.787ms 1.006x\nBuffer\u003cRGBA16f\u003e.Load random: 35.996ms 1.000x\nBuffer\u003cR32f\u003e.Load uniform: 1.478ms 24.350x\nBuffer\u003cR32f\u003e.Load linear: 34.098ms 1.056x\nBuffer\u003cR32f\u003e.Load random: 34.353ms 1.048x\nBuffer\u003cRG32f\u003e.Load uniform: 1.845ms 19.514x\nBuffer\u003cRG32f\u003e.Load linear: 34.138ms 
1.054x\nBuffer\u003cRG32f\u003e.Load random: 34.495ms 1.044x\nBuffer\u003cRGBA32f\u003e.Load uniform: 2.374ms 15.163x\nBuffer\u003cRGBA32f\u003e.Load linear: 67.973ms 0.530x\nBuffer\u003cRGBA32f\u003e.Load random: 68.054ms 0.529x\nByteAddressBuffer.Load uniform: 21.403ms 1.682x\nByteAddressBuffer.Load linear: 21.906ms 1.643x\nByteAddressBuffer.Load random: 24.336ms 1.479x\nByteAddressBuffer.Load2 uniform: 45.620ms 0.789x\nByteAddressBuffer.Load2 linear: 55.815ms 0.645x\nByteAddressBuffer.Load2 random: 48.744ms 0.738x\nByteAddressBuffer.Load3 uniform: 52.929ms 0.680x\nByteAddressBuffer.Load3 linear: 79.057ms 0.455x\nByteAddressBuffer.Load3 random: 93.636ms 0.384x\nByteAddressBuffer.Load4 uniform: 68.510ms 0.525x\nByteAddressBuffer.Load4 linear: 114.561ms 0.314x\nByteAddressBuffer.Load4 random: 209.280ms 0.172x\nByteAddressBuffer.Load2 unaligned uniform: 45.640ms 0.789x\nByteAddressBuffer.Load2 unaligned linear: 55.802ms 0.645x\nByteAddressBuffer.Load2 unaligned random: 48.717ms 0.739x\nByteAddressBuffer.Load4 unaligned uniform: 68.685ms 0.524x\nByteAddressBuffer.Load4 unaligned linear: 115.244ms 0.312x\nByteAddressBuffer.Load4 unaligned random: 210.358ms 0.171x\nStructuredBuffer\u003cfloat\u003e.Load uniform: 1.116ms 32.267x\nStructuredBuffer\u003cfloat\u003e.Load linear: 34.094ms 1.056x\nStructuredBuffer\u003cfloat\u003e.Load random: 34.092ms 1.056x\nStructuredBuffer\u003cfloat2\u003e.Load uniform: 1.569ms 22.942x\nStructuredBuffer\u003cfloat2\u003e.Load linear: 34.143ms 1.054x\nStructuredBuffer\u003cfloat2\u003e.Load random: 34.125ms 1.055x\nStructuredBuffer\u003cfloat4\u003e.Load uniform: 2.087ms 17.245x\nStructuredBuffer\u003cfloat4\u003e.Load linear: 67.959ms 0.530x\nStructuredBuffer\u003cfloat4\u003e.Load random: 67.950ms 0.530x\ncbuffer{float4} load uniform: 1.298ms 27.733x\ncbuffer{float4} load linear: 798.703ms 0.045x\ncbuffer{float4} load random: 324.356ms 0.111x\nTexture2D\u003cR8\u003e.Load uniform: 1.962ms 18.351x\nTexture2D\u003cR8\u003e.Load linear: 
34.027ms 1.058x\nTexture2D\u003cR8\u003e.Load random: 34.029ms 1.058x\nTexture2D\u003cRG8\u003e.Load uniform: 1.994ms 18.054x\nTexture2D\u003cRG8\u003e.Load linear: 34.334ms 1.048x\nTexture2D\u003cRG8\u003e.Load random: 34.102ms 1.056x\nTexture2D\u003cRGBA8\u003e.Load uniform: 2.247ms 16.018x\nTexture2D\u003cRGBA8\u003e.Load linear: 36.077ms 0.998x\nTexture2D\u003cRGBA8\u003e.Load random: 35.930ms 1.002x\nTexture2D\u003cR16F\u003e.Load uniform: 2.021ms 17.814x\nTexture2D\u003cR16F\u003e.Load linear: 34.040ms 1.057x\nTexture2D\u003cR16F\u003e.Load random: 34.021ms 1.058x\nTexture2D\u003cRG16F\u003e.Load uniform: 2.020ms 17.822x\nTexture2D\u003cRG16F\u003e.Load linear: 34.308ms 1.049x\nTexture2D\u003cRG16F\u003e.Load random: 34.095ms 1.056x\nTexture2D\u003cRGBA16F\u003e.Load uniform: 2.199ms 16.372x\nTexture2D\u003cRGBA16F\u003e.Load linear: 36.074ms 0.998x\nTexture2D\u003cRGBA16F\u003e.Load random: 68.064ms 0.529x\nTexture2D\u003cR32F\u003e.Load uniform: 2.014ms 17.869x\nTexture2D\u003cR32F\u003e.Load linear: 34.042ms 1.057x\nTexture2D\u003cR32F\u003e.Load random: 34.028ms 1.058x\nTexture2D\u003cRG32F\u003e.Load uniform: 1.981ms 18.166x\nTexture2D\u003cRG32F\u003e.Load linear: 34.320ms 1.049x\nTexture2D\u003cRG32F\u003e.Load random: 67.948ms 0.530x\nTexture2D\u003cRGBA32F\u003e.Load uniform: 2.064ms 17.440x\nTexture2D\u003cRGBA32F\u003e.Load linear: 67.974ms 0.530x\nTexture2D\u003cRGBA32F\u003e.Load random: 68.049ms 0.529x\n```\n\n**Typed loads:** Maxwell doesn't coalesce any typed loads. Dimensions (1d/2d/4d) and channel widths (8b/16b/32b) don't directly affect performance. All up to 64 bit loads are full rate. 128 bit loads are half rate (only RGBA32). Best bytes per cycle rate can be achieved by 64+ bit loads (RGBA16, RG32, RGBA32).\n\n**Raw (ByteAddressBuffer) loads:** Oddly we see no coalescing here either. CUDA code shows big performance improvement with similar linear access pattern. All 1d raw loads are as fast as typed buffer loads. 
However NV doesn't seem to emit wide raw loads either. 2d is exactly 2x slower, 3d is 3x slower and 4d is 4x slower than 1d. NVIDIA supports 64 bit and 128 bit wide raw loads in CUDA, see: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-increase-performance-with-vectorized-memory-access/. Wide loads in CUDA however require memory alignment (8/16 bytes). My test case is perfectly aligned, but the HLSL ByteAddressBuffer.Load4() specification only requires alignment of 4. In the general case it's hard to prove alignment of 16 (in my code there's an explicit multiplication of the address by 16).\n\n**Structured buffer loads:** Structured buffer loads guarantee natural alignment. Nvidia has full rate 1d and 2d structured buffer loads. But 4d loads (128 bit) are half rate as usual.\n\n**Texture loads:** Similar performance to typed buffer loads. Random access of wide formats tends to be slightly slower (but my 2d random produces a different access pattern than 1d).\n\n**Cbuffer loads:** Nvidia Maxwell (and newer GPUs) have a special constant buffer hardware unit. Uniform address constant buffer loads are up to 32x faster (warp width) than standard memory loads. However non-uniform constant buffer loads are dead slow. Nvidia CUDA documents tell us that a constant buffer load gets serialized for each unique address. Thus we can see up to a 32x performance drop compared to the best case. But in my test case (each lane = different address), we see up to a 200x slowdown. This result tells us that there's likely a small constant buffer cache on each SM, and if your access pattern is bad enough, this cache starts to thrash badly. Unfortunately Nvidia doesn't provide us a public document describing best practices to avoid this pitfall.\n\n**Uniform address optimization:** New Nvidia drivers introduced a shader compiler based uniform address load optimization for loops. This speeds up the loads in these cases by up to 28x (close to 32x theoretical warp width). 
This new optimization is awesome, because previously Nvidia had to lean fully on their constant buffer hardware for good uniform load performance. As we all know, constant buffers are very limited (vec4 arrays only and 64KB size limit). See the \"Uniform Address Load Investigation\" chapter for more info.\n\n**Suggestions:** Prefer 64+ bit typed loads (RGBA16, RG32, RGBA32). ByteAddressBuffer wide loads and coalescing don't seem to work in DirectX. Uniform address loads (inside a loop using the loop index) have a fast path. Use it whenever possible. This results in similar performance to using constant buffer hardware, but supports all buffer and texture types. No size or alignment limitations!\n\n### NVidia Pascal (GTX 1070 Ti)\n```markdown\nBuffer\u003cR8\u003e.Load uniform: 0.845ms 36.835x\nBuffer\u003cR8\u003e.Load linear: 29.335ms 1.061x\nBuffer\u003cR8\u003e.Load random: 28.981ms 1.074x\nBuffer\u003cRG8\u003e.Load uniform: 1.151ms 27.036x\nBuffer\u003cRG8\u003e.Load linear: 30.267ms 1.028x\nBuffer\u003cRG8\u003e.Load random: 29.359ms 1.060x\nBuffer\u003cRGBA8\u003e.Load uniform: 1.534ms 20.286x\nBuffer\u003cRGBA8\u003e.Load linear: 31.214ms 0.997x\nBuffer\u003cRGBA8\u003e.Load random: 31.118ms 1.000x\nBuffer\u003cR16f\u003e.Load uniform: 0.808ms 38.516x\nBuffer\u003cR16f\u003e.Load linear: 28.943ms 1.075x\nBuffer\u003cR16f\u003e.Load random: 29.870ms 1.042x\nBuffer\u003cRG16f\u003e.Load uniform: 1.119ms 27.803x\nBuffer\u003cRG16f\u003e.Load linear: 29.458ms 1.056x\nBuffer\u003cRG16f\u003e.Load random: 29.904ms 1.041x\nBuffer\u003cRGBA16f\u003e.Load uniform: 1.467ms 21.207x\nBuffer\u003cRGBA16f\u003e.Load linear: 31.222ms 0.997x\nBuffer\u003cRGBA16f\u003e.Load random: 30.223ms 1.030x\nBuffer\u003cR32f\u003e.Load uniform: 0.847ms 36.746x\nBuffer\u003cR32f\u003e.Load linear: 30.240ms 1.029x\nBuffer\u003cR32f\u003e.Load random: 28.963ms 1.074x\nBuffer\u003cRG32f\u003e.Load uniform: 1.087ms 28.615x\nBuffer\u003cRG32f\u003e.Load linear: 30.391ms 
1.024x\nBuffer\u003cRG32f\u003e.Load random: 29.475ms 1.056x\nBuffer\u003cRGBA32f\u003e.Load uniform: 1.434ms 21.706x\nBuffer\u003cRGBA32f\u003e.Load linear: 59.394ms 0.524x\nBuffer\u003cRGBA32f\u003e.Load random: 57.593ms 0.540x\nByteAddressBuffer.Load uniform: 18.151ms 1.714x\nByteAddressBuffer.Load linear: 18.451ms 1.686x\nByteAddressBuffer.Load random: 21.305ms 1.461x\nByteAddressBuffer.Load2 uniform: 41.123ms 0.757x\nByteAddressBuffer.Load2 linear: 40.461ms 0.769x\nByteAddressBuffer.Load2 random: 49.244ms 0.632x\nByteAddressBuffer.Load3 uniform: 44.836ms 0.694x\nByteAddressBuffer.Load3 linear: 65.966ms 0.472x\nByteAddressBuffer.Load3 random: 77.712ms 0.400x\nByteAddressBuffer.Load4 uniform: 58.439ms 0.532x\nByteAddressBuffer.Load4 linear: 97.260ms 0.320x\nByteAddressBuffer.Load4 random: 174.779ms 0.178x\nByteAddressBuffer.Load2 unaligned uniform: 41.147ms 0.756x\nByteAddressBuffer.Load2 unaligned linear: 40.483ms 0.769x\nByteAddressBuffer.Load2 unaligned random: 55.911ms 0.557x\nByteAddressBuffer.Load4 unaligned uniform: 58.126ms 0.535x\nByteAddressBuffer.Load4 unaligned linear: 99.081ms 0.314x\nByteAddressBuffer.Load4 unaligned random: 179.514ms 0.173x\nStructuredBuffer\u003cfloat\u003e.Load uniform: 0.887ms 35.091x\nStructuredBuffer\u003cfloat\u003e.Load linear: 29.878ms 1.042x\nStructuredBuffer\u003cfloat\u003e.Load random: 29.408ms 1.058x\nStructuredBuffer\u003cfloat2\u003e.Load uniform: 1.141ms 27.279x\nStructuredBuffer\u003cfloat2\u003e.Load linear: 30.575ms 1.018x\nStructuredBuffer\u003cfloat2\u003e.Load random: 28.985ms 1.074x\nStructuredBuffer\u003cfloat4\u003e.Load uniform: 1.523ms 20.436x\nStructuredBuffer\u003cfloat4\u003e.Load linear: 58.493ms 0.532x\nStructuredBuffer\u003cfloat4\u003e.Load random: 58.546ms 0.532x\ncbuffer{float4} load uniform: 1.390ms 22.394x\ncbuffer{float4} load linear: 684.120ms 0.045x\ncbuffer{float4} load random: 273.085ms 0.114x\nTexture2D\u003cR8\u003e.Load uniform: 1.627ms 19.125x\nTexture2D\u003cR8\u003e.Load linear: 
28.924ms 1.076x\nTexture2D\u003cR8\u003e.Load random: 28.923ms 1.076x\nTexture2D\u003cRG8\u003e.Load uniform: 1.378ms 22.577x\nTexture2D\u003cRG8\u003e.Load linear: 29.041ms 1.072x\nTexture2D\u003cRG8\u003e.Load random: 28.938ms 1.075x\nTexture2D\u003cRGBA8\u003e.Load uniform: 1.563ms 19.914x\nTexture2D\u003cRGBA8\u003e.Load linear: 30.666ms 1.015x\nTexture2D\u003cRGBA8\u003e.Load random: 30.334ms 1.026x\nTexture2D\u003cR16F\u003e.Load uniform: 1.313ms 23.704x\nTexture2D\u003cR16F\u003e.Load linear: 28.961ms 1.074x\nTexture2D\u003cR16F\u003e.Load random: 28.968ms 1.074x\nTexture2D\u003cRG16F\u003e.Load uniform: 1.360ms 22.883x\nTexture2D\u003cRG16F\u003e.Load linear: 29.048ms 1.071x\nTexture2D\u003cRG16F\u003e.Load random: 28.926ms 1.076x\nTexture2D\u003cRGBA16F\u003e.Load uniform: 1.501ms 20.729x\nTexture2D\u003cRGBA16F\u003e.Load linear: 30.649ms 1.015x\nTexture2D\u003cRGBA16F\u003e.Load random: 57.629ms 0.540x\nTexture2D\u003cR32F\u003e.Load uniform: 1.384ms 22.477x\nTexture2D\u003cR32F\u003e.Load linear: 28.955ms 1.075x\nTexture2D\u003cR32F\u003e.Load random: 28.968ms 1.074x\nTexture2D\u003cRG32F\u003e.Load uniform: 1.408ms 22.101x\nTexture2D\u003cRG32F\u003e.Load linear: 29.056ms 1.071x\nTexture2D\u003cRG32F\u003e.Load random: 57.672ms 0.540x\nTexture2D\u003cRGBA32F\u003e.Load uniform: 1.538ms 20.232x\nTexture2D\u003cRGBA32F\u003e.Load linear: 57.653ms 0.540x\nTexture2D\u003cRGBA32F\u003e.Load random: 57.557ms 0.541x\n```\n**NVIDIA Pascal** results (ratios) are identical to Maxwell. See Maxwell for analysis. 
Clock and SM scaling reveal that there's no bandwidth/issue related changes in the texture/L1$ architecture between Maxwell and Pascal.\n\n### NVIDIA Kepler (600/700 series)\n```markdown\nBuffer\u003cR8\u003e.Load uniform: 3.073ms 62.440x\nBuffer\u003cR8\u003e.Load linear: 195.662ms 0.981x\nBuffer\u003cR8\u003e.Load random: 197.022ms 0.974x\nBuffer\u003cRG8\u003e.Load uniform: 3.227ms 59.465x\nBuffer\u003cRG8\u003e.Load linear: 195.179ms 0.983x\nBuffer\u003cRG8\u003e.Load random: 196.785ms 0.975x\nBuffer\u003cRGBA8\u003e.Load uniform: 3.598ms 53.329x\nBuffer\u003cRGBA8\u003e.Load linear: 193.676ms 0.991x\nBuffer\u003cRGBA8\u003e.Load random: 191.866ms 1.000x\nBuffer\u003cR16f\u003e.Load uniform: 3.031ms 63.308x\nBuffer\u003cR16f\u003e.Load linear: 195.622ms 0.981x\nBuffer\u003cR16f\u003e.Load random: 197.009ms 0.974x\nBuffer\u003cRG16f\u003e.Load uniform: 3.025ms 63.434x\nBuffer\u003cRG16f\u003e.Load linear: 195.135ms 0.983x\nBuffer\u003cRG16f\u003e.Load random: 196.860ms 0.975x\nBuffer\u003cRGBA16f\u003e.Load uniform: 3.443ms 55.728x\nBuffer\u003cRGBA16f\u003e.Load linear: 193.744ms 0.990x\nBuffer\u003cRGBA16f\u003e.Load random: 191.929ms 1.000x\nBuffer\u003cR32f\u003e.Load uniform: 2.970ms 64.605x\nBuffer\u003cR32f\u003e.Load linear: 195.751ms 0.980x\nBuffer\u003cR32f\u003e.Load random: 197.141ms 0.973x\nBuffer\u003cRG32f\u003e.Load uniform: 3.175ms 60.425x\nBuffer\u003cRG32f\u003e.Load linear: 195.351ms 0.982x\nBuffer\u003cRG32f\u003e.Load random: 196.911ms 0.974x\nBuffer\u003cRGBA32f\u003e.Load uniform: 3.621ms 52.985x\nBuffer\u003cRGBA32f\u003e.Load linear: 350.658ms 0.547x\nBuffer\u003cRGBA32f\u003e.Load random: 350.633ms 0.547x\nByteAddressBuffer.Load uniform: 3.758ms 51.055x\nByteAddressBuffer.Load linear: 191.898ms 1.000x\nByteAddressBuffer.Load random: 216.928ms 0.884x\nByteAddressBuffer.Load2 uniform: 4.682ms 40.977x\nByteAddressBuffer.Load2 linear: 390.852ms 0.491x\nByteAddressBuffer.Load2 random: 442.053ms 0.434x\nByteAddressBuffer.Load3 uniform: 
572.822ms 0.335x\nByteAddressBuffer.Load3 linear: 568.316ms 0.338x\nByteAddressBuffer.Load3 random: 570.361ms 0.336x\nByteAddressBuffer.Load4 uniform: 752.691ms 0.255x\nByteAddressBuffer.Load4 linear: 758.795ms 0.253x\nByteAddressBuffer.Load4 random: 763.638ms 0.251x\nByteAddressBuffer.Load2 unaligned uniform: 4.199ms 45.692x\nByteAddressBuffer.Load2 unaligned linear: 391.542ms 0.490x\nByteAddressBuffer.Load2 unaligned random: 442.574ms 0.434x\nByteAddressBuffer.Load4 unaligned uniform: 752.793ms 0.255x\nByteAddressBuffer.Load4 unaligned linear: 758.698ms 0.253x\nByteAddressBuffer.Load4 unaligned random: 763.679ms 0.251x\nStructuredBuffer\u003cfloat\u003e.Load uniform: 3.103ms 61.827x\nStructuredBuffer\u003cfloat\u003e.Load linear: 195.674ms 0.981x\nStructuredBuffer\u003cfloat\u003e.Load random: 196.991ms 0.974x\nStructuredBuffer\u003cfloat2\u003e.Load uniform: 3.301ms 58.120x\nStructuredBuffer\u003cfloat2\u003e.Load linear: 195.167ms 0.983x\nStructuredBuffer\u003cfloat2\u003e.Load random: 196.749ms 0.975x\nStructuredBuffer\u003cfloat4\u003e.Load uniform: 3.846ms 49.882x\nStructuredBuffer\u003cfloat4\u003e.Load linear: 350.461ms 0.547x\nStructuredBuffer\u003cfloat4\u003e.Load random: 350.494ms 0.547x\ncbuffer{float4} load uniform: 4.478ms 42.844x\ncbuffer{float4} load linear: 9217.404ms 0.021x\ncbuffer{float4} load random: 3333.476ms 0.058x\nTexture2D\u003cR8\u003e.Load uniform: 3.384ms 56.695x\nTexture2D\u003cR8\u003e.Load linear: 202.197ms 0.949x\nTexture2D\u003cR8\u003e.Load random: 204.327ms 0.939x\nTexture2D\u003cRG8\u003e.Load uniform: 3.731ms 51.424x\nTexture2D\u003cRG8\u003e.Load linear: 198.542ms 0.966x\nTexture2D\u003cRG8\u003e.Load random: 211.881ms 0.906x\nTexture2D\u003cRGBA8\u003e.Load uniform: 4.306ms 44.558x\nTexture2D\u003cRGBA8\u003e.Load linear: 196.088ms 0.978x\nTexture2D\u003cRGBA8\u003e.Load random: 195.847ms 0.980x\nTexture2D\u003cR16F\u003e.Load uniform: 3.419ms 56.118x\nTexture2D\u003cR16F\u003e.Load linear: 202.264ms 
0.949x\nTexture2D\u003cR16F\u003e.Load random: 204.311ms 0.939x\nTexture2D\u003cRG16F\u003e.Load uniform: 3.673ms 52.243x\nTexture2D\u003cRG16F\u003e.Load linear: 198.553ms 0.966x\nTexture2D\u003cRG16F\u003e.Load random: 211.917ms 0.905x\nTexture2D\u003cRGBA16F\u003e.Load uniform: 4.115ms 46.626x\nTexture2D\u003cRGBA16F\u003e.Load linear: 196.084ms 0.978x\nTexture2D\u003cRGBA16F\u003e.Load random: 350.561ms 0.547x\nTexture2D\u003cR32F\u003e.Load uniform: 3.517ms 54.547x\nTexture2D\u003cR32F\u003e.Load linear: 202.339ms 0.948x\nTexture2D\u003cR32F\u003e.Load random: 204.392ms 0.939x\nTexture2D\u003cRG32F\u003e.Load uniform: 3.705ms 51.783x\nTexture2D\u003cRG32F\u003e.Load linear: 198.537ms 0.966x\nTexture2D\u003cRG32F\u003e.Load random: 350.591ms 0.547x\nTexture2D\u003cRGBA32F\u003e.Load uniform: 4.028ms 47.637x\nTexture2D\u003cRGBA32F\u003e.Load linear: 350.589ms 0.547x\nTexture2D\u003cRGBA32F\u003e.Load random: 350.519ms 0.547x\n```\n\n**NVIDIA Kepler** results (ratios) are identical to Maxwell \u0026 Pascal. See Maxwell for analysis. 
Clock and SM scaling reveal that there's no bandwidth/issue related changes in the texture/L1$ architecture between Kepler, Maxwell and Pascal.\n\n### NVidia Volta (Titan V)\n```markdown\nBuffer\u003cR8\u003e.Load uniform: 2.241ms 8.139x\nBuffer\u003cR8\u003e.Load linear: 14.806ms 1.232x\nBuffer\u003cR8\u003e.Load random: 16.514ms 1.104x\nBuffer\u003cRG8\u003e.Load uniform: 4.576ms 3.985x\nBuffer\u003cRG8\u003e.Load linear: 16.397ms 1.112x\nBuffer\u003cRG8\u003e.Load random: 16.707ms 1.092x\nBuffer\u003cRGBA8\u003e.Load uniform: 5.155ms 3.538x\nBuffer\u003cRGBA8\u003e.Load linear: 16.726ms 1.090x\nBuffer\u003cRGBA8\u003e.Load random: 18.236ms 1.000x\nBuffer\u003cR16f\u003e.Load uniform: 2.807ms 6.497x\nBuffer\u003cR16f\u003e.Load linear: 14.771ms 1.235x\nBuffer\u003cR16f\u003e.Load random: 16.857ms 1.082x\nBuffer\u003cRG16f\u003e.Load uniform: 4.128ms 4.418x\nBuffer\u003cRG16f\u003e.Load linear: 16.155ms 1.129x\nBuffer\u003cRG16f\u003e.Load random: 15.140ms 1.205x\nBuffer\u003cRGBA16f\u003e.Load uniform: 4.747ms 3.841x\nBuffer\u003cRGBA16f\u003e.Load linear: 17.517ms 1.041x\nBuffer\u003cRGBA16f\u003e.Load random: 17.727ms 1.029x\nBuffer\u003cR32f\u003e.Load uniform: 2.630ms 6.935x\nBuffer\u003cR32f\u003e.Load linear: 17.341ms 1.052x\nBuffer\u003cR32f\u003e.Load random: 15.922ms 1.145x\nBuffer\u003cRG32f\u003e.Load uniform: 4.769ms 3.824x\nBuffer\u003cRG32f\u003e.Load linear: 15.745ms 1.158x\nBuffer\u003cRG32f\u003e.Load random: 15.801ms 1.154x\nBuffer\u003cRGBA32f\u003e.Load uniform: 4.772ms 3.822x\nBuffer\u003cRGBA32f\u003e.Load linear: 29.343ms 0.621x\nBuffer\u003cRGBA32f\u003e.Load random: 29.427ms 0.620x\nByteAddressBuffer.Load uniform: 8.948ms 2.038x\nByteAddressBuffer.Load linear: 8.722ms 2.091x\nByteAddressBuffer.Load random: 10.403ms 1.753x\nByteAddressBuffer.Load2 uniform: 10.132ms 1.800x\nByteAddressBuffer.Load2 linear: 11.406ms 1.599x\nByteAddressBuffer.Load2 random: 10.999ms 1.658x\nByteAddressBuffer.Load3 uniform: 12.638ms 
1.443x\nByteAddressBuffer.Load3 linear: 13.708ms 1.330x\nByteAddressBuffer.Load3 random: 14.081ms 1.295x\nByteAddressBuffer.Load4 uniform: 15.421ms 1.183x\nByteAddressBuffer.Load4 linear: 26.412ms 0.690x\nByteAddressBuffer.Load4 random: 18.078ms 1.009x\nByteAddressBuffer.Load2 unaligned uniform: 11.076ms 1.647x\nByteAddressBuffer.Load2 unaligned linear: 11.474ms 1.589x\nByteAddressBuffer.Load2 unaligned random: 12.227ms 1.492x\nByteAddressBuffer.Load4 unaligned uniform: 15.817ms 1.153x\nByteAddressBuffer.Load4 unaligned linear: 25.894ms 0.704x\nByteAddressBuffer.Load4 unaligned random: 18.138ms 1.005x\nStructuredBuffer\u003cfloat\u003e.Load uniform: 6.606ms 2.761x\nStructuredBuffer\u003cfloat\u003e.Load linear: 6.555ms 2.782x\nStructuredBuffer\u003cfloat\u003e.Load random: 9.063ms 2.012x\nStructuredBuffer\u003cfloat2\u003e.Load uniform: 8.332ms 2.189x\nStructuredBuffer\u003cfloat2\u003e.Load linear: 8.545ms 2.134x\nStructuredBuffer\u003cfloat2\u003e.Load random: 7.271ms 2.508x\nStructuredBuffer\u003cfloat4\u003e.Load uniform: 8.890ms 2.051x\nStructuredBuffer\u003cfloat4\u003e.Load linear: 9.650ms 1.890x\nStructuredBuffer\u003cfloat4\u003e.Load random: 9.677ms 1.885x\ncbuffer{float4} load uniform: 1.381ms 13.202x\ncbuffer{float4} load linear: 320.961ms 0.057x\ncbuffer{float4} load random: 150.072ms 0.122x\nTexture2D\u003cR8\u003e.Load uniform: 4.481ms 4.070x\nTexture2D\u003cR8\u003e.Load linear: 15.953ms 1.143x\nTexture2D\u003cR8\u003e.Load random: 15.058ms 1.211x\nTexture2D\u003cRG8\u003e.Load uniform: 4.594ms 3.970x\nTexture2D\u003cRG8\u003e.Load linear: 14.838ms 1.229x\nTexture2D\u003cRG8\u003e.Load random: 14.938ms 1.221x\nTexture2D\u003cRGBA8\u003e.Load uniform: 5.140ms 3.548x\nTexture2D\u003cRGBA8\u003e.Load linear: 14.915ms 1.223x\nTexture2D\u003cRGBA8\u003e.Load random: 15.031ms 1.213x\nTexture2D\u003cR16F\u003e.Load uniform: 5.748ms 3.173x\nTexture2D\u003cR16F\u003e.Load linear: 15.321ms 1.190x\nTexture2D\u003cR16F\u003e.Load random: 15.044ms 
1.212x\nTexture2D\u003cRG16F\u003e.Load uniform: 4.609ms 3.957x\nTexture2D\u003cRG16F\u003e.Load linear: 14.918ms 1.222x\nTexture2D\u003cRG16F\u003e.Load random: 14.851ms 1.228x\nTexture2D\u003cRGBA16F\u003e.Load uniform: 5.182ms 3.519x\nTexture2D\u003cRGBA16F\u003e.Load linear: 14.915ms 1.223x\nTexture2D\u003cRGBA16F\u003e.Load random: 29.841ms 0.611x\nTexture2D\u003cR32F\u003e.Load uniform: 4.462ms 4.087x\nTexture2D\u003cR32F\u003e.Load linear: 15.615ms 1.168x\nTexture2D\u003cR32F\u003e.Load random: 15.519ms 1.175x\nTexture2D\u003cRG32F\u003e.Load uniform: 4.585ms 3.977x\nTexture2D\u003cRG32F\u003e.Load linear: 16.651ms 1.095x\nTexture2D\u003cRG32F\u003e.Load random: 29.710ms 0.614x\nTexture2D\u003cRGBA32F\u003e.Load uniform: 5.163ms 3.532x\nTexture2D\u003cRGBA32F\u003e.Load linear: 29.970ms 0.608x\nTexture2D\u003cRGBA32F\u003e.Load random: 29.358ms 0.621x\n```\n\n**NVIDIA Volta** results (ratios) of most common load/sample operations are identical to Pascal. However there are some huge changes in raw load performance. Raw loads: 1d ~2x faster, 2d-4d ~4x faster (slightly more on 3d and 4d). Nvidia definitely seems to now use a faster direct memory path for raw loads. Raw loads are now the best choice on Nvidia hardware (which is the direct opposite of their last gen hardware). Independent studies of the Volta architecture show that their raw load L1$ latency also dropped from 85 cycles (Pascal) down to 28 cycles (Volta). This should make raw loads even more viable in real applications. My benchmark measures only throughput, so the latency improvement isn't visible.\n\n**Uniform address optimization:** Uniform address optimization no longer affects StructuredBuffers. My educated guess is that StructuredBuffers (like raw buffers) now use the same lower latency direct memory path. Nvidia most likely hasn't yet implemented uniform address optimization for these new memory operations. 
Another curiosity is that Volta also has much lower performance advantage in the uniform address optimized cases (versus any other Nvidia GPU, including Turing).\n\n### NVidia Turing (RTX 2080 Ti)\n```\nBuffer\u003cR8\u003e.Load uniform: 0.703ms 23.287x\nBuffer\u003cR8\u003e.Load linear: 16.179ms 1.011x\nBuffer\u003cR8\u003e.Load random: 15.435ms 1.060x\nBuffer\u003cRG8\u003e.Load uniform: 0.881ms 18.567x\nBuffer\u003cRG8\u003e.Load linear: 15.983ms 1.024x\nBuffer\u003cRG8\u003e.Load random: 17.044ms 0.960x\nBuffer\u003cRGBA8\u003e.Load uniform: 1.336ms 12.247x\nBuffer\u003cRGBA8\u003e.Load linear: 16.825ms 0.973x\nBuffer\u003cRGBA8\u003e.Load random: 16.364ms 1.000x\nBuffer\u003cR16f\u003e.Load uniform: 0.662ms 24.729x\nBuffer\u003cR16f\u003e.Load linear: 15.431ms 1.060x\nBuffer\u003cR16f\u003e.Load random: 15.916ms 1.028x\nBuffer\u003cRG16f\u003e.Load uniform: 0.870ms 18.811x\nBuffer\u003cRG16f\u003e.Load linear: 16.861ms 0.971x\nBuffer\u003cRG16f\u003e.Load random: 16.384ms 0.999x\nBuffer\u003cRGBA16f\u003e.Load uniform: 1.331ms 12.296x\nBuffer\u003cRGBA16f\u003e.Load linear: 15.892ms 1.030x\nBuffer\u003cRGBA16f\u003e.Load random: 15.949ms 1.026x\nBuffer\u003cR32f\u003e.Load uniform: 0.651ms 25.143x\nBuffer\u003cR32f\u003e.Load linear: 15.438ms 1.060x\nBuffer\u003cR32f\u003e.Load random: 16.851ms 0.971x\nBuffer\u003cRG32f\u003e.Load uniform: 1.369ms 11.953x\nBuffer\u003cRG32f\u003e.Load linear: 15.440ms 1.060x\nBuffer\u003cRG32f\u003e.Load random: 15.917ms 1.028x\nBuffer\u003cRGBA32f\u003e.Load uniform: 1.348ms 12.141x\nBuffer\u003cRGBA32f\u003e.Load linear: 30.695ms 0.533x\nBuffer\u003cRGBA32f\u003e.Load random: 32.514ms 0.503x\nByteAddressBuffer.Load uniform: 7.013ms 2.333x\nByteAddressBuffer.Load linear: 6.308ms 2.594x\nByteAddressBuffer.Load random: 7.347ms 2.227x\nByteAddressBuffer.Load2 uniform: 9.510ms 1.721x\nByteAddressBuffer.Load2 linear: 16.912ms 0.968x\nByteAddressBuffer.Load2 random: 9.715ms 1.684x\nByteAddressBuffer.Load3 uniform: 14.700ms 
1.113x\nByteAddressBuffer.Load3 linear: 19.200ms 0.852x\nByteAddressBuffer.Load3 random: 14.804ms 1.105x\nByteAddressBuffer.Load4 uniform: 18.228ms 0.898x\nByteAddressBuffer.Load4 linear: 43.493ms 0.376x\nByteAddressBuffer.Load4 random: 32.616ms 0.502x\nByteAddressBuffer.Load2 unaligned uniform: 9.549ms 1.714x\nByteAddressBuffer.Load2 unaligned linear: 16.901ms 0.968x\nByteAddressBuffer.Load2 unaligned random: 9.719ms 1.684x\nByteAddressBuffer.Load4 unaligned uniform: 18.218ms 0.898x\nByteAddressBuffer.Load4 unaligned linear: 41.476ms 0.395x\nByteAddressBuffer.Load4 unaligned random: 32.081ms 0.510x\nStructuredBuffer\u003cfloat\u003e.Load uniform: 6.535ms 2.504x\nStructuredBuffer\u003cfloat\u003e.Load linear: 6.706ms 2.440x\nStructuredBuffer\u003cfloat\u003e.Load random: 6.911ms 2.368x\nStructuredBuffer\u003cfloat2\u003e.Load uniform: 8.057ms 2.031x\nStructuredBuffer\u003cfloat2\u003e.Load linear: 16.874ms 0.970x\nStructuredBuffer\u003cfloat2\u003e.Load random: 8.374ms 1.954x\nStructuredBuffer\u003cfloat4\u003e.Load uniform: 15.491ms 1.056x\nStructuredBuffer\u003cfloat4\u003e.Load linear: 20.327ms 0.805x\nStructuredBuffer\u003cfloat4\u003e.Load random: 18.112ms 0.903x\ncbuffer{float4} load uniform: 0.834ms 19.616x\ncbuffer{float4} load linear: 328.935ms 0.050x\ncbuffer{float4} load random: 125.135ms 0.131x\nTexture2D\u003cR8\u003e.Load uniform: 0.746ms 21.929x\nTexture2D\u003cR8\u003e.Load linear: 16.173ms 1.012x\nTexture2D\u003cR8\u003e.Load random: 15.400ms 1.063x\nTexture2D\u003cRG8\u003e.Load uniform: 1.043ms 15.691x\nTexture2D\u003cRG8\u003e.Load linear: 15.421ms 1.061x\nTexture2D\u003cRG8\u003e.Load random: 15.400ms 1.063x\nTexture2D\u003cRGBA8\u003e.Load uniform: 1.876ms 8.725x\nTexture2D\u003cRGBA8\u003e.Load linear: 16.462ms 0.994x\nTexture2D\u003cRGBA8\u003e.Load random: 16.461ms 0.994x\nTexture2D\u003cR16F\u003e.Load uniform: 0.741ms 22.092x\nTexture2D\u003cR16F\u003e.Load linear: 16.253ms 1.007x\nTexture2D\u003cR16F\u003e.Load random: 16.222ms 
1.009x\nTexture2D\u003cRG16F\u003e.Load uniform: 1.053ms 15.546x\nTexture2D\u003cRG16F\u003e.Load linear: 15.440ms 1.060x\nTexture2D\u003cRG16F\u003e.Load random: 16.148ms 1.013x\nTexture2D\u003cRGBA16F\u003e.Load uniform: 1.890ms 8.659x\nTexture2D\u003cRGBA16F\u003e.Load linear: 16.125ms 1.015x\nTexture2D\u003cRGBA16F\u003e.Load random: 31.047ms 0.527x\nTexture2D\u003cR32F\u003e.Load uniform: 0.746ms 21.930x\nTexture2D\u003cR32F\u003e.Load linear: 16.403ms 0.998x\nTexture2D\u003cR32F\u003e.Load random: 16.638ms 0.983x\nTexture2D\u003cRG32F\u003e.Load uniform: 1.060ms 15.441x\nTexture2D\u003cRG32F\u003e.Load linear: 15.439ms 1.060x\nTexture2D\u003cRG32F\u003e.Load random: 31.903ms 0.513x\nTexture2D\u003cRGBA32F\u003e.Load uniform: 1.888ms 8.668x\nTexture2D\u003cRGBA32F\u003e.Load linear: 31.525ms 0.519x\nTexture2D\u003cRGBA32F\u003e.Load random: 32.783ms 0.499x\n```\n\n**NVIDIA Turing** results (ratios) of most common load/sample operations are identical Volta. Except wide raw buffer load performance is closer to Maxwell/Pascal. In Volta, Nvidia used one large 128KB shared L1$ (freely configurable between groupshared mem and L1$), while in Turing they have 96KB shared L1$ which can be configured only as 64/32 or 32/64. This benchmark seems to point out that this halves their L1$ bandwidth for raw loads.\n\n**Uniform address optimization:** Like Volta, the new uniform address optimization no longer affects StructuredBuffers. My educated guess is that StructuredBuffers (like raw buffers) now use the same lower latency direct memory path. Nvidia most likely hasn't yet implemented uniform address optimization for these new memory operations. 
Turing uniform address optimization performance however (in other cases) returns to similar 20x+ figures than Maxwell/Pascal.\n\n### NVidia Ampere (RTX 3090)\n```\nBuffer\u003cR8\u003e.Load uniform: 0.691ms 15.067x\nBuffer\u003cR8\u003e.Load linear: 6.324ms 1.647x\nBuffer\u003cR8\u003e.Load random: 7.773ms 1.340x\nBuffer\u003cRG8\u003e.Load uniform: 0.717ms 14.529x\nBuffer\u003cRG8\u003e.Load linear: 6.334ms 1.644x\nBuffer\u003cRG8\u003e.Load random: 7.843ms 1.328x\nBuffer\u003cRGBA8\u003e.Load uniform: 0.842ms 12.372x\nBuffer\u003cRGBA8\u003e.Load linear: 7.419ms 1.404x\nBuffer\u003cRGBA8\u003e.Load random: 10.414ms 1.000x\nBuffer\u003cR16f\u003e.Load uniform: 0.651ms 15.991x\nBuffer\u003cR16f\u003e.Load linear: 6.328ms 1.646x\nBuffer\u003cR16f\u003e.Load random: 6.824ms 1.526x\nBuffer\u003cRG16f\u003e.Load uniform: 0.722ms 14.426x\nBuffer\u003cRG16f\u003e.Load linear: 6.845ms 1.521x\nBuffer\u003cRG16f\u003e.Load random: 9.893ms 1.053x\nBuffer\u003cRGBA16f\u003e.Load uniform: 0.891ms 11.690x\nBuffer\u003cRGBA16f\u003e.Load linear: 7.490ms 1.390x\nBuffer\u003cRGBA16f\u003e.Load random: 7.536ms 1.382x\nBuffer\u003cR32f\u003e.Load uniform: 0.676ms 15.409x\nBuffer\u003cR32f\u003e.Load linear: 7.352ms 1.416x\nBuffer\u003cR32f\u003e.Load random: 9.929ms 1.049x\nBuffer\u003cRG32f\u003e.Load uniform: 0.767ms 13.578x\nBuffer\u003cRG32f\u003e.Load linear: 6.349ms 1.640x\nBuffer\u003cRG32f\u003e.Load random: 6.842ms 1.522x\nBuffer\u003cRGBA32f\u003e.Load uniform: 0.973ms 10.705x\nBuffer\u003cRGBA32f\u003e.Load linear: 14.504ms 0.718x\nBuffer\u003cRGBA32f\u003e.Load random: 13.037ms 0.799x\nByteAddressBuffer.Load uniform: 7.217ms 1.443x\nByteAddressBuffer.Load linear: 6.009ms 1.733x\nByteAddressBuffer.Load random: 5.433ms 1.917x\nByteAddressBuffer.Load2 uniform: 10.077ms 1.033x\nByteAddressBuffer.Load2 linear: 7.871ms 1.323x\nByteAddressBuffer.Load2 random: 7.259ms 1.435x\nByteAddressBuffer.Load3 uniform: 10.867ms 0.958x\nByteAddressBuffer.Load3 linear: 10.198ms 
1.021x\nByteAddressBuffer.Load3 random: 10.597ms 0.983x\nByteAddressBuffer.Load4 uniform: 12.582ms 0.828x\nByteAddressBuffer.Load4 linear: 15.811ms 0.659x\nByteAddressBuffer.Load4 random: 12.665ms 0.822x\nByteAddressBuffer.Load2 unaligned uniform: 9.054ms 1.150x\nByteAddressBuffer.Load2 unaligned linear: 7.347ms 1.417x\nByteAddressBuffer.Load2 unaligned random: 7.258ms 1.435x\nByteAddressBuffer.Load4 unaligned uniform: 12.581ms 0.828x\nByteAddressBuffer.Load4 unaligned linear: 15.790ms 0.660x\nByteAddressBuffer.Load4 unaligned random: 12.666ms 0.822x\nStructuredBuffer\u003cfloat\u003e.Load uniform: 5.889ms 1.768x\nStructuredBuffer\u003cfloat\u003e.Load linear: 4.689ms 2.221x\nStructuredBuffer\u003cfloat\u003e.Load random: 4.648ms 2.241x\nStructuredBuffer\u003cfloat2\u003e.Load uniform: 6.670ms 1.561x\nStructuredBuffer\u003cfloat2\u003e.Load linear: 6.513ms 1.599x\nStructuredBuffer\u003cfloat2\u003e.Load random: 5.817ms 1.790x\nStructuredBuffer\u003cfloat4\u003e.Load uniform: 7.168ms 1.453x\nStructuredBuffer\u003cfloat4\u003e.Load linear: 9.839ms 1.058x\nStructuredBuffer\u003cfloat4\u003e.Load random: 9.253ms 1.125x\ncbuffer{float4} load uniform: 1.126ms 9.245x\ncbuffer{float4} load linear: 280.222ms 0.037x\ncbuffer{float4} load random: 98.995ms 0.105x\nTexture2D\u003cR8\u003e.Load uniform: 0.676ms 15.409x\nTexture2D\u003cR8\u003e.Load linear: 6.335ms 1.644x\nTexture2D\u003cR8\u003e.Load random: 6.310ms 1.650x\nTexture2D\u003cRG8\u003e.Load uniform: 0.815ms 12.776x\nTexture2D\u003cRG8\u003e.Load linear: 6.338ms 1.643x\nTexture2D\u003cRG8\u003e.Load random: 6.324ms 1.647x\nTexture2D\u003cRGBA8\u003e.Load uniform: 0.973ms 10.705x\nTexture2D\u003cRGBA8\u003e.Load linear: 9.430ms 1.104x\nTexture2D\u003cRGBA8\u003e.Load random: 12.498ms 0.833x\nTexture2D\u003cR16F\u003e.Load uniform: 0.709ms 14.697x\nTexture2D\u003cR16F\u003e.Load linear: 6.337ms 1.644x\nTexture2D\u003cR16F\u003e.Load random: 6.314ms 1.649x\nTexture2D\u003cRG16F\u003e.Load uniform: 0.778ms 
13.382x\nTexture2D\u003cRG16F\u003e.Load linear: 9.417ms 1.106x\nTexture2D\u003cRG16F\u003e.Load random: 12.493ms 0.834x\nTexture2D\u003cRGBA16F\u003e.Load uniform: 1.024ms 10.170x\nTexture2D\u003cRGBA16F\u003e.Load linear: 17.148ms 0.607x\nTexture2D\u003cRGBA16F\u003e.Load random: 25.050ms 0.416x\nTexture2D\u003cR32F\u003e.Load uniform: 0.740ms 14.066x\nTexture2D\u003cR32F\u003e.Load linear: 9.774ms 1.065x\nTexture2D\u003cR32F\u003e.Load random: 12.493ms 0.834x\nTexture2D\u003cRG32F\u003e.Load uniform: 0.863ms 12.064x\nTexture2D\u003cRG32F\u003e.Load linear: 17.484ms 0.596x\nTexture2D\u003cRG32F\u003e.Load random: 25.180ms 0.414x\nTexture2D\u003cRGBA32F\u003e.Load uniform: 1.176ms 8.859x\nTexture2D\u003cRGBA32F\u003e.Load linear: 25.574ms 0.407x\nTexture2D\u003cRGBA32F\u003e.Load random: 25.952ms 0.401x\nTexture2D\u003cR8\u003e.Sample(nearest) uniform: 12.506ms 0.833x\nTexture2D\u003cR8\u003e.Sample(nearest) linear: 12.513ms 0.832x\nTexture2D\u003cR8\u003e.Sample(nearest) random: 13.423ms 0.776x\nTexture2D\u003cRG8\u003e.Sample(nearest) uniform: 12.867ms 0.809x\nTexture2D\u003cRG8\u003e.Sample(nearest) linear: 12.884ms 0.808x\nTexture2D\u003cRG8\u003e.Sample(nearest) random: 15.190ms 0.686x\nTexture2D\u003cRGBA8\u003e.Sample(nearest) uniform: 13.018ms 0.800x\nTexture2D\u003cRGBA8\u003e.Sample(nearest) linear: 12.530ms 0.831x\nTexture2D\u003cRGBA8\u003e.Sample(nearest) random: 13.568ms 0.768x\nTexture2D\u003cR16F\u003e.Sample(nearest) uniform: 13.230ms 0.787x\nTexture2D\u003cR16F\u003e.Sample(nearest) linear: 12.514ms 0.832x\nTexture2D\u003cR16F\u003e.Sample(nearest) random: 14.266ms 0.730x\nTexture2D\u003cRG16F\u003e.Sample(nearest) uniform: 13.395ms 0.777x\nTexture2D\u003cRG16F\u003e.Sample(nearest) linear: 13.051ms 0.798x\nTexture2D\u003cRG16F\u003e.Sample(nearest) random: 13.401ms 0.777x\nTexture2D\u003cRGBA16F\u003e.Sample(nearest) uniform: 13.421ms 0.776x\nTexture2D\u003cRGBA16F\u003e.Sample(nearest) linear: 12.902ms 
0.807x\nTexture2D\u003cRGBA16F\u003e.Sample(nearest) random: 26.066ms 0.400x\nTexture2D\u003cR32F\u003e.Sample(nearest) uniform: 12.870ms 0.809x\nTexture2D\u003cR32F\u003e.Sample(nearest) linear: 13.069ms 0.797x\nTexture2D\u003cR32F\u003e.Sample(nearest) random: 12.508ms 0.833x\nTexture2D\u003cRG32F\u003e.Sample(nearest) uniform: 13.945ms 0.747x\nTexture2D\u003cRG32F\u003e.Sample(nearest) linear: 12.524ms 0.832x\nTexture2D\u003cRG32F\u003e.Sample(nearest) random: 26.621ms 0.391x\nTexture2D\u003cRGBA32F\u003e.Sample(bilinear) uniform: 26.447ms 0.394x\nTexture2D\u003cRGBA32F\u003e.Sample(nearest) linear: 26.258ms 0.397x\nTexture2D\u003cRGBA32F\u003e.Sample(nearest) random: 25.757ms 0.404x\nTexture2D\u003cR8\u003e.Sample(bilinear) uniform: 12.508ms 0.833x\nTexture2D\u003cR8\u003e.Sample(bilinear) linear: 13.029ms 0.799x\nTexture2D\u003cR8\u003e.Sample(bilinear) random: 12.507ms 0.833x\nTexture2D\u003cRG8\u003e.Sample(bilinear) uniform: 12.510ms 0.832x\nTexture2D\u003cRG8\u003e.Sample(bilinear) linear: 13.034ms 0.799x\nTexture2D\u003cRG8\u003e.Sample(bilinear) random: 13.032ms 0.799x\nTexture2D\u003cRGBA8\u003e.Sample(bilinear) uniform: 12.514ms 0.832x\nTexture2D\u003cRGBA8\u003e.Sample(bilinear) linear: 13.060ms 0.797x\nTexture2D\u003cRGBA8\u003e.Sample(bilinear) random: 12.520ms 0.832x\nTexture2D\u003cR16F\u003e.Sample(bilinear) uniform: 12.507ms 0.833x\nTexture2D\u003cR16F\u003e.Sample(bilinear) linear: 12.514ms 0.832x\nTexture2D\u003cR16F\u003e.Sample(bilinear) random: 12.509ms 0.833x\nTexture2D\u003cRG16F\u003e.Sample(bilinear) uniform: 13.034ms 0.799x\nTexture2D\u003cRG16F\u003e.Sample(bilinear) linear: 12.516ms 0.832x\nTexture2D\u003cRG16F\u003e.Sample(bilinear) random: 12.522ms 0.832x\nTexture2D\u003cRGBA16F\u003e.Sample(bilinear) uniform: 12.508ms 0.833x\nTexture2D\u003cRGBA16F\u003e.Sample(bilinear) linear: 12.533ms 0.831x\nTexture2D\u003cRGBA16F\u003e.Sample(bilinear) random: 24.840ms 0.419x\nTexture2D\u003cR32F\u003e.Sample(bilinear) uniform: 12.507ms 
0.833x\nTexture2D\u003cR32F\u003e.Sample(bilinear) linear: 12.516ms 0.832x\nTexture2D\u003cR32F\u003e.Sample(bilinear) random: 12.510ms 0.832x\nTexture2D\u003cRG32F\u003e.Sample(bilinear) uniform: 12.510ms 0.832x\nTexture2D\u003cRG32F\u003e.Sample(bilinear) linear: 12.526ms 0.831x\nTexture2D\u003cRG32F\u003e.Sample(bilinear) random: 24.839ms 0.419x\nTexture2D\u003cRGBA32F\u003e.Sample(bilinear) uniform: 49.561ms 0.210x\nTexture2D\u003cRGBA32F\u003e.Sample(bilinear) linear: 49.592ms 0.210x\nTexture2D\u003cRGBA32F\u003e.Sample(bilinear) random: 74.230ms 0.140x\n```\n\n**NVIDIA Ampere** results (ratios) of most common load/sample operations look similar to Turing.\n\n**Sampler ratios (NEW!):** New tests for sampler ratios show that Ampere has half rate bilinear RG32F and quarter rate bilinear RGBA32F. Nearest filtering is full rate, except for RGBA32F which is half rate (similar to RGBA32F texture loads). In Turing and Ampere RGBA32/float4 buffer loads are full rate.\n\n### Intel Gen9 (HD 630 / i7 6700K)\n```markdown\nBuffer\u003cR8\u003e.Load uniform: 48.527ms 5.955x\nBuffer\u003cR8\u003e.Load linear: 243.487ms 1.187x\nBuffer\u003cR8\u003e.Load random: 286.351ms 1.009x\nBuffer\u003cRG8\u003e.Load uniform: 49.022ms 5.895x\nBuffer\u003cRG8\u003e.Load linear: 242.316ms 1.193x\nBuffer\u003cRG8\u003e.Load random: 288.927ms 1.000x\nBuffer\u003cRGBA8\u003e.Load uniform: 48.962ms 5.902x\nBuffer\u003cRGBA8\u003e.Load linear: 244.140ms 1.184x\nBuffer\u003cRGBA8\u003e.Load random: 288.981ms 1.000x\nBuffer\u003cR16f\u003e.Load uniform: 49.989ms 5.781x\nBuffer\u003cR16f\u003e.Load linear: 242.649ms 1.191x\nBuffer\u003cR16f\u003e.Load random: 287.790ms 1.004x\nBuffer\u003cRG16f\u003e.Load uniform: 48.921ms 5.907x\nBuffer\u003cRG16f\u003e.Load linear: 243.826ms 1.185x\nBuffer\u003cRG16f\u003e.Load random: 286.305ms 1.009x\nBuffer\u003cRGBA16f\u003e.Load uniform: 48.855ms 5.915x\nBuffer\u003cRGBA16f\u003e.Load linear: 242.278ms 1.193x\nBuffer\u003cRGBA16f\u003e.Load random: 
288.235ms 1.003x\nBuffer\u003cR32f\u003e.Load uniform: 49.272ms 5.865x\nBuffer\u003cR32f\u003e.Load linear: 241.286ms 1.198x\nBuffer\u003cR32f\u003e.Load random: 286.946ms 1.007x\nBuffer\u003cRG32f\u003e.Load uniform: 48.587ms 5.948x\nBuffer\u003cRG32f\u003e.Load linear: 242.442ms 1.192x\nBuffer\u003cRG32f\u003e.Load random: 287.429ms 1.005x\nBuffer\u003cRGBA32f\u003e.Load uniform: 48.562ms 5.951x\nBuffer\u003cRGBA32f\u003e.Load linear: 241.818ms 1.195x\nBuffer\u003cRGBA32f\u003e.Load random: 287.268ms 1.006x\nByteAddressBuffer.Load uniform: 15.647ms 18.469x\nByteAddressBuffer.Load linear: 49.962ms 5.784x\nByteAddressBuffer.Load random: 51.418ms 5.620x\nByteAddressBuffer.Load2 uniform: 13.941ms 20.728x\nByteAddressBuffer.Load2 linear: 93.546ms 3.089x\nByteAddressBuffer.Load2 random: 140.016ms 2.064x\nByteAddressBuffer.Load3 uniform: 19.754ms 14.629x\nByteAddressBuffer.Load3 linear: 168.581ms 1.714x\nByteAddressBuffer.Load3 random: 312.721ms 0.924x\nByteAddressBuffer.Load4 uniform: 13.932ms 20.743x\nByteAddressBuffer.Load4 linear: 175.224ms 1.649x\nByteAddressBuffer.Load4 random: 340.677ms 0.848x\nByteAddressBuffer.Load2 unaligned uniform: 15.152ms 19.072x\nByteAddressBuffer.Load2 unaligned linear: 99.901ms 2.893x\nByteAddressBuffer.Load2 unaligned random: 145.827ms 1.982x\nByteAddressBuffer.Load4 unaligned uniform: 16.249ms 17.784x\nByteAddressBuffer.Load4 unaligned linear: 199.205ms 1.451x\nByteAddressBuffer.Load4 unaligned random: 378.326ms 0.764x\nStructuredBuffer\u003cfloat\u003e.Load uniform: 14.309ms 20.195x\nStructuredBuffer\u003cfloat\u003e.Load linear: 50.181ms 5.759x\nStructuredBuffer\u003cfloat\u003e.Load random: 51.750ms 5.584x\nStructuredBuffer\u003cfloat2\u003e.Load uniform: 13.856ms 20.856x\nStructuredBuffer\u003cfloat2\u003e.Load linear: 94.388ms 3.062x\nStructuredBuffer\u003cfloat2\u003e.Load random: 141.301ms 2.045x\nStructuredBuffer\u003cfloat4\u003e.Load uniform: 13.493ms 21.417x\nStructuredBuffer\u003cfloat4\u003e.Load linear: 175.457ms 
1.647x\nStructuredBuffer\u003cfloat4\u003e.Load random: 340.806ms 0.848x\ncbuffer{float4} load uniform: 13.443ms 21.497x\ncbuffer{float4} load linear: 242.860ms 1.190x\ncbuffer{float4} load random: 285.850ms 1.011x\nTexture2D\u003cR8\u003e.Load uniform: 24.519ms 11.786x\nTexture2D\u003cR8\u003e.Load linear: 97.392ms 2.967x\nTexture2D\u003cR8\u003e.Load random: 97.824ms 2.954x\nTexture2D\u003cRG8\u003e.Load uniform: 24.376ms 11.855x\nTexture2D\u003cRG8\u003e.Load linear: 97.068ms 2.977x\nTexture2D\u003cRG8\u003e.Load random: 97.767ms 2.956x\nTexture2D\u003cRGBA8\u003e.Load uniform: 24.509ms 11.791x\nTexture2D\u003cRGBA8\u003e.Load linear: 101.171ms 2.856x\nTexture2D\u003cRGBA8\u003e.Load random: 101.069ms 2.859x\nTexture2D\u003cR16F\u003e.Load uniform: 24.874ms 11.618x\nTexture2D\u003cR16F\u003e.Load linear: 97.947ms 2.950x\nTexture2D\u003cR16F\u003e.Load random: 97.385ms 2.967x\nTexture2D\u003cRG16F\u003e.Load uniform: 24.324ms 11.881x\nTexture2D\u003cRG16F\u003e.Load linear: 98.257ms 2.941x\nTexture2D\u003cRG16F\u003e.Load random: 97.672ms 2.959x\nTexture2D\u003cRGBA16F\u003e.Load uniform: 24.408ms 11.840x\nTexture2D\u003cRGBA16F\u003e.Load linear: 101.515ms 2.847x\nTexture2D\u003cRGBA16F\u003e.Load random: 195.229ms 1.480x\nTexture2D\u003cR32F\u003e.Load uniform: 24.677ms 11.710x\nTexture2D\u003cR32F\u003e.Load linear: 97.829ms 2.954x\nTexture2D\u003cR32F\u003e.Load random: 97.614ms 2.960x\nTexture2D\u003cRG32F\u003e.Load uniform: 24.859ms 11.625x\nTexture2D\u003cRG32F\u003e.Load linear: 97.809ms 2.955x\nTexture2D\u003cRG32F\u003e.Load random: 194.397ms 1.487x\nTexture2D\u003cRGBA32F\u003e.Load uniform: 24.660ms 11.719x\nTexture2D\u003cRGBA32F\u003e.Load linear: 243.432ms 1.187x\nTexture2D\u003cRGBA32F\u003e.Load random: 195.579ms 1.478x\n \nNOTE: Intel result not directly comparable with other GPUs. I had to reduce workload size to avoid TDR.\n```\n\n**Typed loads:** All typed loads have same identical performance. 
Dimensions (1d/2d/4d) and channel widths (8b/16b/32b) don't affect performance. The Intel compiler has a fast path for uniform address loads, which improves performance by up to 6x. Linear typed loads do not coalesce. The best bytes-per-cycle rate is achieved with the widest RGBA32 loads.\n\n**Raw (ByteAddressBuffer) loads:** Intel raw buffer loads are significantly faster than comparable typed loads. A 1d raw load is 5x faster than any typed load. A 2d linear raw load is 2.5x faster than typed loads. A 4d linear raw load is 40% faster than typed loads. 2d/4d random raw loads are around 2x slower than linear ones (could be coalescing or something else). 3d raw load performance matches 4d. Alignment doesn't seem to matter. Uniform address raw loads also use the same compiler fast path as typed loads (6x gain).\n\n**Structured buffer loads:** Performance is identical to similar raw buffer loads.\n\n**Cbuffer loads:** 22x faster than a normal load for a fully uniform address. Linear/random access performance is identical to typed buffer loads. Raw/structured buffers are up to 2x faster if you have a linear/random access pattern.\n\n**Texture loads:** All formats perform similarly, except the widest RGBA32 (half speed linear). Uniform address texture loads are 4x faster than linear. There's certainly something fishy going on, as Texture2D loads are generally 2x+ faster than same-format buffer loads. Maybe I am hitting some bank conflict case, or Intel is swizzling the buffer layout.\n\n**Suggestions:** When using typed buffers, prefer the widest loads (RGBA32). Raw buffers are significantly faster than typed buffers. \n\n**Uniform address optimization:** Uniform address loads are very fast (both raw and typed). Intel has confirmed that their compiler uses a wave shuffle trick to speed up uniform loads inside loops. See the \"Uniform Address Load Investigation\" chapter for more info. 
\n\n\n### Intel Gen11 (Iris Plus / i7-1065g7)\n```markdown\nBuffer\u003cR8\u003e.Load uniform: 58.213ms 11.630x\nBuffer\u003cR8\u003e.Load linear: 591.619ms 1.144x\nBuffer\u003cR8\u003e.Load random: 699.948ms 0.967x\nBuffer\u003cRG8\u003e.Load uniform: 59.530ms 11.373x\nBuffer\u003cRG8\u003e.Load linear: 598.979ms 1.130x\nBuffer\u003cRG8\u003e.Load random: 678.129ms 0.998x\nBuffer\u003cRGBA8\u003e.Load uniform: 59.296ms 11.418x\nBuffer\u003cRGBA8\u003e.Load linear: 571.312ms 1.185x\nBuffer\u003cRGBA8\u003e.Load random: 677.040ms 1.000x\nBuffer\u003cR16f\u003e.Load uniform: 58.108ms 11.651x\nBuffer\u003cR16f\u003e.Load linear: 571.071ms 1.186x\nBuffer\u003cR16f\u003e.Load random: 677.930ms 0.999x\nBuffer\u003cRG16f\u003e.Load uniform: 58.052ms 11.663x\nBuffer\u003cRG16f\u003e.Load linear: 575.332ms 1.177x\nBuffer\u003cRG16f\u003e.Load random: 675.883ms 1.002x\nBuffer\u003cRGBA16f\u003e.Load uniform: 58.724ms 11.529x\nBuffer\u003cRGBA16f\u003e.Load linear: 571.145ms 1.185x\nBuffer\u003cRGBA16f\u003e.Load random: 676.597ms 1.001x\nBuffer\u003cR32f\u003e.Load uniform: 57.779ms 11.718x\nBuffer\u003cR32f\u003e.Load linear: 570.898ms 1.186x\nBuffer\u003cR32f\u003e.Load random: 676.160ms 1.001x\nBuffer\u003cRG32f\u003e.Load uniform: 57.770ms 11.720x\nBuffer\u003cRG32f\u003e.Load linear: 571.226ms 1.185x\nBuffer\u003cRG32f\u003e.Load random: 677.745ms 0.999x\nBuffer\u003cRGBA32f\u003e.Load uniform: 58.759ms 11.522x\nBuffer\u003cRGBA32f\u003e.Load linear: 571.372ms 1.185x\nBuffer\u003cRGBA32f\u003e.Load random: 676.695ms 1.001x\nByteAddressBuffer.Load uniform: 98.943ms 6.843x\nByteAddressBuffer.Load linear: 254.749ms 2.658x\nByteAddressBuffer.Load random: 378.516ms 1.789x\nByteAddressBuffer.Load2 uniform: 68.931ms 9.822x\nByteAddressBuffer.Load2 linear: 456.746ms 1.482x\nByteAddressBuffer.Load2 random: 762.950ms 0.887x\nByteAddressBuffer.Load3 uniform: 77.403ms 8.747x\nByteAddressBuffer.Load3 linear: 839.961ms 0.806x\nByteAddressBuffer.Load3 random: 1706.975ms 
0.397x\nByteAddressBuffer.Load4 uniform: 63.715ms 10.626x\nByteAddressBuffer.Load4 linear: 868.385ms 0.780x\nByteAddressBuffer.Load4 random: 1796.999ms 0.377x\nByteAddressBuffer.Load2 unaligned uniform: 78.052ms 8.674x\nByteAddressBuffer.Load2 unaligned linear: 487.732ms 1.388x\nByteAddressBuffer.Load2 unaligned random: 787.569ms 0.860x\nByteAddressBuffer.Load4 unaligned uniform: 79.889ms 8.475x\nByteAddressBuffer.Load4 unaligned linear: 995.681ms 0.680x\nByteAddressBuffer.Load4 unaligned random: 2015.342ms 0.336x\nStructuredBuffer\u003cfloat\u003e.Load uniform: 100.075ms 6.765x\nStructuredBuffer\u003cfloat\u003e.Load linear: 251.827ms 2.689x\nStructuredBuffer\u003cfloat\u003e.Load random: 366.612ms 1.847x\nStructuredBuffer\u003cfloat2\u003e.Load uniform: 69.021ms 9.809x\nStructuredBuffer\u003cfloat2\u003e.Load linear: 447.962ms 1.511x\nStructuredBuffer\u003cfloat2\u003e.Load random: 741.070ms 0.914x\nStructuredBuffer\u003cfloat4\u003e.Load uniform: 62.209ms 10.883x\nStructuredBuffer\u003cfloat4\u003e.Load linear: 868.643ms 0.779x\nStructuredBuffer\u003cfloat4\u003e.Load random: 1798.563ms 0.376x\ncbuffer{float4} load uniform: 63.908ms 10.594x\ncbuffer{float4} load linear: 859.170ms 0.788x\ncbuffer{float4} load random: 1815.643ms 0.373x\nTexture2D\u003cR8\u003e.Load uniform: 57.693ms 11.735x\nTexture2D\u003cR8\u003e.Load linear: 229.955ms 2.944x\nTexture2D\u003cR8\u003e.Load random: 230.291ms 2.940x\nTexture2D\u003cRG8\u003e.Load uniform: 57.835ms 11.706x\nTexture2D\u003cRG8\u003e.Load linear: 230.142ms 2.942x\nTexture2D\u003cRG8\u003e.Load random: 230.195ms 2.941x\nTexture2D\u003cRGBA8\u003e.Load uniform: 58.916ms 11.492x\nTexture2D\u003cRGBA8\u003e.Load linear: 230.623ms 2.936x\nTexture2D\u003cRGBA8\u003e.Load random: 230.788ms 2.934x\nTexture2D\u003cR16F\u003e.Load uniform: 60.521ms 11.187x\nTexture2D\u003cR16F\u003e.Load linear: 229.671ms 2.948x\nTexture2D\u003cR16F\u003e.Load random: 229.764ms 2.947x\nTexture2D\u003cRG16F\u003e.Load uniform: 57.673ms 
11.739x\nTexture2D\u003cRG16F\u003e.Load linear: 230.141ms 2.942x\nTexture2D\u003cRG16F\u003e.Load random: 230.311ms 2.940x\nTexture2D\u003cRGBA16F\u003e.Load uniform: 58.287ms 11.616x\nTexture2D\u003cRGBA16F\u003e.Load linear: 230.076ms 2.943x\nTexture2D\u003cRGBA16F\u003e.Load random: 459.294ms 1.474x\nTexture2D\u003cR32F\u003e.Load uniform: 57.614ms 11.751x\nTexture2D\u003cR32F\u003e.Load linear: 234.894ms 2.882x\nTexture2D\u003cR32F\u003e.Load random: 229.711ms 2.947x\nTexture2D\u003cRG32F\u003e.Load uniform: 57.674ms 11.739x\nTexture2D\u003cRG32F\u003e.Load linear: 230.058ms 2.943x\nTexture2D\u003cRG32F\u003e.Load random: 459.470ms 1.474x\nTexture2D\u003cRGBA32F\u003e.Load uniform: 58.323ms 11.608x\nTexture2D\u003cRGBA32F\u003e.Load linear: 573.734ms 1.180x\nTexture2D\u003cRGBA32F\u003e.Load random: 459.263ms 1.474x\nTexture2D\u003cR8\u003e.Sample(nearest) uniform: 229.959ms 2.944x\nTexture2D\u003cR8\u003e.Sample(nearest) linear: 229.704ms 2.947x\nTexture2D\u003cR8\u003e.Sample(nearest) random: 232.381ms 2.913x\nTexture2D\u003cRG8\u003e.Sample(nearest) uniform: 230.455ms 2.938x\nTexture2D\u003cRG8\u003e.Sample(nearest) linear: 230.214ms 2.941x\nTexture2D\u003cRG8\u003e.Sample(nearest) random: 230.276ms 2.940x\nTexture2D\u003cRGBA8\u003e.Sample(nearest) uniform: 231.480ms 2.925x\nTexture2D\u003cRGBA8\u003e.Sample(nearest) linear: 231.062ms 2.930x\nTexture2D\u003cRGBA8\u003e.Sample(nearest) random: 240.466ms 2.816x\nTexture2D\u003cR16F\u003e.Sample(nearest) uniform: 229.897ms 2.945x\nTexture2D\u003cR16F\u003e.Sample(nearest) linear: 231.593ms 2.923x\nTexture2D\u003cR16F\u003e.Sample(nearest) random: 229.666ms 2.948x\nTexture2D\u003cRG16F\u003e.Sample(nearest) uniform: 230.391ms 2.939x\nTexture2D\u003cRG16F\u003e.Sample(nearest) linear: 230.027ms 2.943x\nTexture2D\u003cRG16F\u003e.Sample(nearest) random: 230.098ms 2.942x\nTexture2D\u003cRGBA16F\u003e.Sample(nearest) uniform: 230.577ms 2.936x\nTexture2D\u003cRGBA16F\u003e.Sample(nearest) linear: 230.086ms 
2.943x\nTexture2D\u003cRGBA16F\u003e.Sample(nearest) random: 459.828ms 1.472x\nTexture2D\u003cR32F\u003e.Sample(nearest) uniform: 229.827ms 2.946x\nTexture2D\u003cR32F\u003e.Sample(nearest) linear: 231.692ms 2.922x\nTexture2D\u003cR32F\u003e.Sample(nearest) random: 229.751ms 2.947x\nTexture2D\u003cRG32F\u003e.Sample(nearest) uniform: 230.528ms 2.937x\nTexture2D\u003cRG32F\u003e.Sample(nearest) linear: 230.021ms 2.943x\nTexture2D\u003cRG32F\u003e.Sample(nearest) random: 460.311ms 1.471x\nTexture2D\u003cRGBA32F\u003e.Sample(bilinear) uniform: 230.903ms 2.932x\nTexture2D\u003cRGBA32F\u003e.Sample(nearest) linear: 573.964ms 1.180x\nTexture2D\u003cRGBA32F\u003e.Sample(nearest) random: 460.377ms 1.471x\nTexture2D\u003cR8\u003e.Sample(bilinear) uniform: 229.860ms 2.945x\nTexture2D\u003cR8\u003e.Sample(bilinear) linear: 229.663ms 2.948x\nTexture2D\u003cR8\u003e.Sample(bilinear) random: 229.689ms 2.948x\nTexture2D\u003cRG8\u003e.Sample(bilinear) uniform: 230.469ms 2.938x\nTexture2D\u003cRG8\u003e.Sample(bilinear) linear: 230.000ms 2.944x\nTexture2D\u003cRG8\u003e.Sample(bilinear) random: 230.095ms 2.942x\nTexture2D\u003cRGBA8\u003e.Sample(bilinear) uniform: 230.668ms 2.935x\nTexture2D\u003cRGBA8\u003e.Sample(bilinear) linear: 230.157ms 2.942x\nTexture2D\u003cRGBA8\u003e.Sample(bilinear) random: 240.543ms 2.815x\nTexture2D\u003cR16F\u003e.Sample(bilinear) uniform: 229.871ms 2.945x\nTexture2D\u003cR16F\u003e.Sample(bilinear) linear: 229.663ms 2.948x\nTexture2D\u003cR16F\u003e.Sample(bilinear) random: 229.777ms 2.947x\nTexture2D\u003cRG16F\u003e.Sample(bilinear) uniform: 230.454ms 2.938x\nTexture2D\u003cRG16F\u003e.Sample(bilinear) linear: 230.009ms 2.944x\nTexture2D\u003cRG16F\u003e.Sample(bilinear) random: 234.252ms 2.890x\nTexture2D\u003cRGBA16F\u003e.Sample(bilinear) uniform: 230.580ms 2.936x\nTexture2D\u003cRGBA16F\u003e.Sample(bilinear) linear: 344.679ms 1.964x\nTexture2D\u003cRGBA16F\u003e.Sample(bilinear) random: 460.104ms 
1.471x\nTexture2D\u003cR32F\u003e.Sample(bilinear) uniform: 230.189ms 2.941x\nTexture2D\u003cR32F\u003e.Sample(bilinear) linear: 229.679ms 2.948x\nTexture2D\u003cR32F\u003e.Sample(bilinear) random: 229.726ms 2.947x\nTexture2D\u003cRG32F\u003e.Sample(bilinear) uniform: 460.443ms 1.470x\nTexture2D\u003cRG32F\u003e.Sample(bilinear) linear: 459.809ms 1.472x\nTexture2D\u003cRG32F\u003e.Sample(bilinear) random: 689.533ms 0.982x\nTexture2D\u003cRGBA32F\u003e.Sample(bilinear) uniform: 919.711ms 0.736x\nTexture2D\u003cRGBA32F\u003e.Sample(bilinear) linear: 918.780ms 0.737x\nTexture2D\u003cRGBA32F\u003e.Sample(bilinear) random: 919.250ms 0.737x\n```\n\n**Intel Gen11** results (ratios) of most common load/sample operations look similar to Gen 9, except raw loads are no longer 5x faster, instead they are 2.5x faster. The uniform address speedup seems to be around 2x higher. Maybe SIMD16 mode used instead of SIMD8?\n\n**Sampler ratios (NEW!):** New tests for sampler ratios show that Gen11 has half rate bilinear RG32F and quarter rate bilinear RGBA32F. Nearest filtering is full rate, except for RGBA32F which is half rate (similar to RGBA32F texture loads). \n\n## Contact\n\nSend private message to @SebAaltonen at Twitter. We can discuss via company emails later.\n\n## License\n\nPerfTest is released under the MIT license. See [LICENSE.md](LICENSE.md) for full text.\n","funding_links":[],"categories":["Game-BenchMark/Metric/Tool","C++"],"sub_categories":["Google Analytics"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsebbbi%2Fperftest","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsebbbi%2Fperftest","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsebbbi%2Fperftest/lists"}