GPU texture/buffer performance tester
https://github.com/sebbbi/perftest
# PerfTest
A simple GPU shader memory operation performance test tool. Current implementation is DirectX 11.0 based.
The purpose of this application is not to benchmark different brands of GPUs against each other. Its purpose is to help rendering programmers choose the right types of resources when optimizing their compute shader performance.
This application is designed to measure peak data load performance from L1 caches. I tried to avoid known hardware bottlenecks. **If you notice something wrong or suspicious in the shader workload, please inform me immediately and I will fix it.** If my shaders are affected by some hardware bottleneck, I am glad to hear about it and will write more test cases to show the best performance. The goal is to give developers a better understanding of the various GPU hardware on the market and insight into optimizing code for it.
## Features
Designed to measure the performance of various types of buffer and image loads. This application is not a GPU memory bandwidth measurement tool. All tests operate inside the GPU's L1 caches (working sets no larger than 16 KB).
- Coalesced loads (100% L1 cache hit)
- Random loads (100% L1 cache hit)
- Uniform address loads (same address for all threads)
- Typed Buffer SRVs: 1/2/4 channels, 8/16/32 bits per channel
- ByteAddressBuffer SRVs: load, load2, load3, load4 - aligned and unaligned
- Structured Buffer SRVs: float/float2/float4
- Constant Buffer float4 array indexed loads
- Texture2D loads: 1/2/4 channels, 8/16/32 bits per channel
- Texture2D nearest sampling: 1/2/4 channels, 8/16/32 bits per channel
- Texture2D bilinear sampling: 1/2/4 channels, 8/16/32 bits per channel

## Explanations
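The three access patterns compared in this section can be sketched CPU-side. This is an illustrative model, not the tool's actual HLSL; the function names and the 1024-element working set are assumptions for the example:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative CPU-side sketch of the three address patterns, for one
// 256-thread group over a small working set that stays inside a 16 KB L1.
constexpr uint32_t kElements = 1024;

// Linear: thread i reads element i -> contiguous addresses across the
// group, so the loads coalesce on all GPUs.
uint32_t linearAddress(uint32_t thread) { return thread; }

// Random: a per-thread start offset of 0-15 elements breaks coalescing,
// but every access still lands inside the same cached working set.
uint32_t randomAddress(uint32_t thread, uint32_t offset) {
    return (thread + (offset & 15u)) % kElements;
}

// Uniform: every thread reads the same element (scalar-load candidate).
uint32_t uniformAddress(uint32_t /*thread*/) { return 0; }
```

The point of the sketch is that all three patterns touch the same small working set; only the address distribution across threads differs.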
**Coalesced loads:**
GPUs optimize for linear address patterns. Coalescing occurs when all threads in a warp/wave (32/64 threads) load from contiguous addresses. In my "linear" test case, memory loads access contiguous addresses across the whole thread group (256 threads). This should coalesce perfectly on all GPUs, independent of warp/wave width.

**Random loads:**
I add a random start offset of 0-15 elements to each thread's address (accesses stay aligned). This prevents GPU coalescing and gives a more realistic view of performance for the common (non-linear) memory access case. This benchmark is exactly as cache efficient as the previous one: all data still comes from the L1 cache.

**Uniform loads:**
All threads in the group simultaneously load from the same address. This triggers the coalesced path on some GPUs and additional optimizations on others, such as scalar loads (SGPR storage) on AMD GCN. I have noticed that recent Intel and Nvidia drivers also implement a software optimization for the uniform load loop case (which is employed by this benchmark).

**Notes:**
**Compiler optimizations** can ruin the results. We want to measure only load (read) performance, but a write (store) is also needed, otherwise the compiler would optimize the whole shader away. To avoid this, each thread first does 256 loads, followed by a single linear groupshared memory write (no bank conflicts). The cbuffer contains a write mask (not known at compile time) that controls which elements are written from groupshared memory to the output buffer. The mask is always zero at runtime. Compilers can also combine multiple narrow raw buffer loads into bigger 4d loads if it can be proven at compile time that loads from the same thread access contiguous offsets. This is prevented by applying an address mask from the cbuffer (not known at compile time).

## Uniform Load Investigation
When I first implemented this benchmark, I noticed that Intel uniform address loads were surprisingly fast. Intel's ISA documents don't mention a scalar unit or any other hardware feature that would make uniform address loads fast. This optimization affected every single resource type, unlike AMD's hardware scalar unit (which only works for raw data loads). I didn't investigate this further at that point. When Nvidia released Volta GPUs, they shipped a new driver that implemented a similar compiler optimization. Later drivers introduced the same optimization to Maxwell and Pascal too, and now Turing has it as well. It's certainly not hardware based, since the 20x+ gains apply to all their existing GPUs too.

On the Nov 10-11 weekend (2018) I was toying around with Vulkan/DX12 wave intrinsics, and came up with a crazy idea: use a single wave-wide load and then use wave intrinsics to broadcast the scalar result (single lane) to each loop iteration. This reduces the number of loads by up to a factor of the wave width.
See the gist and Shader Playground links here:
https://gist.github.com/sebbbi/ba4415339b535d22fb18e2d824564ec4

In Nvidia's uniform load optimization case, their wave width is 32, and their uniform load optimization's performance boost is up to 28x. This finding really made me curious: could Nvidia be implementing a similar warp shuffle based optimization for this use case? The funny thing is that my tweets escalated the situation, and made Intel reveal their hand:
https://twitter.com/JoshuaBarczak/status/1062060067334189056
Intel has now officially revealed that their driver does a wave shuffle optimization for uniform address loads. They have been doing it for years already. This explains Intel GPU benchmark results perfectly. Now that we have confirmation of Intel's (original) optimization, I suspect that Nvidia's shader compiler employs a highly similar optimization in this case. Both optimizations are great, because Nvidia/Intel do not have a dedicated scalar unit. They need to lean more on vector loads, and this trick allows sharing one vector load with multiple uniform address load loop iterations.
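The wave-shuffle trick described above can be modeled on the CPU. This is an illustrative sketch, not the actual HLSL gist; the function name, the fixed wave width of 32, and the plain-array stand-in for wave lanes are all assumptions of the example:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// CPU model of the wave-intrinsic optimization: a loop of kWaveWidth
// uniform loads is replaced by ONE wide load, in which lane i fetches
// the address of loop iteration i. Each iteration then broadcasts that
// lane's value to the whole wave (WaveReadLaneAt-style), so 32 loop
// iterations share a single memory transaction.
constexpr uint32_t kWaveWidth = 32;

std::array<float, kWaveWidth> uniformLoadLoop(const float* buffer,
                                              uint32_t baseAddress) {
    // One "vector load": lane i loads buffer[baseAddress + i].
    std::array<float, kWaveWidth> laneData{};
    for (uint32_t lane = 0; lane < kWaveWidth; ++lane)
        laneData[lane] = buffer[baseAddress + lane];

    // Iteration i broadcasts lane i's value; on a GPU this is a wave
    // shuffle, not another memory access.
    std::array<float, kWaveWidth> results{};
    for (uint32_t i = 0; i < kWaveWidth; ++i)
        results[i] = laneData[i];   // models WaveReadLaneAt(laneData, i)
    return results;
}
```

This illustrates why the measured gain approaches the wave width (up to 28x on Nvidia, whose waves are 32 wide): the loads themselves shrink by a factor of 32, and only the cheap broadcasts remain per iteration.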
## Results
All results are compared to the ```Buffer.Load random``` result (= 1.0x) on the same GPU.

### AMD GCN2 (R9 390X)
```markdown
Buffer.Load uniform: 11.302ms 3.907x
Buffer.Load linear: 11.327ms 3.899x
Buffer.Load random: 44.150ms 1.000x
Buffer.Load uniform: 49.611ms 0.890x
Buffer.Load linear: 49.835ms 0.886x
Buffer.Load random: 49.615ms 0.890x
Buffer.Load uniform: 44.149ms 1.000x
Buffer.Load linear: 44.806ms 0.986x
Buffer.Load random: 44.164ms 1.000x
Buffer.Load uniform: 11.131ms 3.968x
Buffer.Load linear: 11.139ms 3.965x
Buffer.Load random: 44.076ms 1.002x
Buffer.Load uniform: 49.552ms 0.891x
Buffer.Load linear: 49.560ms 0.891x
Buffer.Load random: 49.559ms 0.891x
Buffer.Load uniform: 44.066ms 1.002x
Buffer.Load linear: 44.687ms 0.988x
Buffer.Load random: 44.066ms 1.002x
Buffer.Load uniform: 11.132ms 3.967x
Buffer.Load linear: 11.139ms 3.965x
Buffer.Load random: 44.071ms 1.002x
Buffer.Load uniform: 49.558ms 0.891x
Buffer.Load linear: 49.560ms 0.891x
Buffer.Load random: 49.559ms 0.891x
Buffer.Load uniform: 44.061ms 1.002x
Buffer.Load linear: 44.613ms 0.990x
Buffer.Load random: 49.583ms 0.891x
ByteAddressBuffer.Load uniform: 10.322ms 4.278x
ByteAddressBuffer.Load linear: 11.546ms 3.825x
ByteAddressBuffer.Load random: 44.153ms 1.000x
ByteAddressBuffer.Load2 uniform: 11.499ms 3.841x
ByteAddressBuffer.Load2 linear: 49.628ms 0.890x
ByteAddressBuffer.Load2 random: 49.651ms 0.889x
ByteAddressBuffer.Load3 uniform: 16.985ms 2.600x
ByteAddressBuffer.Load3 linear: 44.142ms 1.000x
ByteAddressBuffer.Load3 random: 88.176ms 0.501x
ByteAddressBuffer.Load4 uniform: 22.472ms 1.965x
ByteAddressBuffer.Load4 linear: 44.212ms 0.999x
ByteAddressBuffer.Load4 random: 49.346ms 0.895x
ByteAddressBuffer.Load2 unaligned uniform: 11.422ms 3.867x
ByteAddressBuffer.Load2 unaligned linear: 49.552ms 0.891x
ByteAddressBuffer.Load2 unaligned random: 49.561ms 0.891x
ByteAddressBuffer.Load4 unaligned uniform: 22.373ms 1.974x
ByteAddressBuffer.Load4 unaligned linear: 44.095ms 1.002x
ByteAddressBuffer.Load4 unaligned random: 54.464ms 0.811x
StructuredBuffer.Load uniform: 12.585ms 3.509x
StructuredBuffer.Load linear: 11.770ms 3.752x
StructuredBuffer.Load random: 44.176ms 1.000x
StructuredBuffer.Load uniform: 13.210ms 3.343x
StructuredBuffer.Load linear: 50.217ms 0.879x
StructuredBuffer.Load random: 49.645ms 0.890x
StructuredBuffer.Load uniform: 13.818ms 3.196x
StructuredBuffer.Load random: 49.666ms 0.889x
StructuredBuffer.Load linear: 44.721ms 0.988x
cbuffer{float4} load uniform: 16.702ms 2.644x
cbuffer{float4} load linear: 44.447ms 0.994x
cbuffer{float4} load random: 49.656ms 0.889x
Texture2D.Load uniform: 44.214ms 0.999x
Texture2D.Load linear: 44.795ms 0.986x
Texture2D.Load random: 44.808ms 0.986x
Texture2D.Load uniform: 49.706ms 0.888x
Texture2D.Load linear: 50.231ms 0.879x
Texture2D.Load random: 50.200ms 0.880x
Texture2D.Load uniform: 44.760ms 0.987x
Texture2D.Load linear: 45.339ms 0.974x
Texture2D.Load random: 45.405ms 0.973x
Texture2D.Load uniform: 44.175ms 1.000x
Texture2D.Load linear: 44.157ms 1.000x
Texture2D.Load random: 44.096ms 1.002x
Texture2D.Load uniform: 49.739ms 0.888x
Texture2D.Load linear: 49.661ms 0.889x
Texture2D.Load random: 49.622ms 0.890x
Texture2D.Load uniform: 44.257ms 0.998x
Texture2D.Load linear: 44.267ms 0.998x
Texture2D.Load random: 88.126ms 0.501x
Texture2D.Load uniform: 44.259ms 0.998x
Texture2D.Load linear: 44.193ms 0.999x
Texture2D.Load random: 44.099ms 1.001x
Texture2D.Load uniform: 49.739ms 0.888x
Texture2D.Load linear: 49.667ms 0.889x
Texture2D.Load random: 88.110ms 0.501x
Texture2D.Load uniform: 44.288ms 0.997x
Texture2D.Load linear: 66.145ms 0.668x
Texture2D.Load random: 88.124ms 0.501x
```
**AMD GCN2** was a very popular architecture. The first card using this architecture was the Radeon 7790. Many Radeon 200 and 300 series cards also use it. Both the Xbox One and PS4 (base model) GPUs are based on the GCN2 architecture, making it a very important optimization target.

**Typed loads:** GCN coalesces linear typed loads, but only 1d loads (R8, R16F, R32F). Coalesced load performance is 4x. Both the linear access pattern (all threads in a wave load subsequent addresses) and uniform access (all threads in a wave load the same address) coalesce perfectly. Typed loads of every dimension (1d/2d/4d) and channel width (8b/16b/32b) perform identically. The best bytes/cycle rate can be achieved either by coalesced R32 loads (when the access pattern suits this) or always with RGBA32 loads.
**Raw (ByteAddressBuffer) loads:** Similar to typed loads. 1d raw loads coalesce perfectly (4x) on linear access. Uniform address raw loads generate scalar unit loads on GCN. Scalar loads use a separate cache and are stored in the separate SGPR register file -> reduced register & cache pressure, and no stress on the vector load path. A scalar 1d load is 4x faster than a normal 1d load. A scalar 2d load is 4x faster than a normal 2d load. A scalar 4d load is 2x faster than a normal 4d load. Unaligned (alignment=4) loads have performance equal to aligned (alignment=8/16) ones. 3d raw linear loads have performance equal to 4d loads, but random 3d loads are slightly slower.
**Texture loads:** Similar performance to typed buffer loads. However, there is no coalescing of 1d linear access and no scalar unit offload of uniform access. Random access of wide formats tends to be slightly slower (but my 2d random test produces a different access pattern than the 1d one).
**Structured buffer loads:** Performance is identical to similar width raw buffer loads.
**Cbuffer loads:** The AMD GCN architecture doesn't have special constant buffer hardware. Constant buffer load performance is identical to raw and structured buffers. Prefer uniform addresses to allow the compiler to generate scalar loads, which are around 4x faster, have much lower latency, and don't waste VGPRs.
**Suggestions:** Prefer wide fat 4d loads instead of multiple narrow loads. If you have perfectly linear memory access pattern, 1d coalesced loads are also fast. ByteAddressBuffers (raw loads) have good performance: Full speed 128 bit 4d loads, 4x rate 1d loads (linear access), and the compiler offloads uniform address loads to scalar unit, saving VGPR pressure and vector memory instructions.
These results match AMD's wide loads & coalescing documents; see: http://gpuopen.com/gcn-memory-coalescing/. I would be glad if AMD released a public document describing all the scalar load optimization cases supported by their compiler.
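The "prefer wide loads" suggestion above can be illustrated with a CPU-side model of raw buffer loads. This is a sketch, not the tool's HLSL; `load`/`load4` are hypothetical stand-ins for `ByteAddressBuffer.Load`/`Load4`:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Models ByteAddressBuffer.Load: one 32-bit value at a byte address.
// On GCN, four of these at random addresses cost four full vector
// memory transactions.
float load(const uint8_t* buffer, uint32_t byteAddress) {
    float v;
    std::memcpy(&v, buffer + byteAddress, sizeof v);
    return v;
}

struct Float4 { float x, y, z, w; };

// Models ByteAddressBuffer.Load4: four contiguous 32-bit values in a
// single 128-bit transaction, which GCN executes at full rate.
Float4 load4(const uint8_t* buffer, uint32_t byteAddress) {
    Float4 v;
    std::memcpy(&v, buffer + byteAddress, sizeof v);
    return v;
}
```

So where the data layout allows it, one `load4(buf, addr)` should replace `load(buf, addr)`, `load(buf, addr + 4)`, `load(buf, addr + 8)`, `load(buf, addr + 12)`; the results above show the single wide load delivering roughly 4x the bytes per transaction.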
### AMD GCN3 (R9 Fury 56 CU)
```markdown
Buffer.Load uniform: 8.963ms 3.911x
Buffer.Load linear: 8.917ms 3.931x
Buffer.Load random: 35.058ms 1.000x
Buffer.Load uniform: 39.416ms 0.889x
Buffer.Load linear: 39.447ms 0.889x
Buffer.Load random: 39.413ms 0.889x
Buffer.Load uniform: 35.051ms 1.000x
Buffer.Load linear: 35.048ms 1.000x
Buffer.Load random: 35.051ms 1.000x
Buffer.Load uniform: 8.898ms 3.939x
Buffer.Load linear: 8.909ms 3.934x
Buffer.Load random: 35.050ms 1.000x
Buffer.Load uniform: 39.405ms 0.890x
Buffer.Load linear: 39.435ms 0.889x
Buffer.Load random: 39.407ms 0.889x
Buffer.Load uniform: 35.041ms 1.000x
Buffer.Load linear: 35.043ms 1.000x
Buffer.Load random: 35.046ms 1.000x
Buffer.Load uniform: 8.897ms 3.940x
Buffer.Load linear: 8.910ms 3.934x
Buffer.Load random: 35.048ms 1.000x
Buffer.Load uniform: 39.407ms 0.889x
Buffer.Load linear: 39.433ms 0.889x
Buffer.Load random: 39.406ms 0.889x
Buffer.Load uniform: 35.043ms 1.000x
Buffer.Load linear: 35.045ms 1.000x
Buffer.Load random: 39.405ms 0.890x
ByteAddressBuffer.Load uniform: 10.956ms 3.199x
ByteAddressBuffer.Load linear: 9.100ms 3.852x
ByteAddressBuffer.Load random: 35.038ms 1.000x
ByteAddressBuffer.Load2 uniform: 11.070ms 3.166x
ByteAddressBuffer.Load2 linear: 39.413ms 0.889x
ByteAddressBuffer.Load2 random: 39.411ms 0.889x
ByteAddressBuffer.Load3 uniform: 13.534ms 2.590x
ByteAddressBuffer.Load3 linear: 35.047ms 1.000x
ByteAddressBuffer.Load3 random: 70.033ms 0.500x
ByteAddressBuffer.Load4 uniform: 17.944ms 1.953x
ByteAddressBuffer.Load4 linear: 35.072ms 0.999x
ByteAddressBuffer.Load4 random: 39.149ms 0.895x
ByteAddressBuffer.Load2 unaligned uniform: 11.209ms 3.127x
ByteAddressBuffer.Load2 unaligned linear: 39.408ms 0.889x
ByteAddressBuffer.Load2 unaligned random: 39.406ms 0.890x
ByteAddressBuffer.Load4 unaligned uniform: 17.933ms 1.955x
ByteAddressBuffer.Load4 unaligned linear: 35.066ms 1.000x
ByteAddressBuffer.Load4 unaligned random: 43.241ms 0.811x
StructuredBuffer.Load uniform: 12.653ms 2.770x
StructuredBuffer.Load linear: 8.913ms 3.932x
StructuredBuffer.Load random: 35.059ms 1.000x
StructuredBuffer.Load uniform: 12.799ms 2.739x
StructuredBuffer.Load linear: 39.445ms 0.889x
StructuredBuffer.Load random: 39.413ms 0.889x
StructuredBuffer.Load uniform: 12.834ms 2.731x
StructuredBuffer.Load linear: 35.049ms 1.000x
StructuredBuffer.Load random: 39.411ms 0.889x
cbuffer{float4} load uniform: 14.861ms 2.359x
cbuffer{float4} load linear: 35.534ms 0.986x
cbuffer{float4} load random: 39.412ms 0.889x
Texture2D.Load uniform: 35.063ms 1.000x
Texture2D.Load linear: 35.038ms 1.000x
Texture2D.Load random: 35.040ms 1.000x
Texture2D.Load uniform: 39.430ms 0.889x
Texture2D.Load linear: 39.436ms 0.889x
Texture2D.Load random: 39.436ms 0.889x
Texture2D.Load uniform: 35.059ms 1.000x
Texture2D.Load linear: 35.061ms 1.000x
Texture2D.Load random: 35.055ms 1.000x
Texture2D.Load uniform: 35.056ms 1.000x
Texture2D.Load linear: 35.038ms 1.000x
Texture2D.Load random: 35.040ms 1.000x
Texture2D.Load uniform: 39.431ms 0.889x
Texture2D.Load linear: 39.440ms 0.889x
Texture2D.Load random: 39.436ms 0.889x
Texture2D.Load uniform: 35.054ms 1.000x
Texture2D.Load linear: 35.061ms 1.000x
Texture2D.Load random: 70.037ms 0.500x
Texture2D.Load uniform: 35.055ms 1.000x
Texture2D.Load linear: 35.041ms 1.000x
Texture2D.Load random: 35.041ms 1.000x
Texture2D.Load uniform: 39.433ms 0.889x
Texture2D.Load linear: 39.439ms 0.889x
Texture2D.Load random: 70.039ms 0.500x
Texture2D.Load uniform: 35.054ms 1.000x
Texture2D.Load linear: 52.549ms 0.667x
Texture2D.Load random: 70.037ms 0.500x
```
**AMD GCN3** results (ratios) are identical to GCN2's. See GCN2 for analysis. Clock and CU scaling reveal that there are no bandwidth or issue-rate related changes in the texture/L1$ architecture between the different GCN revisions.

### AMD GCN4 (RX 480)
```markdown
Buffer.Load uniform: 11.008ms 3.900x
Buffer.Load linear: 11.187ms 3.838x
Buffer.Load random: 42.906ms 1.001x
Buffer.Load uniform: 48.280ms 0.889x
Buffer.Load linear: 48.685ms 0.882x
Buffer.Load random: 48.246ms 0.890x
Buffer.Load uniform: 42.911ms 1.001x
Buffer.Load linear: 43.733ms 0.982x
Buffer.Load random: 42.934ms 1.000x
Buffer.Load uniform: 10.852ms 3.956x
Buffer.Load linear: 10.840ms 3.961x
Buffer.Load random: 42.820ms 1.003x
Buffer.Load uniform: 48.153ms 0.892x
Buffer.Load linear: 48.161ms 0.891x
Buffer.Load random: 48.161ms 0.891x
Buffer.Load uniform: 42.832ms 1.002x
Buffer.Load linear: 42.900ms 1.001x
Buffer.Load random: 42.844ms 1.002x
Buffer.Load uniform: 10.852ms 3.956x
Buffer.Load linear: 10.841ms 3.960x
Buffer.Load random: 42.816ms 1.003x
Buffer.Load uniform: 48.158ms 0.892x
Buffer.Load linear: 48.161ms 0.891x
Buffer.Load random: 48.161ms 0.891x
Buffer.Load uniform: 42.827ms 1.002x
Buffer.Load linear: 42.913ms 1.000x
Buffer.Load random: 48.176ms 0.891x
ByteAddressBuffer.Load uniform: 13.403ms 3.203x
ByteAddressBuffer.Load linear: 11.118ms 3.862x
ByteAddressBuffer.Load random: 42.911ms 1.001x
ByteAddressBuffer.Load2 uniform: 13.503ms 3.180x
ByteAddressBuffer.Load2 linear: 48.235ms 0.890x
ByteAddressBuffer.Load2 random: 48.242ms 0.890x
ByteAddressBuffer.Load3 uniform: 16.646ms 2.579x
ByteAddressBuffer.Load3 linear: 42.913ms 1.001x
ByteAddressBuffer.Load3 random: 85.682ms 0.501x
ByteAddressBuffer.Load4 uniform: 21.836ms 1.966x
ByteAddressBuffer.Load4 linear: 42.929ms 1.000x
ByteAddressBuffer.Load4 random: 47.936ms 0.896x
ByteAddressBuffer.Load2 unaligned uniform: 13.454ms 3.191x
ByteAddressBuffer.Load2 unaligned linear: 48.150ms 0.892x
ByteAddressBuffer.Load2 unaligned random: 48.163ms 0.891x
ByteAddressBuffer.Load4 unaligned uniform: 21.765ms 1.973x
ByteAddressBuffer.Load4 unaligned linear: 42.853ms 1.002x
ByteAddressBuffer.Load4 unaligned random: 52.866ms 0.812x
StructuredBuffer.Load uniform: 15.513ms 2.768x
StructuredBuffer.Load linear: 10.895ms 3.941x
StructuredBuffer.Load random: 42.885ms 1.001x
StructuredBuffer.Load uniform: 15.695ms 2.736x
StructuredBuffer.Load linear: 48.231ms 0.890x
StructuredBuffer.Load random: 48.217ms 0.890x
StructuredBuffer.Load uniform: 15.810ms 2.716x
StructuredBuffer.Load linear: 42.907ms 1.001x
StructuredBuffer.Load random: 48.224ms 0.890x
cbuffer{float4} load uniform: 17.249ms 2.489x
cbuffer{float4} load linear: 43.054ms 0.997x
cbuffer{float4} load random: 48.214ms 0.890x
Texture2D.Load uniform: 42.889ms 1.001x
Texture2D.Load linear: 42.877ms 1.001x
Texture2D.Load random: 42.889ms 1.001x
Texture2D.Load uniform: 48.252ms 0.890x
Texture2D.Load linear: 48.254ms 0.890x
Texture2D.Load random: 48.254ms 0.890x
Texture2D.Load uniform: 42.939ms 1.000x
Texture2D.Load linear: 42.969ms 0.999x
Texture2D.Load random: 42.945ms 1.000x
Texture2D.Load uniform: 42.891ms 1.001x
Texture2D.Load linear: 42.915ms 1.000x
Texture2D.Load random: 42.866ms 1.002x
Texture2D.Load uniform: 48.234ms 0.890x
Texture2D.Load linear: 48.365ms 0.888x
Texture2D.Load random: 48.220ms 0.890x
Texture2D.Load uniform: 42.911ms 1.001x
Texture2D.Load linear: 42.943ms 1.000x
Texture2D.Load random: 85.655ms 0.501x
Texture2D.Load uniform: 42.896ms 1.001x
Texture2D.Load linear: 42.910ms 1.001x
Texture2D.Load random: 42.871ms 1.001x
Texture2D.Load uniform: 48.239ms 0.890x
Texture2D.Load linear: 48.367ms 0.888x
Texture2D.Load random: 85.634ms 0.501x
Texture2D.Load uniform: 42.927ms 1.000x
Texture2D.Load linear: 64.284ms 0.668x
Texture2D.Load random: 85.638ms 0.501x
```
**AMD GCN4** results (ratios) are identical to GCN2/3's. See GCN2 for analysis. Clock and CU scaling reveal that there are no bandwidth or issue-rate related changes in the texture/L1$ architecture between the different GCN revisions.

### AMD GCN5 (Vega Frontier Edition)
```markdown
Buffer.Load uniform: 6.024ms 3.693x
Buffer.Load linear: 5.798ms 3.838x
Buffer.Load random: 21.411ms 1.039x
Buffer.Load uniform: 21.648ms 1.028x
Buffer.Load linear: 21.108ms 1.054x
Buffer.Load random: 21.721ms 1.024x
Buffer.Load uniform: 22.315ms 0.997x
Buffer.Load linear: 22.055ms 1.009x
Buffer.Load random: 22.251ms 1.000x
Buffer.Load uniform: 6.421ms 3.465x
Buffer.Load linear: 6.119ms 3.636x
Buffer.Load random: 21.534ms 1.033x
Buffer.Load uniform: 21.010ms 1.059x
Buffer.Load linear: 20.785ms 1.071x
Buffer.Load random: 20.903ms 1.064x
Buffer.Load uniform: 21.083ms 1.055x
Buffer.Load linear: 22.849ms 0.974x
Buffer.Load random: 22.189ms 1.003x
Buffer.Load uniform: 6.374ms 3.491x
Buffer.Load linear: 6.265ms 3.552x
Buffer.Load random: 21.892ms 1.016x
Buffer.Load uniform: 21.918ms 1.015x
Buffer.Load linear: 21.081ms 1.056x
Buffer.Load random: 22.866ms 0.973x
Buffer.Load uniform: 22.022ms 1.010x
Buffer.Load linear: 22.025ms 1.010x
Buffer.Load random: 24.889ms 0.894x
ByteAddressBuffer.Load uniform: 5.187ms 4.289x
ByteAddressBuffer.Load linear: 6.682ms 3.330x
ByteAddressBuffer.Load random: 22.153ms 1.004x
ByteAddressBuffer.Load2 uniform: 5.907ms 3.767x
ByteAddressBuffer.Load2 linear: 21.541ms 1.033x
ByteAddressBuffer.Load2 random: 22.435ms 0.992x
ByteAddressBuffer.Load3 uniform: 8.896ms 2.501x
ByteAddressBuffer.Load3 linear: 22.019ms 1.011x
ByteAddressBuffer.Load3 random: 43.438ms 0.512x
ByteAddressBuffer.Load4 uniform: 10.671ms 2.085x
ByteAddressBuffer.Load4 linear: 20.912ms 1.064x
ByteAddressBuffer.Load4 random: 23.508ms 0.947x
ByteAddressBuffer.Load2 unaligned uniform: 6.080ms 3.660x
ByteAddressBuffer.Load2 unaligned linear: 21.813ms 1.020x
ByteAddressBuffer.Load2 unaligned random: 22.436ms 0.992x
ByteAddressBuffer.Load4 unaligned uniform: 11.457ms 1.942x
ByteAddressBuffer.Load4 unaligned linear: 21.817ms 1.020x
ByteAddressBuffer.Load4 unaligned random: 27.530ms 0.808x
StructuredBuffer.Load uniform: 6.384ms 3.486x
StructuredBuffer.Load linear: 6.314ms 3.524x
StructuredBuffer.Load random: 21.424ms 1.039x
StructuredBuffer.Load uniform: 6.257ms 3.556x
StructuredBuffer.Load linear: 20.940ms 1.063x
StructuredBuffer.Load random: 23.044ms 0.966x
StructuredBuffer.Load uniform: 6.620ms 3.361x
StructuredBuffer.Load linear: 21.771ms 1.022x
StructuredBuffer.Load random: 25.229ms 0.882x
cbuffer{float4} load uniform: 8.011ms 2.778x
cbuffer{float4} load linear: 22.951ms 0.969x
cbuffer{float4} load random: 24.806ms 0.897x
Texture2D.Load uniform: 22.585ms 0.985x
Texture2D.Load linear: 21.733ms 1.024x
Texture2D.Load random: 21.371ms 1.041x
Texture2D.Load uniform: 20.774ms 1.071x
Texture2D.Load linear: 20.806ms 1.069x
Texture2D.Load random: 22.936ms 0.970x
Texture2D.Load uniform: 22.022ms 1.010x
Texture2D.Load linear: 21.644ms 1.028x
Texture2D.Load random: 22.586ms 0.985x
Texture2D.Load uniform: 22.620ms 0.984x
Texture2D.Load linear: 22.730ms 0.979x
Texture2D.Load random: 21.356ms 1.042x
Texture2D.Load uniform: 20.722ms 1.074x
Texture2D.Load linear: 20.723ms 1.074x
Texture2D.Load random: 21.893ms 1.016x
Texture2D.Load uniform: 22.287ms 0.998x
Texture2D.Load linear: 22.116ms 1.006x
Texture2D.Load random: 42.739ms 0.521x
Texture2D.Load uniform: 21.325ms 1.043x
Texture2D.Load linear: 21.370ms 1.041x
Texture2D.Load random: 21.393ms 1.040x
Texture2D.Load uniform: 20.747ms 1.072x
Texture2D.Load linear: 20.754ms 1.072x
Texture2D.Load random: 41.415ms 0.537x
Texture2D.Load uniform: 20.551ms 1.083x
Texture2D.Load linear: 31.748ms 0.701x
Texture2D.Load random: 42.097ms 0.529x
```
**AMD GCN5** results (ratios) are identical to GCN2/3/4's. See GCN2 for analysis. Clock and CU scaling reveal that there are no bandwidth or issue-rate related changes in the texture/L1$ architecture between the different GCN revisions.

### AMD GCN5 7nm (Radeon VII)
```markdown
Buffer.Load uniform: 5.214ms 3.667x
Buffer.Load linear: 5.332ms 3.586x
Buffer