GPU Reshape Beta 2 Release

The new AMD|Waterfall feature is one of the main additions to the Beta 2 release. The feature is two-fold: an AMD-specific analysis part that highlights potentially performance-intensive code sections, and a general part that validates whether the NonUniformResourceIndex intrinsic is used where required.

What does waterfall mean in this context? GPU Reshape gives the following explanation:

[Figure: waterfall feature with code]

As indicated, waterfall loops can occur if registers cannot be dynamically indexed. It’s a way to scalarize a non-uniform variable. A non-uniform variable – also called a varying – is stored in a wave-wide vector register (VGPR) where every thread has its own potentially unique value.

|               | thread 0 | thread 1 | thread 2 | thread 3 |
|---------------|----------|----------|----------|----------|
| VGPR variable | 4        | 10       | 4        | 6        |

If a variable is uniform, every thread in a wave holds the same value.

|               | thread 0 | thread 1 | thread 2 | thread 3 |
|---------------|----------|----------|----------|----------|
| VGPR variable | 4        | 4        | 4        | 4        |

As every thread holds the same value, it’s a bit wasteful to use a VGPR to hold a copy of the same value for every thread. It’s enough to just store it once. Modern AMD GPUs take advantage of this: each workgroup processor (WGP), the part of the architecture that executes shader programs, has a separate scalar ALU with its own scalar registers (SGPRs).

|               | wave (all threads) |
|---------------|--------------------|
| SGPR variable | 4                  |

Scalarizing a uniform variable that is stored in a VGPR is straightforward – we can just pick the value of any thread and put it in an SGPR, e.g. via the HLSL wave operation WaveReadLaneFirst.
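As a minimal sketch of that straightforward case in HLSL (Shader Model 6.0 or later for the wave intrinsics): the buffer names and the one-entry-per-32-threads layout are assumptions made for illustration, as is the wave size of 32. Under those assumptions every thread of a wave loads the same value, and WaveReadLaneFirst gives the compiler a result it knows to be uniform.

```hlsl
// Hypothetical resources, for illustration only.
StructuredBuffer<uint>   perWaveValue : register(t0); // one entry per 32 consecutive threads
RWStructuredBuffer<uint> output       : register(u0);

[numthreads(32, 1, 1)]
void CSMain(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    // Assumption: the wave size is 32, so all threads of a wave read the same entry.
    uint value = perWaveValue[dispatchThreadID.x / 32];

    // Take the value of the first active thread. The result is uniform
    // across the wave, so the compiler is free to keep it in an SGPR.
    uint scalarValue = WaveReadLaneFirst(value);

    output[dispatchThreadID.x] = scalarValue;
}
```

Whether the value actually ends up in an SGPR is up to the compiler. Note that this is only correct if the value really is uniform; otherwise all threads silently adopt the first thread's value.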

However, the more interesting case is if a varying variable needs to be scalarized. Sometimes the compiler needs to do this as certain hardware operations require thread-level uniformity, but the higher level shading language allows the programmer to write their code with non-uniformity. Examples are indexing into resource arrays, indexing into a descriptor heap if we use a bindless rendering model, or dynamic indexing of any array of scalar values, which will be backed by VGPRs.

When scalarizing a non-uniform variable, we can’t just pick a random thread and take its value as the other threads might have different values. A solution to this problem is called waterfalling. There are other solutions our compiler can use, which we won’t go into in detail in this blog post, but all of them incur a cost. A waterfall loop essentially loops over the non-uniform variable as many times as there are unique values. In pseudo-code, it would be something like this:

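    // Each iteration handles all threads that share one unique value.
    while (activeThreads)
    {
        // Value of the first active thread, broadcast to the whole wave (uniform).
        scalar = WaveReadLaneFirst(varyingVariable);
        if (varyingVariable == scalar) {
            doOperationThatRequiresScalar(scalar);
            break; // deactivate this thread
        }
    }
    // after the while loop, all threads are active again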

In the above example, if you had 4 threads with 3 unique values, the while loop would have 3 iterations.

The more per-thread unique values a non-uniform variable holds, the more iterations are required with the worst case being the total number of active threads in the wave. This can be quite costly. GPU Reshape’s AMD|Waterfall feature is intended to identify cases where waterfalling might happen, so that the developer can inspect the shader and determine if a waterfall loop has an impact on performance or not. If it has a performance impact, for example because the number of unique values of the non-uniform variable within a wave is typically very high, it might be a good idea to try techniques that do not require a waterfall loop or that reduce the number of required loop iterations.
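To get a feeling for how many iterations a waterfall loop over a given variable would need, the same loop structure can be used to count the unique values in a wave. The following is a sketch under stated assumptions, not something GPU Reshape provides: the function name, the buffer names and the thread group size are hypothetical.

```hlsl
// Hypothetical resources, for illustration only.
StructuredBuffer<uint>   indices          : register(t0); // per-thread, potentially non-uniform index
RWStructuredBuffer<uint> uniqueValueCount : register(u0); // debug output, one entry per thread

// Returns how many distinct values 'value' holds across the active threads of the wave.
uint CountUniqueValuesInWave(uint value)
{
    uint iterations = 0;
    for (;;)
    {
        ++iterations;
        // Value of the first thread that is still in the loop.
        uint first = WaveReadLaneFirst(value);
        if (value == first)
        {
            break; // all threads holding this value leave the loop together
        }
    }
    // Threads left the loop after different iteration counts;
    // the thread that left last saw every unique value.
    return WaveActiveMax(iterations);
}

[numthreads(32, 1, 1)]
void CSMain(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    uint index = indices[dispatchThreadID.x];
    uniqueValueCount[dispatchThreadID.x] = CountUniqueValuesInWave(index);
}
```

For the four-thread example with the values 4, 10, 4 and 6, the loop runs three times, so the function would be expected to return 3. Consistently high counts for a flagged variable suggest that the waterfall loop is worth a closer look.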

Let’s look at one concrete example that involves dynamic indexing into an array.

Arrays can be used to implement a stack.

    uint stack[32];
    uint currentStackIndex;

Each thread has its own stack, which it pushes nodes to and pops nodes from. The content and current index of the stack can be different per thread, hence the current stack index is a non-uniform variable.

Since the stack size is 32 in this example, the maximum number of iterations for the waterfall loop is 32. Depending on the use case, a smaller stack size might be sufficient, but obviously you can’t go down to 0.
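A small sketch of how such a per-thread stack might be used is shown here. The workload (pushing a thread-dependent number of entries and summing them up again) and the output buffer are hypothetical and only serve to make the stack index diverge between threads.

```hlsl
#define STACK_SIZE 32

// Hypothetical output buffer, for illustration only.
RWStructuredBuffer<uint> result : register(u0);

[numthreads(32, 1, 1)]
void CSMain(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    uint stack[STACK_SIZE];     // per-thread array, typically backed by VGPRs
    uint currentStackIndex = 0; // non-uniform: differs between threads

    // Push a different number of entries per thread (1 to 4).
    uint count = (dispatchThreadID.x % 4) + 1;
    for (uint i = 0; i < count; ++i)
    {
        stack[currentStackIndex++] = i; // dynamically indexed write
    }

    // Pop everything again and accumulate.
    uint sum = 0;
    while (currentStackIndex > 0)
    {
        sum += stack[--currentStackIndex]; // dynamically indexed read
    }

    result[dispatchThreadID.x] = sum;
}
```

The two dynamically indexed accesses are the places where a waterfall loop may be generated. How the compiler actually lowers them depends on the compiler and the target hardware.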

An alternative to a stack based on a VGPR-backed array is a stack based on an array in groupshared memory, which AMD hardware implements in an on-chip memory shared by the threads of a thread group, called the Local Data Share (LDS).

    groupshared uint lds_stack[32*32];

Note that we assume a fixed wave size of 32 in this example.
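Push and pop on such an LDS stack could be sketched as follows. The helper names, the interleaved layout and the test workload are assumptions for illustration; the sketch also assumes one wave of 32 threads per thread group, so that SV_GroupIndex can serve as the thread's slot within the array.

```hlsl
#define WAVE_SIZE  32
#define STACK_SIZE 32

groupshared uint lds_stack[STACK_SIZE * WAVE_SIZE];

// Hypothetical output buffer, for illustration only.
RWStructuredBuffer<uint> result : register(u0);

// Interleaved layout: entry 'stackIndex' of thread 'lane' lives at
// stackIndex * WAVE_SIZE + lane, so each thread only touches its own slots.
void StackPush(uint lane, inout uint stackIndex, uint value)
{
    lds_stack[stackIndex * WAVE_SIZE + lane] = value;
    ++stackIndex;
}

uint StackPop(uint lane, inout uint stackIndex)
{
    --stackIndex;
    return lds_stack[stackIndex * WAVE_SIZE + lane];
}

[numthreads(WAVE_SIZE, 1, 1)]
void CSMain(uint groupIndex : SV_GroupIndex, uint3 dispatchThreadID : SV_DispatchThreadID)
{
    uint stackIndex = 0; // still non-uniform, but LDS is addressed per thread

    // Push a different number of entries per thread (1 to 4).
    uint count = (dispatchThreadID.x % 4) + 1;
    for (uint i = 0; i < count; ++i)
    {
        StackPush(groupIndex, stackIndex, i);
    }

    uint sum = 0;
    while (stackIndex > 0)
    {
        sum += StackPop(groupIndex, stackIndex);
    }

    result[dispatchThreadID.x] = sum;
}
```

No barrier is needed here because threads never read each other's entries. With 32 entries for each of 32 threads at 4 bytes per entry, the array occupies 32 × 32 × 4 = 4096 bytes of LDS per thread group.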

Using an LDS stack can be a bit counter-intuitive at first, since LDS access is slower than VGPR access. Also, threads do not need to exchange their stack entries with each other, so no efficient data transfer between threads is needed, which is what you would typically use LDS for.

However, dynamic access of an LDS array does not lead to a waterfall loop because AMD GPUs can address LDS individually per thread in the wave. Therefore, it’s a trade-off between increased latency for reads and writes to the stack and the latency introduced by a waterfall loop.

There is also an additional factor to consider in this example: a VGPR-backed stack might need to allocate a lot of VGPRs and thus limit the maximum occupancy. Offloading some of the VGPR pressure to LDS might help to improve occupancy, which in turn can improve performance. This goes the other way too: a large LDS-backed stack can also reduce the maximum occupancy, so it might make sense to offload some of it to a VGPR-backed stack.

Which backing store for the stack is the recommended choice? As usual, the answer is “it depends”. It could make sense to implement both stacks, or a combination of both, and measure the performance characteristics to determine which solution provides the best result. It’s a balance between occupancy limits, the latency of reads and writes to the stack, and the overhead introduced by a VGPR-backed stack due to waterfalling.

GPU Reshape’s AMD|Waterfall feature is intended to identify these cases so that developers can experiment and find the best possible solution for their use-case.

[Figure: waterfall feature, code only]

NonUniformResourceIndex Validation

In addition to identifying code sections that may lead to a waterfall loop, GPU Reshape’s AMD|Waterfall feature also validates the correct use of the NonUniformResourceIndex intrinsic. Per the HLSL Dynamic Resources specification, an index is assumed to be uniform when it is used to index into an array of resources or, in the case of a bindless renderer, into a descriptor heap. If the index is non-uniform, it has to be annotated with the NonUniformResourceIndex intrinsic. Failing to do so can lead to undefined behavior on some hardware.

    StructuredBuffer<PerInstanceData> instanceData : register(t0, space0);
    Texture2D materialTextures[] : register(t0, space1);
    
    float4 PSmain(PSInput input, uint instanceID : SV_InstanceID) : SV_TARGET 
    { 
        ...
        // instance ID can be non-uniform in a wave
        uint materialID = instanceData[instanceID].materialID;
        // we need to add the NonUniformResourceIndex flag to the actual resource indexing
        float4 value = materialTextures[NonUniformResourceIndex(materialID)].Sample(s_LinearClamp, input.UV);
        ...
    }

Note: This code snippet is specification-compliant.

GPU Reshape analyzes the index that was used to access the resource at runtime, on live shader code. This means that GPU Reshape is able to detect if a wave accesses the array with a uniform index or with a non-uniform index. If the index was non-uniform, but the NonUniformResourceIndex intrinsic is missing, GPU Reshape will report an error:

[Figure: divergent resource addressing]
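Conceptually, checking at runtime whether an index is uniform across a wave can be expressed with a single wave intrinsic. The sketch below illustrates the idea only; it is not GPU Reshape's instrumentation, and the function and buffer names are hypothetical.

```hlsl
// Hypothetical resources, for illustration only.
StructuredBuffer<uint>   materialIDs    : register(t0); // per-thread index into a resource array
RWStructuredBuffer<uint> divergentWaves : register(u0); // debug counter, cleared by the application

// Returns true if every active thread in the wave passes the same index.
bool IsIndexUniformInWave(uint index)
{
    return WaveActiveAllEqual(index);
}

[numthreads(32, 1, 1)]
void CSMain(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    uint materialID = materialIDs[dispatchThreadID.x];

    // The result is the same for all threads, so this branch is itself uniform.
    if (!IsIndexUniformInWave(materialID))
    {
        // Count each divergent wave once.
        if (WaveIsFirstLane())
        {
            InterlockedAdd(divergentWaves[0], 1);
        }
    }
}
```

If the counter is non-zero after the dispatch, at least one wave used a non-uniform index, and any resource array access using that index needs NonUniformResourceIndex.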

If you want to read more about non-uniform resource access, check out this blog post on GPUOpen: Porting Detroit: Become Human from PlayStation® 4 to PC – Part 2

If you want to know more about occupancy, I recommend reading this blog post: Occupancy Explained

Initialization and concurrency instrumentation have undergone major improvements in this release, now validating reads and writes at a per-texel level for textures and a per-byte level for buffers. This greatly improves accuracy and avoids false positives with concurrent usage of a single resource across multiple dispatches.

Missing initialization and race conditions will now report the exact coordinate that faulted. For example, if a compute shader clearing a full-screen texture does not clear the last row (e.g., due to faulty dispatching logic), only subsequent reads of that row will fault.

[Figure: texel addressing IL]

Previously, initialization and concurrency validation were tracked at a “per-resource” level, in which, for example, a resource was marked as initialized on the first texel/byte write. This meant that missing writes to specific regions went undetected.

Additionally, if concurrency is not tracked at the smallest granularity the hardware may write to, there is no reliable way to detect race conditions.
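The difference between the two granularities can be illustrated with a hand-written sketch that keeps one bit per texel. This is not how GPU Reshape implements texel addressing; all resource names, the constant buffer layout and the two entry points are hypothetical, and the application is assumed to clear the mask to zero and to place a barrier between the two dispatches.

```hlsl
// Hypothetical resources, for illustration only.
RWTexture2D<float4>      target     : register(u0);
RWStructuredBuffer<uint> texelMask  : register(u1); // one bit per texel, cleared to 0 beforehand
RWStructuredBuffer<uint> faultCount : register(u2); // number of reads that hit an unwritten texel

cbuffer Constants : register(b0)
{
    uint2 textureSize; // width and height in texels
};

uint TexelIndex(uint2 coord)
{
    return coord.y * textureSize.x + coord.x;
}

void MarkTexelWritten(uint2 coord)
{
    uint texel = TexelIndex(coord);
    InterlockedOr(texelMask[texel / 32], 1u << (texel % 32));
}

bool IsTexelWritten(uint2 coord)
{
    uint texel = TexelIndex(coord);
    return (texelMask[texel / 32] & (1u << (texel % 32))) != 0;
}

// First dispatch: clears the texture and records every texel it writes.
[numthreads(8, 8, 1)]
void ClearCS(uint3 id : SV_DispatchThreadID)
{
    if (any(id.xy >= textureSize)) return;
    target[id.xy] = float4(0, 0, 0, 0);
    MarkTexelWritten(id.xy);
}

// Later dispatch: before reading a texel, check that it was written.
[numthreads(8, 8, 1)]
void ValidateReadCS(uint3 id : SV_DispatchThreadID)
{
    if (any(id.xy >= textureSize)) return;
    if (!IsTexelWritten(id.xy))
    {
        InterlockedAdd(faultCount[0], 1);
    }
}
```

If the clear dispatch launches one row of thread groups too few, only the texels of the missing rows are counted as faults. A single per-resource flag set on the first write would report the whole texture as initialized.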

Texel addressing imposes a large memory and runtime performance overhead, and will significantly increase instrumentation times (for initialization and concurrency validation). To fall back to per-resource tracking, disable “Texel addressing” in the launch window or application settings (under Features).

[Figure: texel addressing launch]

Please note that per-resource addressing is less accurate, and may report false positives on concurrent usage across multiple dispatches.

The loop feature is no longer experimental with this release, and introduces new safeguards to catch loops that could cause a TDR (timeout detection and recovery). Instrumentation now injects both atomic checks, signaled from the host (CPU), and in-shader iteration limits.

Any loop that either exceeds the iteration limit, or is signaled from the host, will break and report the faulting line of code.

[Figure: loop source]

The latter, the in-shader iteration limit, is fully configurable and accepts any maximum number of iterations. Applications may need to manually tune this number for their particular workload.

[Figure: loop configuration]
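The two safeguards described above, an iteration limit and a host-side termination signal, can be pictured with a hand-written guard around a loop that might not terminate. This is only a sketch of the idea, not the code GPU Reshape injects; the linked-list workload, the resource names and the limit of 1024 are hypothetical. It also assumes the host can write a value that the running shader observes, which depends on the platform and memory type.

```hlsl
// Hypothetical resources, for illustration only.
StructuredBuffer<uint>                   nextNode      : register(t0); // nextNode[i] is the successor of node i
globallycoherent RWStructuredBuffer<uint> hostTerminate : register(u0); // set to non-zero by the host to abort
RWStructuredBuffer<uint>                 faultFlags    : register(u1); // 1 if the thread's loop was broken
RWStructuredBuffer<uint>                 listLength    : register(u2);

static const uint kEndOfList     = 0xFFFFFFFFu;
static const uint kMaxIterations = 1024; // hypothetical limit, tune per workload

[numthreads(64, 1, 1)]
void CSMain(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    uint node       = dispatchThreadID.x;
    uint length     = 0;
    uint iterations = 0;

    // A cycle in the list would make this loop run forever without the guard.
    while (node != kEndOfList)
    {
        if (++iterations > kMaxIterations || hostTerminate[0] != 0)
        {
            faultFlags[dispatchThreadID.x] = 1; // report, then leave the loop
            break;
        }

        node = nextNode[node];
        ++length;
    }

    listLength[dispatchThreadID.x] = length;
}
```

A limit that is too low breaks legitimate long-running loops, which is why the limit needs to be tuned per workload.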

Please note that side effects in erroneous loop iterations, below the iteration limit, are not guarded beyond the existing feature set. For example, if the loop iterator is used to index a constant buffer array, the device may be lost regardless of loop instrumentation. Future features will try to address this.

Initialization feature for placed resources

The initialization feature received another upgrade in this release. It now verifies whether placed resources have been initialized correctly. The caveat with placed resources is that, by specification, a write access within a shader is not a valid initialization if the resource is a render target or a depth stencil target. (See also: info on Microsoft’s site).

It’s important to correctly initialize placed resources even if you are going to overwrite all of their pixels as a render target or UAV, because the metadata could contain garbage data, and visual artifacts could survive such an overwrite. This is especially relevant for DCC-compressed (Delta Color Compression) resources. Failing to initialize a resource correctly can lead to undefined behavior, such as corruption or GPU hangs.

[Figure: uninitialized placed resource]

Multi device/process

GPU Reshape now supports hooking multiple devices and entire process trees. Previously, launches would only connect to the first reporting device. With “Capture All Devices” checked, GPU Reshape now hooks and creates a workspace for every device upon creation. It also supports hooking entire process trees, meaning the target process and all of its child processes, if “Attach Child Processes” is enabled.

[Figure: launch multi device]

The latter is particularly helpful for applications such as WebGPU (e.g., Chrome) and bootstrapped launches.

[Figure: launch workspaces]
