Objective difference between register and pointer in AVX instructions

Pixel

New Member
#1
Scenario: You are writing a complex algorithm using SIMD. A handful of constants and/or infrequently changing values are used. Ultimately, the algorithm needs more than the 16 available ymm registers, so some operands end up as memory references (e.g. the generated code contains vaddps ymm0,ymm1,ymmword ptr [...] instead of vaddps ymm0,ymm1,ymm7).

In order to make the algorithm fit into the available registers, the constants can be "inlined". For example:
Code:
const auto pi256{ _mm256_set1_ps(PI) };
for (outer condition)
{
    ...
    const auto radius_squared{ _mm256_mul_ps(radius, radius) };
    ...
    for (inner condition)
    {
        ...
        const auto area{ _mm256_mul_ps(radius_squared, pi256) };
        ...
    }
}
... becomes ...
Code:
for (outer condition)
{
    ...
    for (inner condition)
    {
        ...
        const auto area{ _mm256_mul_ps(_mm256_mul_ps(radius, radius), _mm256_set1_ps(PI)) };
        ...
    }
}
Whether the temporary in question is a constant, or is calculated infrequently (in the outer loop), how can one determine which approach achieves the best throughput? Is it a matter of some concept like "ptr adds 2 extra latency"? Or is it nondeterministic, differing case by case, so that it can only be fully optimized through trial and error plus profiling?
 

Admin

Administrator
Staff member
#2
A good optimizing compiler should generate the same machine code for both versions. Just define your vector constants as locals, or use them anonymously for maximum readability; let the compiler worry about register allocation and pick the least expensive way to deal with running out of registers if that happens.

Your best bet for helping the compiler is to use fewer distinct constants where possible. E.g. instead of _mm_and_si128 with both set1_epi16(0x00FF) and 0xFF00, use _mm_andnot_si128 to mask the other way with the same constant. You usually can't do anything to influence which things it chooses to keep in registers vs. not, but fortunately compilers are pretty good at this because it's also essential for scalar code.

A compiler will hoist constants out of the loop (even inlining a helper function containing constants), or if only used in one side of a branch, bring the setup into that side of the branch.

The source code computes exactly the same thing with no difference in visible side-effects, so the as-if rule allows the compiler the freedom to do this.

I think compilers normally do register allocation and choose what to spill/reload (or just use a read-only vector constant) after doing CSE (common subexpression elimination) and identifying loop invariants and constants that can be hoisted.

When it finds it doesn't have enough registers to keep all variables and constants in regs inside the loop, the first choice for something to not keep in a register would normally be a loop-invariant vector, either a compile-time constant or something computed before the loop.

An extra load that hits in L1d cache is cheaper than storing (aka spilling) / reloading a variable inside the loop. Thus, compilers will choose to load constants from memory regardless of where you put the definition in the source code.

Part of the point of writing in C++ is that you have a compiler to make this decision for you. Since it's allowed to do the same thing for both sources, doing different things would be a missed-optimization for at least one of the cases. (The best thing to do in any particular case depends on surrounding code, but normally using vector constants as memory source operands is fine when the compiler runs low on regs.)

Is it a matter of some concept like "ptr adds 2 extra latency"?
Micro-fusion of a memory source operand doesn't lengthen the critical path from the non-constant input to the output. The load uop can start as soon as the address is ready, and for vector constants it's usually either a RIP-relative or [rsp+constant] addressing mode. So usually the load is ready to execute as soon as it's issued into the out-of-order part of the core. Assuming an L1d cache hit (since it will stay hot in cache if loaded every loop iteration), this is only ~5 cycles, so it will easily be ready in time if there's a dependency-chain bottleneck on the vector register input.

It doesn't even hurt front-end throughput, since the load micro-fuses with the ALU operation. Unless you're bottlenecked on load-port throughput (2 loads per clock on modern x86 CPUs), it typically makes no measurable difference, even with highly accurate measurement techniques.
 


