Branchless Conditionals (Compiler Optimization Technique)

Filed under Uncategorized

One of the neater and lesser-known optimizations a compiler can perform is a “branchless conditional” – a conditional statement that contains no branches (go figure).

To understand what this means, take a look at the following C-snippet:

if(SomeFunc() == 4)
    return 54;
else
    return 2;

When compiled with no optimizations in Visual Studio, this is the output

    if(SomeFunc() == 4)
00A0140E call SomeFunc (0A01145h)
00A01413 cmp eax, 4
00A01416 jne 0A01421h
        return 54;
00A01418 mov eax, 54
00A0141D jmp 0A01426h
    else
        return 2;
00A01421 mov eax, 2
0A01426h ret

Nothing surprising here – almost a literal translation from C to assembly. Small, fast, efficient – what could we possibly improve upon?

One possible improvement could be removing the branches – the jmp and jne instructions.

Branch instructions are not in-and-of themselves particularly slow; all a jmp instruction does is write a value to the Program Counter (PC) register, and on a simple CPU which reads and executes one instruction at a time, jmp wouldn’t be slower than any other instruction which writes to a register. However, modern CPUs are anything but simple.

On modern CPUs, instructions are not read then immediately executed, one-by-one. Rather, the process is split into multiple stages, called the Instruction Pipeline. This allows, for example, the instruction two instructions in the future to be fetched (read) while the next instruction is being decoded and the current instruction is being executed, and the previous instruction is writing its results to memory! This allows incredible speedups in program execution time, but now that the CPU is essentially executing multiple instructions at once, the CPU designers must take extreme precautions to make sure that the results of the execution are still correct!

And herein lies the problem with conditional jumps like jne: How does the processor know what the “next” instruction will be before it actually executes the conditional? The answer is: it doesn’t. It can guess (I believe I read that the Intel x86 guesses correctly ~60% of the time), but if it guesses incorrectly, the entire Instruction Pipeline must be flushed: all the work the CPU did fetching and decoding the instructions it thought were going to be executed is invalidated and completely removed from the pipeline, and the entire pipeline sits around doing nothing while it waits for the jne instruction to complete. Essentially, a jne instruction with a failed prediction causes our fancy expensive pipelined CPU to act as though it were a boring old non-pipelined CPU.

(In fact, the situation is even more complicated than that: modern CPUs essentially have multiple pipelines, meaning they literally execute more than one instruction at once, even if there is only one core and the program is single-threaded. When a conditional-jump prediction fails, all these pipelines must be flushed and stalled, so all the pipelines sit around doing nothing when the conditional-jump executes)

Knowing all of this, let’s see how Visual Studio compiles the above code with optimizations enabled:
0012100B call SomeFunc (0A01145h)
0012101A sub eax, 4
0012101D neg eax
0012101F sbb eax, eax
00121021 and eax, -52
00121024 add eax, 54

Whoa, what happened there? There are no longer any branch instructions! Let’s translate this back to C, to help see what’s going on:
//A rough translation of the above assembly code
eax = SomeFunc() - 4;
eax = -(eax != 0); //See below
eax = (eax & -52) + 54;

It appears the branches have been replaced with some clever math and bit-fiddling. Here’s how it works (don’t feel bad if you have to read this over a few times, this is complex stuff!):

  • We set eax = SomeFunc() – 4. Thus, if SomeFunc() == 4, eax = 0; otherwise, eax != 0.
  • neg eax will set the carry-flag to 0 if eax == 0, or 1 otherwise. Thus, the carry-flag gets set to (eax != 0).

    sbb A, B (sbb means “subtract with borrow”) essentially does A = A – (B + carryFlag). Thus, sbb eax, eax sets eax = 0 if the carry-flag was 0, or eax = -1 if the carry-flag was 1.

    At this point, if SomeFunc() == 4, eax == 0. Otherwise, eax == -1

  • Finally, we take the bitwise-AND of eax with -52, and add 54. Since 0 & -52 == 0 and -1 & -52 == -52 (Remember, in 2′s complement -1 has all bits set, so -1 & x always equals x), this means that if SomeFunc() == 4 (and thus eax == 0), the result will be 0 + 54 = 54; while if SomeFunc() != 4 (and thus eax == -1), the result will be -52 + 54 = 2. This is the exact same result as our original function!

So now the important question: Are branchless conditionals really faster?

Well, that depends. If the branch-predictor in the CPU is consistantly right for our code, the instruction pipeline would never need to be flushed, and the code with branches would be faster – it does, after all, involve a lot fewer operations. If, however, the branch-predictor is wrong often enough, the instruction pipeline will need to be flushed all/most of the time the code is run, which would make the branchless code faster. And unfortunately, there is no way to determine if the branch predictor will guess correctly for your code. Thus, if this is an absolutely critical portion of your code: Profile! Test both code-chunks (branchless and branching), and see which is faster in your case!

And, for the love of god, don’t write your code like the “rough translation” above to try to force the compiler to write branchless code. Not only will you confuse the compiler and probably end up making your code slower, but you will drive whoever has to maintain the code absolutely insane.



Just for reference, here are some more examples of branchless conditionals:

Example from Eldad Eilam’s “Reversing: Secret of Reverse Engineering”:

if(LocalVariable & 0x00001000)
    return 1;
else
    return 0;

mov eax, [ebp - 10]
and eax, 0x00001000
neg eax
sbb eax, eax
neg eax
ret

Example from http://stackoverflow.com/questions/539836/emulating-variable-bit-shift-using-only-constant-shifts

int isel( int a, int x, int y )
{
    return (a >= 0 ? x : y);
};

int isel( int a, int x, int y )
{
    int mask = a >> 31; // arithmetic shift right, splat out the sign bit
    // mask is 0xFFFFFFFF if (a < 0) and 0x00 otherwise.
    return (x & (~mask)) + (y & mask);
};

Example from http://stackoverflow.com/questions/1610836/branchless-code-that-maps-zero-negative-and-positive-to-0-1-2

int Compare(int x, int y)
{
    int diff = x - y;
    if (diff == 0)
        return 0;
    else if (diff < 0)
        return 1;
    else
        return 2;
}

int Compare(int x, int y)
{
    int diff = y - x;
    return (!!diff) << ((diff >> 31) & 1);
}


Additional Reading:

5 Comments

  1. Joe says:

    Well that was a fun and informative read. Now I feel like I could program a space shuttle to orbit the moon.Would love a part two… ;)

  2. Jack says:

    A thoroughly interesting read….however:
    Are you sure the claim of throughput gains are valid when the optimized sequence of instructions almost certainly need data forwarding? Have you confirmed the results with test cases?

  3. Hey, very interesting read, however I didn’t get this part:

    “… modern CPUs essentially have multiple pipelines, meaning they literally execute more than one instruction at once, even if … When a conditional-jump prediction fails, all these pipelines must be flushed and stalled, so all the pipelines sit around doing nothing when the conditional-jump executes”

    Isn’t the idea of a superscalar structure that when a compare happens both parts are calculated and the pipe that guessed wrong is flushed and the other pipe’s answer is used?

    • BlueRaja says:

      @Roy: Perhaps I spoke too tongue-in-cheek. What you are thinking of is called “speculative execution.” It is a common optimization technique in superscalar architectures, but it is not their primary purpose.

      The idea behind superscalar architectures is simple: allow the CPU to fetch/decode/execute multiple instructions at once. They *could* do this by having multiple separate pipelines, but usually they do not; instead, the processor has multiple fetch/decode units, with multiple specialized execution units shared between them.

      I say “specialized” because each execution unit can only run some specific instructions: there is one (or more) for integer arithmetic (ALU), one+ for floating-point arithmetic (FPU), one+ for ‘jmp’ instructions (branch unit), etc. etc. The reason they do this is to remove unnecessary hardware (which takes up space, money, and generates more heat) – and since there are rarely, for instance, 10 unrelated arithmetic expressions in a row, there is no point in having 10 full execution units capable of doing arithmetic when only three will be doing arithmetic at any one time.

      Notice I said “unrelated” arithmetic expressions – because the CPU is *literally* executing multiple instructions at a time, it has to be certain that the results from one instruction don’t rely on the results of another instruction it’s currently in the process of executing. How it does this “dependency checking” is a complicated topic (which is to say, I don’t know :) ), but suffice it to say that even the CPU cannot do a perfect job of determining which future instructions it can execute now, and which it has to buffer until other instructions are done.

      This is why compilers do instruction reordering/interleaving (moving assembly instructions around so that the final result is the same, but painfully difficult for humans to read): to help the superscalar processor dispatch as many instructions as possible each cycle. And that, in turn, is part of the reason it’s so difficult for humans nowadays to write hand-written assembly that is faster than the compiler’s – you have to be intimately familiar with how the processor will dispatch your instructions to each of its execution units.

      I hope this small essay helped clear up some of the confusion :)


Trackbacks/Pingbacks

  1. golb

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*