Branchless Conditionals (Compiler Optimization Technique)

One of the neater and lesser-known optimizations a compiler can perform is a “branchless conditional” – a conditional statement that contains no branches.

To understand what this means, take a look at the following C-snippet:

if(SomeFunc() == 4)

return 54;

else

return 2;

When compiled with no optimizations in Visual Studio, this is the output

if(SomeFunc() == 4)

00A0140E call SomeFunc

00A01413 cmp eax, 4

00A01416 jne 00A01421

return 54;

00A01418 mov eax, 54

00A0141D jmp 00A01426

else

return 2;

00A01421 mov eax, 2

00A01426 ret

Nothing surprising here – almost a literal translation from C to assembly. Small, fast, efficient – what could we possibly improve upon?

One possible improvement could be removing the conditional branch – the jne instruction.

Branch instructions are not in-and-of themselves particularly slow; all a jmp instruction does is write a value to the Program Counter (PC) register, and on a simple CPU which reads and executes one instruction at a time, jne wouldn’t be slower than any other instruction which writes to a register. However, modern CPUs are anything but simple.

On modern CPUs, instructions are not read then immediately executed, one-by-one. Rather, the process is split into multiple stages, called the Instruction Pipeline. This allows, for example, the instruction two instructions in the future to be fetched (read) while the next instruction is being decoded and the current instruction is being executed, and the previous instruction is writing its results to memory! This speeds up execution time enormously, but now that the CPU is essentially executing multiple instructions at once, the CPU designers must take extreme precautions to make sure that the results of the execution are still correct!

And herein lies the problem with conditional jumps like jne: How does the processor know what the “next” instruction will be before it actually executes the conditional? The answer is: it doesn’t. It can guess (real-world programs are highly predictable, so Intel CPUs are able to guess correctly ~90% of the time), but if it guesses incorrectly, the entire Instruction Pipeline must be flushed: all the work the CPU did fetching and decoding the instructions it thought were going to be executed is invalidated and completely removed from the pipeline, and the entire pipeline sits around doing nothing while it waits for the jne instruction to complete. Essentially, a jne instruction with a failed prediction causes our fancy expensive pipelined CPU to act as though it were a boring old non-pipelined CPU.

(In fact, the situation is even more complicated than that: modern CPUs essentially have multiple pipelines, meaning they literally execute more than one instruction at once, even if there is only one core and the program is single-threaded. When a conditional-jump prediction fails, all these pipelines must be flushed and stalled, so all the pipelines sit around doing nothing when the conditional-jump executes)

Knowing all of this, let’s see how Visual Studio compiles the above code with optimizations enabled:

0012100B call SomeFunc

0012101A sub eax, 4

0012101D neg eax

0012101F sbb eax, eax

00121021 and eax, -52

00121024 add eax, 54

Whoa, what happened there? There are no longer any branch instructions! Let’s translate this back to C, to help see what’s going on:

//A rough translation of the above assembly code

eax = SomeFunc() - 4;

eax = -(eax != 0); //See below

eax = (eax & -52) + 54;

It appears the branches have been replaced with some clever math and bit-fiddling. Here’s how it works (don’t feel bad if you have to read this over a few times, this is complex stuff!):

We set eax = SomeFunc() - 4. Thus, if SomeFunc() == 4, eax = 0; otherwise, eax != 0.
neg eax will set the carry-flag to 0 if eax == 0, or 1 otherwise. Thus, the carry-flag gets set to (eax != 0).sbb A, B (sbb means “subtract with borrow”) essentially does A = A - (B + carryFlag). Thus, sbb eax, eax sets eax = 0 if the carry-flag was 0, or eax = -1 if the carry-flag was 1. At this point, if SomeFunc() == 4, eax == 0. Otherwise, eax == -1
Finally, we take the bitwise-AND of eax with -52, and add 54. Since 0 & -52 == 0 and -1 & -52 == -52 (Remember, in 2’s complement -1 has all bits set, so -1 & x always equals x), this means that if SomeFunc() == 4 (and thus eax == 0), the result will be 0 + 54 = 54; while if SomeFunc() != 4 (and thus eax == -1), the result will be -52 + 54 = 2. This is the exact same result as our original function!

So now the important question: Are branchless conditionals really faster?

Well, that depends. If the branch-predictor in the CPU is consistently right for our code, the instruction pipeline would never need to be flushed, and the code with branches would be faster – it does, after all, involve fewer operations. If, however, the branch-predictor is wrong often enough, the instruction pipeline will need to be flushed all/most of the time the code is run, which would make the branchless code faster. And unfortunately, there is no way to determine if the branch predictor will guess correctly for your code. Thus, if this is an absolutely critical portion of your code: Profile! Test both code-chunks (branchless and branching), and see which is faster in your case!

And, for the love of god, don’t write your code like the “rough translation” above to try to force the compiler to write branchless code. Not only will you confuse the compiler and probably end up making your code slower, but you will drive whoever has to maintain the code absolutely insane.

Just for reference, here are some more examples of branchless conditionals:

Example from Eldad Eilam’s “Reversing: Secret of Reverse Engineering”:

if(LocalVariable & 0x00001000)

return 1;

else

return 0;

mov eax, [ebp - 10]

and eax, 0x00001000

neg eax

sbb eax, eax

neg eax

ret

Example from http://stackoverflow.com/questions/539836/emulating-variable-bit-shift-using-only-constant-shifts

int isel( int a, int x, int y )

{

return (a >= 0 ? x : y);

};

int isel( int a, int x, int y )

{

int mask = a >> 31; // arithmetic shift right, splat out the sign bit

// mask is 0xFFFFFFFF if (a < 0) and 0x00 otherwise.

return (x & (~mask)) + (y & mask);

};

Example from http://stackoverflow.com/questions/1610836/branchless-code-that-maps-zero-negative-and-positive-to-0-1-2

int Compare(int x, int y)

{

int diff = x - y;

if (diff == 0)

return 0;

else if (diff < 0)

return 1;

else

return 2;

}

int Compare(int x, int y)

{

int diff = y - x;

return (!!diff) << ((diff >> 31) & 1);

}

Additional Reading:

10 Comments

Joe says:
May 16, 2011 at 11:07 am
Well that was a fun and informative read. Now I feel like I could program a space shuttle to orbit the moon.Would love a part two… 😉
Reply to this comment
Jack says:
June 6, 2011 at 11:39 pm
A thoroughly interesting read….however:
Are you sure the claim of throughput gains are valid when the optimized sequence of instructions almost certainly need data forwarding? Have you confirmed the results with test cases?
Reply to this comment
Roy Triesscheijn says:
June 17, 2011 at 10:57 am
Hey, very interesting read, however I didn’t get this part:
“… modern CPUs essentially have multiple pipelines, meaning they literally execute more than one instruction at once, even if … When a conditional-jump prediction fails, all these pipelines must be flushed and stalled, so all the pipelines sit around doing nothing when the conditional-jump executes”
Isn’t the idea of a superscalar structure that when a compare happens both parts are calculated and the pipe that guessed wrong is flushed and the other pipe’s answer is used?
Reply to this comment
- BlueRaja says:
  June 17, 2011 at 12:39 pm
  @Roy: Perhaps I spoke too tongue-in-cheek. What you are thinking of is called “speculative execution.” It is a common optimization technique in superscalar architectures, but it is not their primary purpose.
  The idea behind superscalar architectures is simple: allow the CPU to fetch/decode/execute multiple instructions at once. They *could* do this by having multiple separate pipelines, but usually they do not; instead, the processor has multiple fetch/decode units, with multiple specialized execution units shared between them.
  I say “specialized” because each execution unit can only run some specific instructions: there is one (or more) for integer arithmetic (ALU), one+ for floating-point arithmetic (FPU), one+ for ‘jmp’ instructions (branch unit), etc. etc. The reason they do this is to remove unnecessary hardware (which takes up space, money, and generates more heat) – and since there are rarely, for instance, 10 unrelated arithmetic expressions in a row, there is no point in having 10 full execution units capable of doing arithmetic when only three will be doing arithmetic at any one time.
  Notice I said “unrelated” arithmetic expressions – because the CPU is *literally* executing multiple instructions at a time, it has to be certain that the results from one instruction don’t rely on the results of another instruction it’s currently in the process of executing. How it does this “dependency checking” is a complicated topic (which is to say, I don’t know 🙂 ), but suffice it to say that even the CPU cannot do a perfect job of determining which future instructions it can execute now, and which it has to buffer until other instructions are done.
  This is why compilers do instruction reordering/interleaving (moving assembly instructions around so that the final result is the same, but painfully difficult for humans to read): to help the superscalar processor dispatch as many instructions as possible each cycle. And that, in turn, is part of the reason it’s so difficult for humans nowadays to write hand-written assembly that is faster than the compiler’s – you have to be intimately familiar with how the processor will dispatch your instructions to each of its execution units.
  I hope this small essay helped clear up some of the confusion 🙂
  Reply to this comment
  - Zhi Wen says:
    September 11, 2011 at 4:40 am
    so, how do they create such smart compiler? Do they have some kind of special tools to help them?
    Reply to this comment
    - Erkin Alp Güney says:
      April 20, 2015 at 1:10 pm
      Yes, of course. Compiler authors are human beings.
      Reply to this comment
    - BlueRaja says:
      May 7, 2015 at 3:56 pm
      @Zhi You have compiler-authors who really, really know the architecture well. There is a reason that Intel’s C++ compiler consistently produces better-optimized code than Microsoft’s or gcc.
      Reply to this comment
George K says:
March 27, 2016 at 5:18 am
<>
So… what’s the alternative?
I’ve come across this class of issues before and it’s one of the reasons I feel C is an inadequate “cross platform assembly”. As far as I know there is no compiler-agnostic and architecture-agnostic way to indicate that a branch is very unlikely and you end up having to fight the compiler “cleverness” to get it to do what you want. (in a similar vein – trying to get the compiler to inline or not inline some function)
Reply to this comment
- BlueRaja says:
  March 29, 2016 at 7:57 pm
  Alternative to what?
  If your application truly needs the extra performance, AND you’ve benchmarked to find the hot-spots, AND you’ve determined the compiler isn’t using a branchless conditional, AND you’ve tested to ensure that using a branchless conditional would be faster, then and ONLY then you’d write the hacky C code to try to force the branchless conditional, or failing that write in assembly. In both cases, you’d also heavily document the code.
  Reply to this comment
Alexander Nava says:
February 14, 2017 at 2:00 am
I have been finding asking myself if I will ever get this good. How much school have you done, was there a click college class that made everything clear, what is your IQ, and how long have you been coding?
Reply to this comment