
Branchless Conditionals (Compiler Optimization Technique)


One of the neater and lesser-known optimizations a compiler can perform is the “branchless conditional” – a conditional statement compiled down to code that contains no branch instructions at all.

To understand what this means, take a look at the following C snippet:
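Say it is something along these lines (the wrapper name Test here is just a placeholder; SomeFunc and the constants match the walkthrough further down):

    int SomeFunc(void);           /* some external function we branch on */

    int Test(void)                /* "Test" is a placeholder name        */
    {
        if (SomeFunc() == 4)
            return 54;
        else
            return 2;
    }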

When compiled with no optimizations in Visual Studio, this is the output:
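The important part of that output looks roughly like this (a simplified sketch – the real listing also includes stack-frame setup, and the label names are invented here):

        call  SomeFunc        ; eax = SomeFunc()
        cmp   eax, 4          ; compare the result against 4
        jne   SHORT $ELSE     ; not equal -> jump to the else-branch
        mov   eax, 54         ; if-branch: return 54
        jmp   SHORT $DONE
    $ELSE:
        mov   eax, 2          ; else-branch: return 2
    $DONE:
        ret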

Nothing surprising here – almost a literal translation from C to assembly. Small, fast, efficient – what could we possibly improve upon?

One possible improvement could be removing the conditional branch – the jne instruction.

Branch instructions are not, in and of themselves, particularly slow; all a jmp instruction does is write a value to the Program Counter (PC) register, and on a simple CPU which reads and executes one instruction at a time, jne wouldn’t be any slower than any other instruction that writes to a register. However, modern CPUs are anything but simple.

On modern CPUs, instructions are not read and then immediately executed, one by one. Rather, the process is split into multiple stages, called the Instruction Pipeline. This allows, for example, the instruction two slots ahead to be fetched (read) while the next instruction is being decoded, the current instruction is being executed, and the previous instruction is writing its results to memory! This speeds up execution enormously, but now that the CPU is essentially executing multiple instructions at once, the CPU designers must take extreme precautions to make sure that the results of the execution are still correct!

And herein lies the problem with conditional jumps like jne: How does the processor know what the “next” instruction will be before it actually executes the conditional? The answer is: it doesn’t. It can guess (real-world programs are highly predictable, so Intel CPUs are able to guess correctly ~90% of the time), but if it guesses incorrectly, the entire Instruction Pipeline must be flushed: all the work the CPU did fetching and decoding the instructions it thought were going to be executed is invalidated and completely removed from the pipeline, and the entire pipeline sits around doing nothing while it waits for the jne instruction to complete. Essentially, a jne instruction with a failed prediction causes our fancy expensive pipelined CPU to act as though it were a boring old non-pipelined CPU.

(In fact, the situation is even more complicated than that: modern CPUs essentially have multiple pipelines, meaning they literally execute more than one instruction at once, even if there is only one core and the program is single-threaded. When a conditional-jump prediction fails, all of these pipelines must be flushed and stalled, so they all sit around doing nothing while the conditional jump executes.)

Knowing all of this, let’s see how Visual Studio compiles the above code with optimizations enabled:
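The body of the function now boils down to something like this (again a sketch – exact register bookkeeping aside, each step is explained below):

        call  SomeFunc        ; eax = SomeFunc()
        sub   eax, 4          ; eax = SomeFunc() - 4
        neg   eax
        sbb   eax, eax
        and   eax, -52        ; -52 = 0FFFFFFCCh
        add   eax, 54
        ret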

Whoa, what happened there? There are no longer any branch instructions! Let’s translate this back to C, to help see what’s going on:
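A rough translation (a sketch; the local variable eax stands in for the CPU register of the same name):

    int SomeFunc(void);

    int Test(void)
    {
        int eax = SomeFunc() - 4;   /* 0 exactly when SomeFunc() == 4     */
        eax = -(eax != 0);          /* 0 or -1 – this is the neg/sbb pair */
        return (eax & -52) + 54;    /* 0 + 54 = 54, or -52 + 54 = 2       */
    }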

It appears the branches have been replaced with some clever math and bit-fiddling. Here’s how it works (don’t feel bad if you have to read this over a few times, this is complex stuff!):

  • We set eax = SomeFunc() - 4. Thus, if SomeFunc() == 4, eax == 0; otherwise, eax != 0.
  • neg eax sets the carry-flag to 0 if eax == 0, or to 1 otherwise. In other words, the carry-flag gets set to (eax != 0).
  • sbb A, B (sbb means “subtract with borrow”) essentially does A = A - (B + carryFlag). Thus, sbb eax, eax sets eax = 0 if the carry-flag was 0, or eax = -1 if the carry-flag was 1. At this point, if SomeFunc() == 4, eax == 0; otherwise, eax == -1.
  • Finally, we take the bitwise-AND of eax with -52, and add 54. Since 0 & -52 == 0 and -1 & -52 == -52 (remember, in 2’s complement -1 has all bits set, so -1 & x always equals x), this means that if SomeFunc() == 4 (and thus eax == 0), the result will be 0 + 54 = 54; while if SomeFunc() != 4 (and thus eax == -1), the result will be -52 + 54 = 2. This is the exact same result as our original function!
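If you want to convince yourself the arithmetic holds up, a quick sanity check along these lines (with a plain parameter standing in for the return value of SomeFunc()) compares the two formulations over a range of inputs:

    #include <assert.h>

    static int with_branch(int x)    { return (x == 4) ? 54 : 2; }

    static int without_branch(int x)
    {
        int t = x - 4;
        t = -(t != 0);              /* 0 or -1, mirroring neg + sbb */
        return (t & -52) + 54;      /* 54 or 2                      */
    }

    int main(void)
    {
        for (int x = -100; x <= 100; ++x)
            assert(with_branch(x) == without_branch(x));
        return 0;
    }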

So now the important question: Are branchless conditionals really faster?

Well, that depends. If the branch predictor in the CPU is consistently right for our code, the instruction pipeline never needs to be flushed, and the code with branches will be faster – it does, after all, involve fewer operations. If, however, the branch predictor is wrong often enough, the instruction pipeline will be flushed all or most of the time the code runs, which makes the branchless code faster. And unfortunately, there is no reliable way to know ahead of time whether the branch predictor will guess correctly for your code. Thus, if this is an absolutely critical portion of your code: Profile! Test both versions (branchless and branching), and see which is faster in your case!

And, for the love of god, don’t write your code like the “rough translation” above to try to force the compiler to write branchless code. Not only will you confuse the compiler and probably end up making your code slower, but you will drive whoever has to maintain the code absolutely insane.


Just for reference, here are some more examples of branchless conditionals:

Example from Eldad Eilam’s “Reversing: Secrets of Reverse Engineering”:

Example from http://stackoverflow.com/questions/539836/emulating-variable-bit-shift-using-only-constant-shifts
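That question is about computing x << n when only constant shift amounts are available. One branchless sketch in that spirit (not necessarily the exact code quoted in the thread) selects each power-of-two shift with a mask built from the corresponding bit of n:

    /* Shift x left by n (0..31) using only constant shift amounts. */
    unsigned shift_left(unsigned x, unsigned n)
    {
        unsigned m;
        m = 0u - ((n >> 0) & 1u); x = ((x << 1)  & m) | (x & ~m);
        m = 0u - ((n >> 1) & 1u); x = ((x << 2)  & m) | (x & ~m);
        m = 0u - ((n >> 2) & 1u); x = ((x << 4)  & m) | (x & ~m);
        m = 0u - ((n >> 3) & 1u); x = ((x << 8)  & m) | (x & ~m);
        m = 0u - ((n >> 4) & 1u); x = ((x << 16) & m) | (x & ~m);
        return x;
    }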

Example from http://stackoverflow.com/questions/1610836/branchless-code-that-maps-zero-negative-and-positive-to-0-1-2
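In the spirit of that last one (again a sketch, not necessarily the answer actually given there), the mapping can be done with comparison operators, which evaluate to 0 or 1 in C and typically compile to flag-setting instructions rather than branches:

    /* 0 -> 0, negative -> 1, positive -> 2 */
    int classify(int x)
    {
        return (x != 0) + (x > 0);
    }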