Branchless Conditionals (Compiler Optimization Technique)

Filed under Uncategorized

One of the neater and lesser-known optimizations a compiler can perform is a “branchless conditional” – a conditional statement that contains no branches.

To understand what this means, take a look at the following C-snippet:

When compiled with no optimizations in Visual Studio, this is the output

Nothing surprising here – almost a literal translation from C to assembly. Small, fast, efficient – what could we possibly improve upon?

One possible improvement could be removing the conditional branch – the jne instruction.

Branch instructions are not in-and-of themselves particularly slow; all a jmp instruction does is write a value to the Program Counter (PC) register, and on a simple CPU which reads and executes one instruction at a time, jne wouldn’t be slower than any other instruction which writes to a register. However, modern CPUs are anything but simple.

On modern CPUs, instructions are not read then immediately executed, one-by-one. Rather, the process is split into multiple stages, called the Instruction Pipeline. This allows, for example, the instruction two instructions in the future to be fetched (read) while the next instruction is being decoded and the current instruction is being executed, and the previous instruction is writing its results to memory! This speeds up execution time enormously, but now that the CPU is essentially executing multiple instructions at once, the CPU designers must take extreme precautions to make sure that the results of the execution are still correct!

And herein lies the problem with conditional jumps like jne: How does the processor know what the “next” instruction will be before it actually executes the conditional? The answer is: it doesn’t. It can guess (real-world programs are highly predictable, so Intel CPUs are able to guess correctly ~90% of the time), but if it guesses incorrectly, the entire Instruction Pipeline must be flushed: all the work the CPU did fetching and decoding the instructions it thought were going to be executed is invalidated and completely removed from the pipeline, and the entire pipeline sits around doing nothing while it waits for the jne instruction to complete. Essentially, a jne instruction with a failed prediction causes our fancy expensive pipelined CPU to act as though it were a boring old non-pipelined CPU.

(In fact, the situation is even more complicated than that: modern CPUs essentially have multiple pipelines, meaning they literally execute more than one instruction at once, even if there is only one core and the program is single-threaded. When a conditional-jump prediction fails, all these pipelines must be flushed and stalled, so all the pipelines sit around doing nothing when the conditional-jump executes)

Knowing all of this, let’s see how Visual Studio compiles the above code with optimizations enabled:

Whoa, what happened there? There are no longer any branch instructions! Let’s translate this back to C, to help see what’s going on:

It appears the branches have been replaced with some clever math and bit-fiddling. Here’s how it works (don’t feel bad if you have to read this over a few times, this is complex stuff!):

  • We set eax = SomeFunc() - 4. Thus, if SomeFunc() == 4, eax = 0; otherwise, eax != 0.
  • neg eax will set the carry-flag to 0 if eax == 0, or 1 otherwise. Thus, the carry-flag gets set to (eax != 0).sbb A, B (sbb means “subtract with borrow”) essentially does A = A - (B + carryFlag). Thus, sbb eax, eax sets eax = 0 if the carry-flag was 0, or eax = -1 if the carry-flag was 1. At this point, if SomeFunc() == 4, eax == 0. Otherwise, eax == -1
  • Finally, we take the bitwise-AND of eax with -52, and add 54. Since 0 & -52 == 0 and -1 & -52 == -52 (Remember, in 2’s complement -1 has all bits set, so -1 & x always equals x), this means that if SomeFunc() == 4 (and thus eax == 0), the result will be 0 + 54 = 54; while if SomeFunc() != 4 (and thus eax == -1), the result will be -52 + 54 = 2. This is the exact same result as our original function!

So now the important question: Are branchless conditionals really faster?

Well, that depends. If the branch-predictor in the CPU is consistently right for our code, the instruction pipeline would never need to be flushed, and the code with branches would be faster – it does, after all, involve fewer operations. If, however, the branch-predictor is wrong often enough, the instruction pipeline will need to be flushed all/most of the time the code is run, which would make the branchless code faster. And unfortunately, there is no way to determine if the branch predictor will guess correctly for your code. Thus, if this is an absolutely critical portion of your code: Profile! Test both code-chunks (branchless and branching), and see which is faster in your case!

And, for the love of god, don’t write your code like the “rough translation” above to try to force the compiler to write branchless code. Not only will you confuse the compiler and probably end up making your code slower, but you will drive whoever has to maintain the code absolutely insane.


Just for reference, here are some more examples of branchless conditionals:

Example from Eldad Eilam’s “Reversing: Secret of Reverse Engineering”:

Example from http://stackoverflow.com/questions/539836/emulating-variable-bit-shift-using-only-constant-shifts

Example from http://stackoverflow.com/questions/1610836/branchless-code-that-maps-zero-negative-and-positive-to-0-1-2


Additional Reading:

10 Comments

  1. Joe says:

    Well that was a fun and informative read. Now I feel like I could program a space shuttle to orbit the moon.Would love a part two… 😉

  2. Jack says:

    A thoroughly interesting read….however:
    Are you sure the claim of throughput gains are valid when the optimized sequence of instructions almost certainly need data forwarding? Have you confirmed the results with test cases?

  3. Hey, very interesting read, however I didn’t get this part:

    “… modern CPUs essentially have multiple pipelines, meaning they literally execute more than one instruction at once, even if … When a conditional-jump prediction fails, all these pipelines must be flushed and stalled, so all the pipelines sit around doing nothing when the conditional-jump executes”

    Isn’t the idea of a superscalar structure that when a compare happens both parts are calculated and the pipe that guessed wrong is flushed and the other pipe’s answer is used?

    • BlueRaja says:

      @Roy: Perhaps I spoke too tongue-in-cheek. What you are thinking of is called “speculative execution.” It is a common optimization technique in superscalar architectures, but it is not their primary purpose.

      The idea behind superscalar architectures is simple: allow the CPU to fetch/decode/execute multiple instructions at once. They *could* do this by having multiple separate pipelines, but usually they do not; instead, the processor has multiple fetch/decode units, with multiple specialized execution units shared between them.

      I say “specialized” because each execution unit can only run some specific instructions: there is one (or more) for integer arithmetic (ALU), one+ for floating-point arithmetic (FPU), one+ for ‘jmp’ instructions (branch unit), etc. etc. The reason they do this is to remove unnecessary hardware (which takes up space, money, and generates more heat) – and since there are rarely, for instance, 10 unrelated arithmetic expressions in a row, there is no point in having 10 full execution units capable of doing arithmetic when only three will be doing arithmetic at any one time.

      Notice I said “unrelated” arithmetic expressions – because the CPU is *literally* executing multiple instructions at a time, it has to be certain that the results from one instruction don’t rely on the results of another instruction it’s currently in the process of executing. How it does this “dependency checking” is a complicated topic (which is to say, I don’t know 🙂 ), but suffice it to say that even the CPU cannot do a perfect job of determining which future instructions it can execute now, and which it has to buffer until other instructions are done.

      This is why compilers do instruction reordering/interleaving (moving assembly instructions around so that the final result is the same, but painfully difficult for humans to read): to help the superscalar processor dispatch as many instructions as possible each cycle. And that, in turn, is part of the reason it’s so difficult for humans nowadays to write hand-written assembly that is faster than the compiler’s – you have to be intimately familiar with how the processor will dispatch your instructions to each of its execution units.

      I hope this small essay helped clear up some of the confusion 🙂

  4. George K says:

    <>

    So… what’s the alternative?

    I’ve come across this class of issues before and it’s one of the reasons I feel C is an inadequate “cross platform assembly”. As far as I know there is no compiler-agnostic and architecture-agnostic way to indicate that a branch is very unlikely and you end up having to fight the compiler “cleverness” to get it to do what you want. (in a similar vein – trying to get the compiler to inline or not inline some function)

    • BlueRaja says:

      Alternative to what?

      If your application truly needs the extra performance, AND you’ve benchmarked to find the hot-spots, AND you’ve determined the compiler isn’t using a branchless conditional, AND you’ve tested to ensure that using a branchless conditional would be faster, then and ONLY then you’d write the hacky C code to try to force the branchless conditional, or failing that write in assembly. In both cases, you’d also heavily document the code.

  5. Alexander Nava says:

    I have been finding asking myself if I will ever get this good. How much school have you done, was there a click college class that made everything clear, what is your IQ, and how long have you been coding?


Trackbacks/Pingbacks

  1. golb

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*