One of the neater and lesser-known optimizations a compiler can perform is a “branchless conditional” – a conditional statement that contains no branches.
To understand what this means, take a look at the following C-snippet:
1 2 3 4 | if(SomeFunc() == 4) return 54; else return 2; |
When compiled with no optimizations in Visual Studio, this is the output
1 2 3 4 5 6 7 8 9 10 11 12 13 | if(SomeFunc() == 4) 00A0140E call SomeFunc 00A01413 cmp eax, 4 00A01416 jne 00A01421 return 54; 00A01418 mov eax, 54 00A0141D jmp 00A01426 else return 2; 00A01421 mov eax, 2 00A01426 ret |
Nothing surprising here – almost a literal translation from C to assembly. Small, fast, efficient – what could we possibly improve upon?
One possible improvement could be removing the conditional branch – the jne instruction.
Branch instructions are not in-and-of themselves particularly slow; all a jmp instruction does is write a value to the Program Counter (PC) register, and on a simple CPU which reads and executes one instruction at a time, jne wouldn’t be slower than any other instruction which writes to a register. However, modern CPUs are anything but simple.
On modern CPUs, instructions are not read then immediately executed, one-by-one. Rather, the process is split into multiple stages, called the Instruction Pipeline. This allows, for example, the instruction two instructions in the future to be fetched (read) while the next instruction is being decoded and the current instruction is being executed, and the previous instruction is writing its results to memory! This speeds up execution time enormously, but now that the CPU is essentially executing multiple instructions at once, the CPU designers must take extreme precautions to make sure that the results of the execution are still correct!
And herein lies the problem with conditional jumps like jne: How does the processor know what the “next” instruction will be before it actually executes the conditional? The answer is: it doesn’t. It can guess (real-world programs are highly predictable, so Intel CPUs are able to guess correctly ~90% of the time), but if it guesses incorrectly, the entire Instruction Pipeline must be flushed: all the work the CPU did fetching and decoding the instructions it thought were going to be executed is invalidated and completely removed from the pipeline, and the entire pipeline sits around doing nothing while it waits for the jne instruction to complete. Essentially, a jne instruction with a failed prediction causes our fancy expensive pipelined CPU to act as though it were a boring old non-pipelined CPU.
(In fact, the situation is even more complicated than that: modern CPUs essentially have multiple pipelines, meaning they literally execute more than one instruction at once, even if there is only one core and the program is single-threaded. When a conditional-jump prediction fails, all these pipelines must be flushed and stalled, so all the pipelines sit around doing nothing when the conditional-jump executes)
Knowing all of this, let’s see how Visual Studio compiles the above code with optimizations enabled:
1 2 3 4 5 6 | 0012100B call SomeFunc 0012101A sub eax, 4 0012101D neg eax 0012101F sbb eax, eax 00121021 and eax, -52 00121024 add eax, 54 |
Whoa, what happened there? There are no longer any branch instructions! Let’s translate this back to C, to help see what’s going on:
1 2 3 4 | //A rough translation of the above assembly code eax = SomeFunc() - 4; eax = -(eax != 0); //See below eax = (eax & -52) + 54; |
It appears the branches have been replaced with some clever math and bit-fiddling. Here’s how it works (don’t feel bad if you have to read this over a few times, this is complex stuff!):
- We set
eax = SomeFunc() - 4
. Thus, ifSomeFunc() == 4
,eax = 0
; otherwise,eax != 0
. neg eax
will set the carry-flag to 0 ifeax == 0
, or 1 otherwise. Thus, the carry-flag gets set to(eax != 0)
.sbb A, B
(sbb
means “subtract with borrow”) essentially doesA = A - (B + carryFlag)
. Thus,sbb eax, eax
setseax = 0
if the carry-flag was 0, oreax = -1
if the carry-flag was 1. At this point, ifSomeFunc() == 4
,eax == 0
. Otherwise,eax == -1
- Finally, we take the bitwise-AND of
eax
with -52, and add 54. Since0 & -52 == 0
and-1 & -52 == -52
(Remember, in 2’s complement -1 has all bits set, so-1 & x
always equalsx
), this means that ifSomeFunc() == 4
(and thuseax == 0
), the result will be0 + 54 = 54
; while ifSomeFunc() != 4
(and thuseax == -1
), the result will be-52 + 54 = 2
. This is the exact same result as our original function!
So now the important question: Are branchless conditionals really faster?
Well, that depends. If the branch-predictor in the CPU is consistently right for our code, the instruction pipeline would never need to be flushed, and the code with branches would be faster – it does, after all, involve fewer operations. If, however, the branch-predictor is wrong often enough, the instruction pipeline will need to be flushed all/most of the time the code is run, which would make the branchless code faster. And unfortunately, there is no way to determine if the branch predictor will guess correctly for your code. Thus, if this is an absolutely critical portion of your code: Profile! Test both code-chunks (branchless and branching), and see which is faster in your case!
And, for the love of god, don’t write your code like the “rough translation” above to try to force the compiler to write branchless code. Not only will you confuse the compiler and probably end up making your code slower, but you will drive whoever has to maintain the code absolutely insane.
Just for reference, here are some more examples of branchless conditionals:
Example from Eldad Eilam’s “Reversing: Secret of Reverse Engineering”:
1 2 3 4 | if(LocalVariable & 0x00001000) return 1; else return 0; |
1 2 3 4 5 6 | mov eax, [ebp - 10] and eax, 0x00001000 neg eax sbb eax, eax neg eax ret |
Example from http://stackoverflow.com/questions/539836/emulating-variable-bit-shift-using-only-constant-shifts
1 2 3 4 | int isel( int a, int x, int y ) { return (a >= 0 ? x : y); }; |
1 2 3 4 5 6 | int isel( int a, int x, int y ) { int mask = a >> 31; // arithmetic shift right, splat out the sign bit // mask is 0xFFFFFFFF if (a < 0) and 0x00 otherwise. return (x & (~mask)) + (y & mask); }; |
Example from http://stackoverflow.com/questions/1610836/branchless-code-that-maps-zero-negative-and-positive-to-0-1-2
1 2 3 4 5 6 7 8 9 10 | int Compare(int x, int y) { int diff = x - y; if (diff == 0) return 0; else if (diff < 0) return 1; else return 2; } |
1 2 3 4 5 | int Compare(int x, int y) { int diff = y - x; return (!!diff) << ((diff >> 31) & 1); } |