The pipeline problem
Modern CPUs execute many instructions in overlapping stages. While one instruction is being executed, the next one is being decoded, and the one after that is being fetched from memory.
This works well for straight-line code. A branch, such as an if-statement, loop condition, or switch-case, makes the next instruction uncertain until the condition is evaluated. If the CPU waits, the pipeline stalls for 10-20 cycles.
How branch prediction works
Instead of waiting, the CPU guesses. It has dedicated hardware called the branch predictor that looks at the history of previous branches and predicts which way the current one will go.
If the prediction is correct, the pipeline keeps flowing. The speculatively executed instructions are committed and everything is fast.
If the prediction is wrong, the CPU has to flush the pipeline, throw away all the speculative work, and restart from the correct branch target. This costs 10-20 cycles on modern processors.
// Predictable branch: the predictor learns the pattern quickly
for (int i = 0; i < n; i++) {
if (data[i] < 128) { // same direction for long runs
sum += data[i];
}
}
// Unpredictable branch: 50/50 coin flip every iteration
for (int i = 0; i < n; i++) {
if (random_data[i] < 128) { // no pattern to learn
sum += random_data[i];
}
}The sorted array phenomenon
This is the most famous branch prediction demo. Take an array of random integers from 0-255 and sum all values less than 128:
fn sum_below_threshold(data: &[i32], threshold: i32) -> i64 {
let mut sum: i64 = 0;
for &val in data {
if val < threshold {
sum += val as i64;
}
}
sum
}Run this on a sorted array and an unsorted array. The data and number of additions are the same. The sorted version often runs 3-5x faster.
In the sorted array, the branch goes one direction for the first half, where values are below 128, then switches direction for the second half. The predictor sees long runs and gets the branch right almost every time.
In the unsorted array, the branch is essentially random. The predictor gets it wrong roughly 50% of the time, and each misprediction costs 10-20 cycles.
| Scenario | Misprediction rate | Relative time |
|---|---|---|
| Sorted ascending | ~0.1% | 1x |
| Sorted descending | ~0.1% | 1x |
| Random order | ~50% | 3-5x |
| All same value | ~0% | 0.9x |
Going branchless
When sorting is impractical, you can often eliminate the branch entirely. A branchless expression leaves nothing for the predictor to guess.
// Branching version
if (x < threshold) {
sum += x;
}
// Branchless version
sum += x * (x < threshold);
// Even better with cmov
sum += x & -(x < threshold);The compiler will often generate these branchless versions automatically at higher optimization levels (-O2, -O3). You can check the assembly output to verify.
In Rust, the iterator approach often compiles to branchless code:
let sum: i64 = data.iter()
.filter(|&&x| x < threshold)
.map(|&x| x as i64)
.sum();Practical implications
Sorting before processing is sometimes worth it even if the sort itself costs O(n log n). If you're going to make many passes over the data with branches, the improved prediction accuracy can make the total time lower.
Loop unswitching: if a condition is invariant inside a loop, move the branch outside:
// Branch evaluated n times
for item in items {
if config.use_fast_path {
fast_process(item);
} else {
slow_process(item);
}
}
// Branch evaluated once
if config.use_fast_path {
for item in items { fast_process(item); }
} else {
for item in items { slow_process(item); }
}Avoid data-dependent branches in hot loops. If you're processing millions of elements per frame and the branch outcome depends on the data, consider a branchless alternative or restructuring to sort/partition the data first.
