performance4 min read|2026-06-03

Branch Prediction and Sorted Data

How your CPU predicts branches, what a misprediction costs, and why sorted arrays often process faster than random ones.

performancecpubranch-predictionoptimization

The pipeline problem

Modern CPUs execute many instructions in overlapping stages. While one instruction is being executed, the next one is being decoded, and the one after that is being fetched from memory.

This works well for straight-line code. A branch, such as an if-statement, loop condition, or switch-case, makes the next instruction uncertain until the condition is evaluated. If the CPU waits, the pipeline stalls for 10-20 cycles.

How branch prediction works

Instead of waiting, the CPU guesses. It has dedicated hardware called the branch predictor that looks at the history of previous branches and predicts which way the current one will go.

If the prediction is correct, the pipeline keeps flowing. The speculatively executed instructions are committed and everything is fast.

If the prediction is wrong, the CPU has to flush the pipeline, throw away all the speculative work, and restart from the correct branch target. This costs 10-20 cycles on modern processors.

// Predictable branch: the predictor learns the pattern quickly
for (int i = 0; i < n; i++) {
    if (data[i] < 128) {  // same direction for long runs
        sum += data[i];
    }
}

// Unpredictable branch: 50/50 coin flip every iteration
for (int i = 0; i < n; i++) {
    if (random_data[i] < 128) {  // no pattern to learn
        sum += random_data[i];
    }
}

The sorted array phenomenon

This is the most famous branch prediction demo. Take an array of random integers from 0-255 and sum all values less than 128:

fn sum_below_threshold(data: &[i32], threshold: i32) -> i64 {
    let mut sum: i64 = 0;
    for &val in data {
        if val < threshold {
            sum += val as i64;
        }
    }
    sum
}

Run this on a sorted array and an unsorted array. The data and number of additions are the same. The sorted version often runs 3-5x faster.

In the sorted array, the branch goes one direction for the first half, where values are below 128, then switches direction for the second half. The predictor sees long runs and gets the branch right almost every time.

In the unsorted array, the branch is essentially random. The predictor gets it wrong roughly 50% of the time, and each misprediction costs 10-20 cycles.

ScenarioMisprediction rateRelative time
Sorted ascending~0.1%1x
Sorted descending~0.1%1x
Random order~50%3-5x
All same value~0%0.9x

Going branchless

When sorting is impractical, you can often eliminate the branch entirely. A branchless expression leaves nothing for the predictor to guess.

// Branching version
if (x < threshold) {
    sum += x;
}

// Branchless version
sum += x * (x < threshold);

// Even better with cmov
sum += x & -(x < threshold);

The compiler will often generate these branchless versions automatically at higher optimization levels (-O2, -O3). You can check the assembly output to verify.

In Rust, the iterator approach often compiles to branchless code:

let sum: i64 = data.iter()
    .filter(|&&x| x < threshold)
    .map(|&x| x as i64)
    .sum();

Practical implications

Sorting before processing is sometimes worth it even if the sort itself costs O(n log n). If you're going to make many passes over the data with branches, the improved prediction accuracy can make the total time lower.

Loop unswitching: if a condition is invariant inside a loop, move the branch outside:

// Branch evaluated n times
for item in items {
    if config.use_fast_path {
        fast_process(item);
    } else {
        slow_process(item);
    }
}

// Branch evaluated once
if config.use_fast_path {
    for item in items { fast_process(item); }
} else {
    for item in items { slow_process(item); }
}

Avoid data-dependent branches in hot loops. If you're processing millions of elements per frame and the branch outcome depends on the data, consider a branchless alternative or restructuring to sort/partition the data first.

A late-night desk scene lit by glowing monitors
Performance work usually starts with attention.