Branch Prediction and Sorted Data

branch outcomes

Predictability

sorted~0.1% misses

descending~0.1% misses

random~50% misses

branchlesspredictor bypass

speculation loop

Pipeline Cost

1guess

The predictor sends the front end down one control-flow path.

2speculate

Instructions execute before the branch has truly resolved.

3commit

A correct guess keeps throughput smooth and steady.

4flush

A miss throws away speculative work and leaves a visible stall.

The pipeline problem

Modern CPUs execute many instructions in overlapping stages. While one instruction is being executed, the next one is being decoded, and the one after that is being fetched from memory.

This works well for straight-line code. A branch, such as an if-statement, loop condition, or switch-case, makes the next instruction uncertain until the condition is evaluated. If the CPU waits, the for 10-20 cycles.

How works

Instead of waiting, the CPU guesses. It has dedicated hardware called the branch predictor that looks at the history of previous branches and predicts which way the current one will go.

If the prediction is correct, the pipeline keeps flowing. The speculatively executed instructions are committed and everything is fast.

If the prediction is wrong, the CPU has to flush the pipeline, throw away all the speculative work, and restart from the correct branch target. This costs 10-20 cycles on modern processors.

// Predictable branch: the predictor learns the pattern quickly
for (int i = 0; i < n; i++) {
    if (data[i] < 128) {  // same direction for long runs
        sum += data[i];
    }
}

// Unpredictable branch: 50/50 coin flip every iteration
for (int i = 0; i < n; i++) {
    if (random_data[i] < 128) {  // no pattern to learn
        sum += random_data[i];
    }
}

The sorted array phenomenon

This is the most famous branch prediction demo. Take an array of random integers from 0-255 and sum all values less than 128:

fn sum_below_threshold(data: &[i32], threshold: i32) -> i64 {
    let mut sum: i64 = 0;
    for &val in data {
        if val < threshold {
            sum += val as i64;
        }
    }
    sum
}

Run this on a sorted array and an unsorted array. The data and number of additions are the same. The sorted version often runs 3-5x faster.

In the sorted array, the branch goes one direction for the first half, where values are below 128, then switches direction for the second half. The predictor sees long runs and gets the branch right almost every time.

In the unsorted array, the branch is essentially random. The predictor gets it wrong roughly 50% of the time, and each misprediction costs 10-20 cycles.

branch outcomes

Predictability

sorted~0.1% misses

descending~0.1% misses

random~50% misses

branchlesspredictor bypass

Scenario	Misprediction rate	Relative time
Sorted ascending	~0.1%	1x
Sorted descending	~0.1%	1x
Random order	~50%	3-5x
All same value	~0%	0.9x

Going branchless

When sorting is impractical, you can often eliminate the branch entirely. A branchless expression leaves nothing for the predictor to guess.

// Branching version
if (x < threshold) {
    sum += x;
}

// Branchless version
sum += x * (x < threshold);

// Even better with cmov
sum += x & -(x < threshold);

The compiler will often generate these branchless versions automatically at higher optimization levels (-O2, -O3). You can check the assembly output to verify.

In Rust, the iterator approach often compiles to branchless code:

speculation loop

Pipeline Cost

1guess

The predictor sends the front end down one control-flow path.

2speculate

Instructions execute before the branch has truly resolved.

3commit

A correct guess keeps throughput smooth and steady.

4flush

A miss throws away speculative work and leaves a visible stall.

let sum: i64 = data.iter()
    .filter(|&&x| x < threshold)
    .map(|&x| x as i64)
    .sum();

Practical implications

Sorting before processing is sometimes worth it even if the sort itself costs O(n log n). If you're going to make many passes over the data with branches, the improved prediction accuracy can make the total time lower.

Loop unswitching: if a condition is invariant inside a loop, move the branch outside:

// Branch evaluated n times
for item in items {
    if config.use_fast_path {
        fast_process(item);
    } else {
        slow_process(item);
    }
}

// Branch evaluated once
if config.use_fast_path {
    for item in items { fast_process(item); }
} else {
    for item in items { slow_process(item); }
}

Avoid data-dependent branches in hot loops. If you're processing millions of elements per frame and the branch outcome depends on the data, consider a branchless alternative or restructuring to sort/partition the data first.

Predictability

Pipeline Cost

The pipeline problem

How branch prediction works

The sorted array phenomenon

Predictability

Going branchless

Pipeline Cost

Practical implications

How works