performance5 min read|2026-06-05

Cache Lines and Memory Layout

How CPU caches work, why padding exists, and how struct layout changes throughput.

performancecpu-cachememoryoptimization

The memory wall

Your CPU can execute billions of operations per second. Main memory can deliver data at maybe a few hundred million operations per second. This gap is called the memory wall, and it gets wider every year. CPU speeds improve faster than memory speeds.

Caches exist to bridge this gap. They're small, fast memory banks that sit between your CPU and RAM, keeping copies of recently accessed data close to the processor.

Cache line anatomy

CPUs load memory in cache lines, which are 64 bytes on virtually every modern x86 processor. When you read a single int from memory, the CPU fetches the entire 64-byte block containing that int.

struct player {
    float x, y, z;           // 12 bytes (positions)
    float vx, vy, vz;        // 12 bytes (velocity)
    int health;              // 4 bytes
    int armor;               // 4 bytes
    // Total: 32 bytes — two players fit in one cache line
};

If you iterate over an array of these structs, every cache line load gives you two complete players. That's good data density.

Now look at this version:

struct player_bloated {
    float x, y, z;           // 12 bytes
    char name[64];            // 64 bytes
    float vx, vy, vz;        // 12 bytes
    char guild[64];           // 64 bytes
    int health;              // 4 bytes
    int armor;               // 4 bytes
    char bio[256];            // 256 bytes
    // Total: 416 bytes — 6.5 cache lines per player
};

Iterating over players to update positions now pulls in 6.5 cache lines per player and uses only 24 bytes from each one. That is 416 bytes loaded to use 24 bytes, or about 5.7% utilization.

Padding and alignment

Compilers insert padding bytes to align struct members to their natural alignment boundaries. A double (8 bytes) wants to start at an address divisible by 8. An int (4 bytes) wants an address divisible by 4.

struct bad_layout {
    char a;        // 1 byte
                   // 7 bytes padding
    double b;      // 8 bytes
    char c;        // 1 byte
                   // 3 bytes padding
    int d;         // 4 bytes
    // Total: 24 bytes (8 bytes wasted on padding)
};

struct good_layout {
    double b;      // 8 bytes
    int d;         // 4 bytes
    char a;        // 1 byte
    char c;        // 1 byte
                   // 2 bytes padding
    // Total: 16 bytes (2 bytes padding)
};

Same data, different order. The tighter layout saves 8 bytes per instance. With a million instances, that is 8 MB of memory and significantly fewer cache lines to load.

LayoutSizePaddingCache lines per 1000
bad_layout24 bytes8 bytes (33%)375
good_layout16 bytes2 bytes (12%)250

Hot/cold splitting

Fields in a struct can have very different access patterns. In a game engine, position and velocity get touched every frame. The player's name and bio get touched when the UI renders, which might be once per second.

The hot/cold pattern separates frequently-accessed and rarely-accessed data:

struct PlayerHot {
    position: Vec3,
    velocity: Vec3,
    health: i32,
    _pad: i32,
}

struct PlayerCold {
    name: String,
    guild: String,
    bio: String,
    inventory: Vec<Item>,
}

struct Players {
    hot: Vec<PlayerHot>,
    cold: Vec<PlayerCold>,
}

When the physics system iterates over all players, it only touches the hot array. The cold data stays out of the way, keeping your cache lines packed with useful data.

Prefetching

Modern CPUs have hardware prefetchers that detect sequential access patterns and start loading the next cache lines before you ask for them. This is why linear iteration over arrays is so fast. The prefetcher sees you reading addresses 0, 64, 128, 192 and starts loading 256, 320, 384 before your code gets there.

Random access patterns break prefetching completely. If you're jumping to addresses 7040, 192, 50816, 3264, the prefetcher can't predict where you're going next. Every access waits for the full memory latency.

This is the main reason arrays often beat linked lists for iteration despite having the same O(n) complexity. The constant factor from cache behavior can be 10-100x.

Measuring cache performance

Use perf stat on Linux to see your cache miss rates:

// High cache miss rate
for (int i = 0; i < n; i++) {
    sum += data[indices[i]];  // random access
}

// Low cache miss rate
for (int i = 0; i < n; i++) {
    sum += data[i];           // sequential access
}

The sequential version will show L1 cache hit rates above 99%. The random version will drop to 50-80% depending on array size relative to cache size.

Practical rules

  • Sort struct members by size (largest to smallest) to minimize padding
  • Split hot and cold data when structs get large
  • Prefer arrays of structs over pointer-heavy data structures when you iterate
  • If you need to iterate and also do random lookups, keep both a dense array and an index
  • Profile with hardware counters before optimizing; intuition about cache behavior is often wrong
A late-night desk scene lit by glowing monitors
Performance work usually starts with attention.