Cache Lines and Memory Layout

layout pressure

Memory Density

tight struct16 B

bloated struct416 B

cache line64 B

padding waste33%

working set

Hot And Cold Split

1physics fields

Keep position, velocity, and health packed for every-frame passes.

2ui strings

Move names, bios, and inventory off the hot path.

3sequential scan

Let the prefetcher pull future lines before the loop reaches them.

4random jump

Expect larger stalls once address patterns stop making sense.

The memory wall

Your CPU can execute billions of operations per second. Main memory can deliver data at maybe a few hundred million operations per second. This gap is called the memory wall, and it gets wider every year. CPU speeds improve faster than memory speeds.

Caches exist to bridge this gap. They're small, fast memory banks that sit between your CPU and RAM, keeping copies of recently accessed data close to the processor.

anatomy

CPUs load memory in cache lines, which are 64 bytes on virtually every modern x86 processor. When you read a single int from memory, the CPU fetches the entire 64-byte block containing that int.

struct player {
    float x, y, z;           // 12 bytes (positions)
    float vx, vy, vz;        // 12 bytes (velocity)
    int health;              // 4 bytes
    int armor;               // 4 bytes
    // Total: 32 bytes — two players fit in one cache line
};

If you iterate over an array of these structs, every cache line load gives you two complete players. That's good data density.

Now look at this version:

struct player_bloated {
    float x, y, z;           // 12 bytes
    char name[64];            // 64 bytes
    float vx, vy, vz;        // 12 bytes
    char guild[64];           // 64 bytes
    int health;              // 4 bytes
    int armor;               // 4 bytes
    char bio[256];            // 256 bytes
    // Total: 416 bytes — 6.5 cache lines per player
};

Iterating over players to update positions now pulls in 6.5 cache lines per player and uses only 24 bytes from each one. That is 416 bytes loaded to use 24 bytes, or about 5.7% utilization.

layout pressure

Memory Density

tight struct16 B

bloated struct416 B

cache line64 B

padding waste33%

Padding and alignment

Compilers insert padding bytes to align struct members to their natural alignment boundaries. A double (8 bytes) wants to start at an address divisible by 8. An int (4 bytes) wants an address divisible by 4.

struct bad_layout {
    char a;        // 1 byte
                   // 7 bytes padding
    double b;      // 8 bytes
    char c;        // 1 byte
                   // 3 bytes padding
    int d;         // 4 bytes
    // Total: 24 bytes (8 bytes wasted on padding)
};

struct good_layout {
    double b;      // 8 bytes
    int d;         // 4 bytes
    char a;        // 1 byte
    char c;        // 1 byte
                   // 2 bytes padding
    // Total: 16 bytes (2 bytes padding)
};

Same data, different order. The tighter layout saves 8 bytes per instance. With a million instances, that is 8 MB of memory and significantly fewer cache lines to load.

Layout	Size	Padding	Cache lines per 1000
bad_layout	24 bytes	8 bytes (33%)	375
good_layout	16 bytes	2 bytes (12%)	250

Hot/cold splitting

Fields in a struct can have very different access patterns. In a game engine, position and velocity get touched every frame. The player's name and bio get touched when the UI renders, which might be once per second.

The hot/cold pattern separates frequently-accessed and rarely-accessed data:

struct PlayerHot {
    position: Vec3,
    velocity: Vec3,
    health: i32,
    _pad: i32,
}

struct PlayerCold {
    name: String,
    guild: String,
    bio: String,
    inventory: Vec<Item>,
}

struct Players {
    hot: Vec<PlayerHot>,
    cold: Vec<PlayerCold>,
}

When the physics system iterates over all players, it only touches the hot array. The cold data stays out of the way, keeping your cache lines packed with useful data.

working set

Hot And Cold Split

1physics fields

Keep position, velocity, and health packed for every-frame passes.

2ui strings

Move names, bios, and inventory off the hot path.

3sequential scan

Let the prefetcher pull future lines before the loop reaches them.

4random jump

Expect larger stalls once address patterns stop making sense.

Prefetching

Modern CPUs have hardware prefetchers that detect sequential access patterns and start loading the next cache lines before you ask for them. This is why linear iteration over arrays is so fast. The prefetcher sees you reading addresses 0, 64, 128, 192 and starts loading 256, 320, 384 before your code gets there.

Random access patterns break prefetching completely. If you're jumping to addresses 7040, 192, 50816, 3264, the prefetcher can't predict where you're going next. Every access waits for the full memory latency.

This is the main reason arrays often beat linked lists for iteration despite having the same O(n) complexity. The constant factor from cache behavior can be 10-100x.

Measuring cache performance

Use perf stat on Linux to see your cache miss rates:

// High cache miss rate
for (int i = 0; i < n; i++) {
    sum += data[indices[i]];  // random access
}

// Low cache miss rate
for (int i = 0; i < n; i++) {
    sum += data[i];           // sequential access
}

The sequential version will show L1 cache hit rates above 99%. The random version will drop to 50-80% depending on array size relative to cache size.

Practical rules

Sort struct members by size (largest to smallest) to minimize padding
Split hot and cold data when structs get large
Prefer arrays of structs over pointer-heavy data structures when you iterate
If you need to iterate and also do random lookups, keep both a dense array and an index
Profile with hardware counters before optimizing; intuition about cache behavior is often wrong

Memory Density

Hot And Cold Split

The memory wall

Cache line anatomy

Memory Density

Padding and alignment

Hot/cold splitting

Hot And Cold Split

Prefetching

Measuring cache performance

Practical rules

anatomy