The memory wall
Your CPU can execute billions of operations per second. Main memory can deliver data at maybe a few hundred million operations per second. This gap is called the memory wall, and it gets wider every year. CPU speeds improve faster than memory speeds.
Caches exist to bridge this gap. They're small, fast memory banks that sit between your CPU and RAM, keeping copies of recently accessed data close to the processor.
Cache line anatomy
CPUs load memory in cache lines, which are 64 bytes on virtually every modern x86 processor. When you read a single int from memory, the CPU fetches the entire 64-byte block containing that int.
struct player {
float x, y, z; // 12 bytes (positions)
float vx, vy, vz; // 12 bytes (velocity)
int health; // 4 bytes
int armor; // 4 bytes
// Total: 32 bytes — two players fit in one cache line
};If you iterate over an array of these structs, every cache line load gives you two complete players. That's good data density.
Now look at this version:
struct player_bloated {
float x, y, z; // 12 bytes
char name[64]; // 64 bytes
float vx, vy, vz; // 12 bytes
char guild[64]; // 64 bytes
int health; // 4 bytes
int armor; // 4 bytes
char bio[256]; // 256 bytes
// Total: 416 bytes — 6.5 cache lines per player
};Iterating over players to update positions now pulls in 6.5 cache lines per player and uses only 24 bytes from each one. That is 416 bytes loaded to use 24 bytes, or about 5.7% utilization.
Padding and alignment
Compilers insert padding bytes to align struct members to their natural alignment boundaries. A double (8 bytes) wants to start at an address divisible by 8. An int (4 bytes) wants an address divisible by 4.
struct bad_layout {
char a; // 1 byte
// 7 bytes padding
double b; // 8 bytes
char c; // 1 byte
// 3 bytes padding
int d; // 4 bytes
// Total: 24 bytes (8 bytes wasted on padding)
};
struct good_layout {
double b; // 8 bytes
int d; // 4 bytes
char a; // 1 byte
char c; // 1 byte
// 2 bytes padding
// Total: 16 bytes (2 bytes padding)
};Same data, different order. The tighter layout saves 8 bytes per instance. With a million instances, that is 8 MB of memory and significantly fewer cache lines to load.
| Layout | Size | Padding | Cache lines per 1000 |
|---|---|---|---|
| bad_layout | 24 bytes | 8 bytes (33%) | 375 |
| good_layout | 16 bytes | 2 bytes (12%) | 250 |
Hot/cold splitting
Fields in a struct can have very different access patterns. In a game engine, position and velocity get touched every frame. The player's name and bio get touched when the UI renders, which might be once per second.
The hot/cold pattern separates frequently-accessed and rarely-accessed data:
struct PlayerHot {
position: Vec3,
velocity: Vec3,
health: i32,
_pad: i32,
}
struct PlayerCold {
name: String,
guild: String,
bio: String,
inventory: Vec<Item>,
}
struct Players {
hot: Vec<PlayerHot>,
cold: Vec<PlayerCold>,
}When the physics system iterates over all players, it only touches the hot array. The cold data stays out of the way, keeping your cache lines packed with useful data.
Prefetching
Modern CPUs have hardware prefetchers that detect sequential access patterns and start loading the next cache lines before you ask for them. This is why linear iteration over arrays is so fast. The prefetcher sees you reading addresses 0, 64, 128, 192 and starts loading 256, 320, 384 before your code gets there.
Random access patterns break prefetching completely. If you're jumping to addresses 7040, 192, 50816, 3264, the prefetcher can't predict where you're going next. Every access waits for the full memory latency.
This is the main reason arrays often beat linked lists for iteration despite having the same O(n) complexity. The constant factor from cache behavior can be 10-100x.
Measuring cache performance
Use perf stat on Linux to see your cache miss rates:
// High cache miss rate
for (int i = 0; i < n; i++) {
sum += data[indices[i]]; // random access
}
// Low cache miss rate
for (int i = 0; i < n; i++) {
sum += data[i]; // sequential access
}The sequential version will show L1 cache hit rates above 99%. The random version will drop to 50-80% depending on array size relative to cache size.
Practical rules
- Sort struct members by size (largest to smallest) to minimize padding
- Split hot and cold data when structs get large
- Prefer arrays of structs over pointer-heavy data structures when you iterate
- If you need to iterate and also do random lookups, keep both a dense array and an index
- Profile with hardware counters before optimizing; intuition about cache behavior is often wrong
