Cache Miss and Data Locality

We care about performance, and we want our code to be fast – sometimes more than necessary. But we usually look for problems in the wrong places: we try to optimize our code by choosing faster algorithms, yet rarely stop to think about where we put our data.

In recent years, CPU speed has grown enormously compared to memory speed. As good as that sounds, it is not actually a good thing. In the past, CPU and memory speeds were similar and the two worked in harmony. Today, because of the speed gap, the CPU has to wait on memory for its data, wasting precious cycles doing nothing.

This is called a cache miss. Whenever the CPU wants a piece of data and can't find it in a cache – that is a cache miss. It then looks in the next, higher-level cache; if the data isn't there either, that's another cache miss. It keeps looking in higher and higher level caches until it finds what it is looking for, going all the way out to main memory if necessary.

The computer I am writing this post on is a Mid 2011 Mac Mini. According to the CPU specs, it has a 32KB L1 cache and a 256KB L2 cache on each core, and a 3MB L3 cache shared across the chip. The smaller the cache, the faster it is.

Cache Miss

A major cause of cache misses is bad data locality. Let's examine a simple piece of code written in C++.

First, a class that, when instantiated, creates an object 1KB in size. A minimal sketch of what it could look like – the names Big, value, and clutter are placeholders of mine, not the original listing's:
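    class Big {
    public:
        int value;           // the field our loop will actually write
        char clutter[1020];  // filler that pads the object out to 1KB
    };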

Second, a class that creates an object 32 bits (4 bytes) in size, holding nothing but the field we need:
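    class Small {
    public:
        int value;  // 32 bits: just the field, no clutter
    };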

Here, we allocate an array of each, print the array addresses (we will need them in lldb later), and loop through both arrays while timing each pass. Again a sketch: the element count and the std::chrono timing are my assumptions, not the original listing.
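    #include <chrono>
    #include <iostream>

    // Big and Small as defined above. The element count is a guess;
    // anything large enough to blow past the caches will do.
    constexpr int kCount = 100000;

    int main() {
        Big*   bigs   = new Big[kCount];    // ~100MB of Big objects
        Small* smalls = new Small[kCount];  // ~400KB of Small objects

        // Print the array addresses so we can inspect them in lldb later.
        std::cout << "bigs:   " << bigs   << std::endl;
        std::cout << "smalls: " << smalls << std::endl;

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < kCount; ++i)
            bigs[i].value = i;    // every write lands in a different cache line
        auto t1 = std::chrono::steady_clock::now();

        for (int i = 0; i < kCount; ++i)
            smalls[i].value = i;  // sixteen writes share one 64-byte cache line
        auto t2 = std::chrono::steady_clock::now();

        using std::chrono::duration_cast;
        using std::chrono::microseconds;
        std::cout << "bigs loop:   "
                  << duration_cast<microseconds>(t1 - t0).count() << " us" << std::endl;
        std::cout << "smalls loop: "
                  << duration_cast<microseconds>(t2 - t1).count() << " us" << std::endl;

        return 0;  // a good spot for a breakpoint: both arrays are still alive
    }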

On my machine, the second loop completes roughly seven times faster than the first on average. The work is identical in both: setting an int field of an object. Why the difference? Because the clutter in the Big objects causes the CPU to miss the cached data – the L1 and L2 caches fill up with bytes we never touch.

Let’s examine how our data is laid out in actual memory using lldb. You could also use gdb; the two are very similar.

Compile the code above using clang++ with the -g flag, which adds the extra information debugging needs. Here is a simple compile command, assuming the source file is named main.cpp:
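    clang++ -g main.cpp -o main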

Load the program into lldb using
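    lldb ./main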

and then set a simple breakpoint to pause the process without terminating it. Putting it on the return statement at the end of main means it is hit after both loops have run, while the arrays are still alive:
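    (lldb) breakpoint set --file main.cpp --line <line of the return statement>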

Run the program.
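    (lldb) run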

At some point lldb will print the addresses of our two arrays, since that is our program's output. Using those addresses we can examine memory and see what it holds. The addresses will almost certainly differ on your computer.

Let's examine the first 32 elements of the smalls array using
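    (lldb) memory read --size 4 --format x --count 32 <address of smalls>

On my machine the output looks roughly like this (the addresses here are illustrative, not real ones):

    0x100200000: 0x00000000 0x00000001 0x00000002 0x00000003
    0x100200010: 0x00000004 0x00000005 0x00000006 0x00000007
    0x100200020: 0x00000008 0x00000009 0x0000000a 0x0000000b
    ...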

Look at how nicely the elements are stored on the memory. The first element is 0, the second is 1, the third is 2 and so on, as expected.
Now let’s examine the first 32 words of the bigs array the same way. Since each element is 1KB, these 32 words all fall inside the very first element:
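    (lldb) memory read --size 4 --format x --count 32 <address of bigs>

This time only the very first word (bigs[0].value, which we set to 0) is our data; the clutter bytes hold whatever happened to be in that memory. Illustrative output, again with made-up addresses and junk values:

    0x101000000: 0x00000000 0x7f8b2c01 0x00000000 0x5b3a9e44
    0x101000010: 0x00007fff 0xcafe0300 0x00000000 0x00000001
    0x101000020: 0x00000000 0x00000000 0x8fe85a60 0x00007fff
    ...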


See? Apart from that single value at the start, it’s full of unnecessary data, and that is exactly what causes the CPU to miss the cache.

Conclusion

It has been a very long time since the hype around data-oriented design began, and I have been programming in a data-oriented way for almost a year now. At first (like most things I encounter for the first time) I thought it was the holy grail that would solve each and every problem. Now I think there is a place for both OOD and DOD (and, in that sense, other programming paradigms) in the same piece of software; both have their strengths and weaknesses. Until memory speed catches up with CPU speed, though, using DOD is a very good idea.

You can find the source files in this file.

If you have any suggestions or amendments you think we ought to consider for this blog post, please send me a message at guchan@gram.gs.