Cache-aware Matrix Multiplication – Naive isn’t that bad!

This article has been published at Function Space.

(This is just my backup copy; do check it out on FS for a better experience.)

Morpheus said the Matrix is everywhere. He’s not very far from the truth! Everything that we see can be considered a matrix, because an image is nothing but a matrix!

And one of the most frequently used and computationally intensive matrix operations is matrix multiplication.

For the purpose of this article, let’s stick with two square matrices A and B, each of order n, and take that as our running example.

Volker Strassen came up with a more efficient matrix multiplication algorithm in 1969, which runs in O(n^(log2 7)) ≈ O(n^2.807) time. Since then, many improvements have been made in this area, and the current best runs in O(n^2.3727) time. In fact, what the fastest possible algorithm for matrix multiplication is remains an open problem in computer science.

But this article won’t go into details of such algorithms. We won’t be optimizing matrix multiplication by bringing down its asymptotic complexity. Instead, we will stick to this naive O(n^3) algorithm and optimize it for our machines. You’d be surprised to see the results some simple tweaks can bring and how speedy this seemingly slow algorithm can become! So let’s dive in!

We will start by understanding what a cache is and which of its properties can help our code run faster!

A cache is a fast memory, usually much faster than main memory because of its hardware design and retrieval characteristics. It is used to store copies of data from main memory locations for faster access. But then why not use the cache for each and every data item? Because it is quite expensive, and hence usually much smaller than main memory. So what we have is a small, fast memory (the cache) and a bigger but slower memory (main memory). To reduce the overall memory access time, we want to exploit the cache to the fullest so that as many memory accesses as possible are satisfied by it. To do so, we want to keep the frequently used main memory locations in the cache. For that, the cache is built in such a way that it exploits what is called the principle of locality.

The principle of locality takes two forms:

  1. Temporal locality: if a memory location is accessed now, it is likely to be accessed again in the near future.
  2. Spatial locality: if a memory location is accessed now, its neighboring memory locations are likely to be accessed in the near future.

Intuitively, we can see temporal locality exhibited in most of our code: for example, a loop counter is accessed in each iteration of the loop, and an instruction inside a loop gets executed in every iteration.

To exploit temporal locality, when a memory item that is not already present in the cache is accessed, it is brought in. This is done in the hope that it will be accessed again, in which case it makes sense not to waste time reading it from slow main memory once more. Of course, the cache is not unlimited, so bringing in a new data item means getting rid of an old one. How exactly this replacement is done is an interesting topic in its own right (covered in operating systems and computer architecture texts), not to be covered in this particular article.

Examples of spatial locality are also easy to see: most of our code runs in a sequential manner, executing one instruction after another, and the program code is usually stored contiguously. In fact, this is one of the reasons why using “goto” statements is a discouraged programming practice: a jump doesn’t necessarily follow spatial locality, so it doesn’t exploit the cache and can cause expensive main memory reads. Traversing an array (which is stored sequentially) also exhibits spatial locality.

So how exactly does the cache use the principle of spatial locality? The cache is divided into chunks of memory called “blocks” (also known as cache lines). A block holds a fixed number of adjacent memory locations. When data from a memory location needs to be brought into the cache, the whole block containing that location is brought in along with it, anticipating that its neighbors will be used in the near future.
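To make this concrete, here is a tiny C sketch (illustrative only) contrasting a traversal that benefits from block fetches with one that defeats them. The numbers in the comments assume 64-byte blocks and 8-byte doubles, which is typical but by no means universal:

#include <stddef.h>

#define N (1 << 20)   /* 1M doubles = 8 MB, larger than most caches */
static double a[N];

/* Sequential traversal: consecutive elements share a block, so with
 * 64-byte blocks and 8-byte doubles only every 8th access misses. */
double sum_sequential(void)
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        s += a[i];
    return s;
}

/* Strided traversal: the same elements are summed, but jumping 8
 * doubles (one block) at a time lands on a new block on every access,
 * so nearly every access misses. */
double sum_strided(void)
{
    double s = 0.0;
    for (size_t stride = 0; stride < 8; stride++)
        for (size_t i = stride; i < N; i += 8)
            s += a[i];
    return s;
}

Both functions perform exactly the same arithmetic, yet on a real machine the strided version is typically noticeably slower, purely because of how it uses the cache.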

Also, a bit of terminology:

When an access to a memory location can be satisfied by the cache, it is called a cache hit [the desirable case]. If that memory location is not present in the cache and has to be brought in from main memory, it is called a cache miss [the undesirable case].

That summarizes the little we need to know about caches to proceed with our matrix multiplication example. Caches are in fact quite complex, and more can be found about them in operating systems and computer architecture literature.

Let’s see how we can make use of these cache properties to speed up our matrix multiplication program.

Assume that the 2-D array (matrix) is stored in a row-major order (as in C) :

a[0][0], a[0][1], … , a[0][n-1], a[1][0], a[1][1], … , a[1][n-1], … , a[n-1][n-1]

Hence matrices A, B and C are stored in memory as linear arrays.
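To make the layout concrete, the element a[i][j] lives at linear offset i·n + j. A small illustrative helper (the name get is just for this sketch):

#include <stddef.h>

/* Row-major mapping: elements of a row sit next to each other in
 * memory, while elements of a column are n slots apart. */
double get(const double *m, size_t n, size_t i, size_t j)
{
    return m[i * n + j];   /* the same element as a[i][j] in a 2-D array */
}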

First, look at the naive matrix multiplication code:

n = 4096;
/* assume C has been zero-initialized before this loop */
for (i = 0; i < n; i++)
{
    for (j = 0; j < n; j++)
    {
        for (k = 0; k < n; k++)
            C[i][j] += A[i][k] * B[k][j];
    }
}

Note: n should be large enough that the cache is exhausted and the effect of cache misses can be observed.

Assume each array element is stored in one memory location.

There are 3 array elements, and thus 3 memory locations, accessed in each iteration of the innermost loop:

C[i][j], A[i][k] & B[k][j]

The accesses into array C are made in the following order:

C[0][0] is accessed n times – for the first execution of the k loop.

C[0][1] is accessed n times – for the second execution of the k loop.

C[0][2] is accessed n times – for the third execution of the k loop.

… …

C[1][0] is accessed n times – for the (n+1)th execution of the k loop.

… …

C[n-1][n-1] is accessed n times – for the (n∗n)th execution of the k loop.

Hence, each element of array C is accessed n times consecutively and then never referenced again. Both temporal and spatial locality of reference are observed (the same element is accessed repeatedly, and successive elements are accessed sequentially). So for array C, accesses are already cache-friendly!

The accesses into array A are made in the following order:

A[0][0], A[0][1], A[0][2], … , A[0][n-1] – this whole sequence is repeated n times (once for each iteration of the j loop, with k running over the row).

A[1][0], A[1][1], A[1][2], … , A[1][n-1] – this whole sequence is repeated n times.

… …

A[n-1][0], A[n-1][1], A[n-1][2], … , A[n-1][n-1] – this whole sequence is repeated n times.

Hence, each row of array A is traversed n times, and the elements within a row are accessed sequentially. We observe good spatial locality of reference. But the same element is not referenced repeatedly in succession and may be replaced in the cache before it is accessed again. If a row of array A can be stored in the cache completely (without replacement), then accesses to array A will also be cache-friendly.
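To put a rough number on that condition: assuming 8-byte elements, one row of n = 4096 values occupies 4096 × 8 B = 32 KB, which is roughly the size of a typical L1 data cache. So whether an entire row of A stays resident between its reuses is a close call and depends on the machine.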

The accesses into array B are made in the following order:

B[0][0], B[1][0], B[2][0], … , B[n-1][0]

B[0][1], B[1][1], B[2][1], … , B[n-1][1]

… …

B[0][n-1], B[1][n-1], B[2][n-1], … , B[n-1][n-1]

Then again,

B[0][0], B[1][0], B[2][0], … , B[n-1][0]

B[0][1], B[1][1], B[2][1], … , B[n-1][1]

… …

B[0][n-1], B[1][n-1], B[2][n-1], … , B[n-1][n-1]

And this whole sequence is repeated n times (once for each iteration of the i loop). Elements of array B are accessed column-wise, and the whole array is swept through once before the pattern repeats. So B exhibits neither spatial nor temporal locality. Only if the whole of array B fits in the cache will there be no misses. If the cache can only hold a few rows, each element of B will have been evicted by the time it is accessed again, and thus we incur a cache miss on essentially every execution of the multiplication statement.
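To get a rough feel for the scale: assuming 8-byte elements and 64-byte cache lines (8 elements per line), the column-wise sweep over B misses on essentially every one of the n^3 accesses, whereas a row-wise sweep would miss only about once every 8 accesses. For n = 4096 that is on the order of 6.9 × 10^10 misses from B alone in the i−j−k ordering.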

On an old machine in our Lab, this code takes about 51 minutes!

To improve upon this traditional i−j−k ordering, we can employ loop interchange. Trying out the other 5 orderings, namely i−k−j, j−k−i, j−i−k, k−i−j and k−j−i, we find that i−k−j is the best. Take a look:

n = 4096;
/* assume C has been zero-initialized before this loop */
for (i = 0; i < n; i++)
{
    for (k = 0; k < n; k++)
    {
        for (j = 0; j < n; j++)
            C[i][j] += A[i][k] * B[k][j];
    }
}

The accesses into array C are made in the following order:

C[0][0], C[0][1], C[0][2], … , C[0][n-1] – this whole sequence is repeated n times (once for each iteration of the k loop).

C[1][0], C[1][1], C[1][2], … , C[1][n-1] – this whole sequence is repeated n times.

… …

C[n-1][0], C[n-1][1], C[n-1][2], … , C[n-1][n-1] – this whole sequence is repeated n times.

C has good spatial locality and if a row can be completely stored in cache, then it has good temporal locality too.

The accesses into array A are made in the following order:

A[0][0] is accessed n times – for the first execution of the j loop.

A[0][1] is accessed n times – for the second execution of the j loop.

A[0][2] is accessed n times – for the third execution of the j loop.

… …

A[1][0] is accessed n times – for the (n+1)th execution of the j loop.

… …

A[n-1][n-1] is accessed n times – for the (n∗n)th execution of the j loop.

Just like array C in the traditional i−j−k loop, here array A enjoys good temporal as well as spatial locality.

The accesses into array B are made in the following order:

B[0][0], B[0][1], B[0][2], … , B[0][n-1]

B[1][0], B[1][1], B[1][2], … , B[1][n-1]

… …

B[n-1][0], B[n-1][1], B[n-1][2], … , B[n-1][n-1]

All of this is then repeated n times (once per iteration of the i loop). So again, temporal locality will not be observed unless we can hold the whole array in the cache, but unlike in the previous case, we now access elements row-wise and make use of spatial locality of reference. This reduces cache misses drastically, as we no longer miss on every single element of the array.

As expected, this approach fares better and takes around 14 minutes on our old system. Try implementing the same, and increase the value of n if you want to appreciate this little trick even more!
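If you’d like to reproduce the comparison, something along the following lines should work. It is only a sketch: the matrices are heap-allocated (large double matrices are far too big for the stack), linear indexing is used to match the row-major layout discussed earlier, N is kept at 1024 so a run finishes quickly (push it towards 4096 for the full effect), and the absolute timings will of course differ from the ones quoted above. Compile with something like gcc -O2 matmul.c -o matmul.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1024   /* increase towards 4096 to see the gap grow */

static double *A, *B, *C;

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void multiply_ijk(void)   /* the naive ordering */
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}

static void multiply_ikj(void)   /* the cache-friendly ordering */
{
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}

int main(void)
{
    A = malloc(sizeof(double) * N * N);
    B = malloc(sizeof(double) * N * N);
    C = malloc(sizeof(double) * N * N);
    if (!A || !B || !C) { perror("malloc"); return 1; }

    for (long i = 0; i < (long)N * N; i++) {
        A[i] = (double)rand() / RAND_MAX;
        B[i] = (double)rand() / RAND_MAX;
        C[i] = 0.0;
    }

    double t0 = now_sec();
    multiply_ijk();
    printf("i-j-k: %.2f s\n", now_sec() - t0);

    for (long i = 0; i < (long)N * N; i++) C[i] = 0.0;
    t0 = now_sec();
    multiply_ikj();
    printf("i-k-j: %.2f s\n", now_sec() - t0);

    free(A); free(B); free(C);
    return 0;
}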

So what we found here is that two algorithms with the same asymptotic complexity perform very differently because of architecture-aware (here, cache-aware) programming, or the lack of it.

You can carry out the analysis of the other orderings yourself. And if you like, you can also wonder about storing the transpose of matrix B instead of B itself, and how that can help in reducing cache misses!

[This effectively turns its row-major storage into column-major storage, but it requires changing the code a little to keep the result correct.]
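For reference, here is roughly how that variant might look (again just a sketch, using the same linear row-major layout as in the benchmark above; Bt is an extra n×n scratch buffer):

/* Multiply using a transposed copy of B, so that the innermost loop
 * walks both A and Bt along rows (good spatial locality for both).
 * The transpose costs O(n^2), negligible next to the O(n^3) product. */
static void multiply_transposed(const double *A, const double *B,
                                double *C, double *Bt, int n)
{
    /* Build the transpose: Bt[j][k] = B[k][j] */
    for (int k = 0; k < n; k++)
        for (int j = 0; j < n; j++)
            Bt[j * n + k] = B[k * n + j];

    /* C[i][j] = sum over k of A[i][k] * Bt[j][k] */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * Bt[j * n + k];
            C[i * n + j] += sum;
        }
}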

Implement and please do share your findings! ^_^

Other cache-specific optimizations, such as transposition and blocking (tiling), can further speed up our code. We will take them up in the next articles. Stay tuned!
