Optimizing Matrix-Matrix Multiplication

Deanna Rogers

ECS 289K
Project A

Fall 2003

1 Introduction

Several techniques exist for optimizing matrix-matrix multiplication. Some common techniques focus on fully utilizing the memory hierarchy [3] and speeding up the innermost loop [2]. [1] presents several other techniques. We implemented several of these including blocking, copy optimization, loop unrolling, avoiding magnitude compares, and several others.

All tests described herein were done on an IBM Thinkpad R31 with a 1 GHz Intel Mobile Pentium III. It also has a 32K L1 cache, 16K for instructions and 16K for data, and a 512K L2 cache. Both caches are 4-way set associative.

2 Memory Hierarchy Optimizations

2.1 Blocking

Blocking is a common divide-and-conquer technique for using the memory hierachy effectively. Since the cache may only be large enough to hold a small piece of one matrix, the data has already been kicked out of the cache before it is reused. The processor will thus continually be forced to access slower levels of memory, decreasing the algorithm's performance. With blocking, however, each matrix is divided into blocks of smaller matrices, and the algorithm multiplies two submatrices, storing their product before moving on to the next two submatrices. This better exploits cache locality so that data in the cache can be reused before being replaced.

The level of performance benefit from blocking depends on the size of the blocks and the cache sizes. We used the provided implementation of blocking to get an idea of how block size (B) affects performance. The graph below illustrates the performance of several different block sizes on the Thinkpad.

Performance of various block sizes

Notice that larger block sizes are not always the best choice. Once the block size reaches a certain point, performance begins to decrease as block size increases. There are several reasons for this. First, interference (sometimes also called conflict) cache misses are occurring. These types of cache misses can occur when multiple rows of a matrix are mapped to the same lines in the cache, making it impossible to fully utilize the cache [3]. Secondly, as blocks get too large to fit into faster memory, the processor is again forced to access slower levels of the memory hierarchy when reusing data.

Although initially not the best, we chose to work with a block size of 36 because it turned out to be faster after we implemented copy optimization (see section 2.2). We believe this is because an entire block can fit into the L1 cache, which can be accessed the faster than the other levels of the memory hierarchy.

2.2 Copy Optimization

Copy optimization can help decrease the number of conflict cache misses. As mentioned above, conflict cache misses occur when multiple data items are mapped to the same location in the cache. With blocking, this means that cached data may be prematurely kicked out of the cache. Furthermore, as mentioned in [3], conflict misses can cause severe performance degradation when an array is accessed with a constant, non-unit stride. In the provided matrix-matrix multiplication implementation, the matrix A is accessed in this way (specifically, we access A in strides of 'lda'). This is a result of the way the matrix is stored. The matrices are stored in a one-dimensional array, with the first column occupying the first M entries in the array, where the matrix is MxM. The second column is stored in the next M entries, and so on. Thus, consecutive elements in a matrix row are M entries apart in the array, and our matrix multiplication routine is forced to access the matrix A in an M-unit stride. In order to improve upon this, we re-order the matrix A so that row elements are stored in consecutive entries in the array (i.e., the first row is stored in the first M entries of the array, the second row is stored in the next M entries of the array, and so on). This re-ordering is sometimes also called copy optimization. Now, both A and B are accessed in unit-strides.

Another advantage to accessing the matrices in unit strides is that it utilizes the cache more fully. When a requested data item results in a cache miss, not only is the requested item loaded into the cache, but also all adjacent items that can fit on the same cache line. By accessing the array in a unit-stride, we take better advantage of this feature.

Conflict cache misses are also responsible for large dips in the performance of the basic (non-blocked) dgemm routine when the matrix size is 256 or 512 (see the graph below). In these cases, we are accessing memory in strides of large powers of two, significantly reducing the possible cache locations for a matrix row. As Bindel notes in [2], it is as if the cache has been reduced from its actual size to only a few entries.

No optimizations

3 Inner Loop Optimizations

As pointed out in [2], optimizations should focus on the places in the code where the most time is spent. In our matrix-matrix multiplication implementation, this is the innermost loop.

3.1 Minimized Branches and Avoidance of Magnitude Compares

According to [1], C 'do-while' loops tend to perform better than C 'for' loops because compilers tend to produce unnecessary loop head branches in 'for' loops. Furthermore, also noted in [1], it is often cheaper to do equality or inequality tests in loop conditions than magnitude comparision tests. Thus, we translated the innermost 'for' loop into a 'do-while' loop, and used pointer inequality rather than magnitude comparision to test for loop termination. The code below exemplifies this technique.

The original code that looks something like this:

for (k = 0; k < BLOCK_SIZE; k++) { ... }

is translated into something like this:

end = &B_j[BLOCK_SIZE];
do {
...
}
while (B_j != end);

3.2 Explicit Loop Unrolling

Although the compiler option '-funroll-all-loops' is used in the Makefile provided, we decided to see if unrolling the innermost loop by hand would improve upon the compiler's optimization. According to [1], explicitly unrolling loops can increase opportunities for other optimizations. The graph below shows the performance of the matrix-matrix multiply routine with the innermost loop manually unrolled 2, 3, 4, and 6 times.

Inner loop unrolled 0, 2, 3, 4, and 6 times

Since all other amounts resulted in a decrease in performance, our final optimization set includes explicitly unrolling the innermost loop twice.

4 Other Optimizations

In addition to those described above, we also used some compiler optimization flags and utilized the fact that for the majority of the time, we know exactly how many iterations we'll need. The following compiler optimization options were used (all except -dalign were used in the provided Makefile):

Most of the time, we know exactly how many loop iterations we'll need at compile time- we'll need B iterations, where B is the block size. There are two cases: we may be operating on a full BxB block, in which case we know we'll need exactly B loop iterations, or we may be operating on a smaller block. The latter situation arises when the matrix itself is smaller than BxB or when the matrices block unevenly, and we are left with smaller "fringe" blocks. We treat the two cases differently: the first case (we are operating on a full BxB block) is handled by code with a static loop size (B), which is known at compile time. The second situation (we are processing a block smaller than BxB) is still handled by the original, general loop code. Knowing the number of loop iterations at compile time allows the compiler to optimize more efficiently, and thus speeds up the code. The performance gain is illustrated in the graph below.

Static loop size vs. Naive blocking

5 Final Results

Our final results are displayed in the graph below.

All optimizations

Since most of our optimizations focused on the full, square BxB blocks, our optimizations have little positive effect on small matrices (less than BxB). For fringe blocks and small matrices, we use the general algorithm, which is not as fast. As the size of the matrices grows, the time spent processing square submatrices dominates the time spent processing fringe blocks, and our performance graph becomes smoother.

6 References

[1] Jeff Bilmes, Krste Asanovic, Chee-Whye Chin, and Jim Demmel, The PHiPAC v1.0 Matrix-Multiply Distribution, International Computer Science Institute, Berkeley, CA, October 1998.

[2] David Bindel and David Garmire, "Matrix Multiply Optimizations," http://www.nersc.gov/~simon/cs267/hw1note/optimization.pdf,September 2002.

[3] Monica S. Lam, Edward E. Rothberg, and Michael E. Wolf, "The Cache Performance and Optimizations of Blocked Algorithms," Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Palo Alto, CA, April 1991.