1 Introduction
Several techniques exist for optimizing matrix-matrix multiplication.
Some common techniques focus on fully utilizing the memory hierarchy
[3] and speeding up the innermost loop [2]. [1] presents
several other techniques. We implemented several of these including
blocking, copy optimization, loop unrolling, avoiding magnitude
compares, and several others.
All tests described herein were done on an IBM Thinkpad R31 with a
1 GHz Intel Mobile Pentium III. It also has a 32K L1 cache, 16K for
instructions and 16K for data, and a 512K L2 cache. Both caches are
4-way set associative.
2 Memory Hierarchy Optimizations
2.1 Blocking
Blocking is a common divide-and-conquer technique for using the
memory hierachy effectively. Since the cache may only be large enough
to hold a small piece of one matrix, the data has already been kicked
out of the cache before it is reused. The processor will thus
continually be forced to access slower levels of memory, decreasing
the algorithm's performance. With blocking, however, each matrix is
divided into blocks of smaller matrices, and the algorithm multiplies
two submatrices, storing their product before moving on to the next
two submatrices. This better exploits cache locality so that data in
the cache can be reused before being replaced.
The level of performance benefit from blocking depends on the size of
the blocks and the cache sizes. We used the provided implementation
of blocking to get an idea of how block size (B) affects performance.
The graph below illustrates the performance of several different block
sizes on the Thinkpad.
Notice that larger block sizes are not always the best choice. Once the block size reaches a certain point, performance begins to decrease as block size increases. There are several reasons for this. First, interference (sometimes also called conflict) cache misses are occurring. These types of cache misses can occur when multiple rows of a matrix are mapped to the same lines in the cache, making it impossible to fully utilize the cache [3]. Secondly, as blocks get too large to fit into faster memory, the processor is again forced to access slower levels of the memory hierarchy when reusing data.
Although initially not the best, we chose to work with a block size of 36 because it turned out to be faster after we implemented copy optimization (see section 2.2). We believe this is because an entire block can fit into the L1 cache, which can be accessed the faster than the other levels of the memory hierarchy.
2.2 Copy Optimization
Copy optimization can help decrease the number of conflict cache misses. As mentioned above, conflict cache misses occur when multiple data items are mapped to the same location in the cache. With blocking, this means that cached data may be prematurely kicked out of the cache. Furthermore, as mentioned in [3], conflict misses can cause severe performance degradation when an array is accessed with a constant, non-unit stride. In the provided matrix-matrix multiplication implementation, the matrix A is accessed in this way (specifically, we access A in strides of 'lda'). This is a result of the way the matrix is stored. The matrices are stored in a one-dimensional array, with the first column occupying the first M entries in the array, where the matrix is MxM. The second column is stored in the next M entries, and so on. Thus, consecutive elements in a matrix row are M entries apart in the array, and our matrix multiplication routine is forced to access the matrix A in an M-unit stride. In order to improve upon this, we re-order the matrix A so that row elements are stored in consecutive entries in the array (i.e., the first row is stored in the first M entries of the array, the second row is stored in the next M entries of the array, and so on). This re-ordering is sometimes also called copy optimization. Now, both A and B are accessed in unit-strides.
Another advantage to accessing the matrices in unit strides is that it utilizes the cache more fully. When a requested data item results in a cache miss, not only is the requested item loaded into the cache, but also all adjacent items that can fit on the same cache line. By accessing the array in a unit-stride, we take better advantage of this feature.
Conflict cache misses are also responsible for large dips in the performance of the basic (non-blocked) dgemm routine when the matrix size is 256 or 512 (see the graph below). In these cases, we are accessing memory in strides of large powers of two, significantly reducing the possible cache locations for a matrix row. As Bindel notes in [2], it is as if the cache has been reduced from its actual size to only a few entries.
3 Inner Loop Optimizations
As pointed out in [2], optimizations should focus on the places
in the code where the most time is spent. In our matrix-matrix
multiplication implementation, this is the innermost loop.
3.1 Minimized Branches and Avoidance of Magnitude Compares
According to [1], C 'do-while' loops tend to perform better than
C 'for' loops because compilers tend to produce unnecessary loop head
branches in 'for' loops. Furthermore, also noted in [1], it is
often cheaper to do equality or inequality tests in loop conditions
than magnitude comparision tests. Thus, we translated the innermost
'for' loop into a 'do-while' loop, and used pointer inequality rather
than magnitude comparision to test for loop termination. The code
below exemplifies this technique.
The original code that looks something like this:
for (k = 0; k < BLOCK_SIZE; k++) { ... }
is translated into something like this:
end = &B_j[BLOCK_SIZE];
3.2 Explicit Loop Unrolling
Although the compiler option '-funroll-all-loops' is used in the
Makefile provided, we decided to see if unrolling the innermost loop
by hand would improve upon the compiler's optimization. According to
[1], explicitly unrolling loops can increase opportunities for
other optimizations. The graph below shows the performance of the
matrix-matrix multiply routine with the innermost loop manually
unrolled 2, 3, 4, and 6 times.
do {
...
}
while (B_j != end);
Since all other amounts resulted in a decrease in performance, our final optimization set includes explicitly unrolling the innermost loop twice.
4 Other Optimizations
In addition to those described above, we also used some compiler
optimization flags and utilized the fact that for the majority of the
time, we know exactly how many iterations we'll need. The following
compiler optimization options were used (all except -dalign were used
in the provided Makefile):
Most of the time, we know exactly how many loop iterations we'll need
at compile time- we'll need B iterations, where B is the block size.
There are two cases: we may be operating on a full BxB block, in which
case we know we'll need exactly B loop iterations, or we may be
operating on a smaller block. The latter situation arises when the
matrix itself is smaller than BxB or when the matrices block unevenly,
and we are left with smaller "fringe" blocks. We treat the two cases
differently: the first case (we are operating on a full BxB block) is
handled by code with a static loop size (B), which is known at compile
time. The second situation (we are processing a block smaller than
BxB) is still handled by the original, general loop code. Knowing the
number of loop iterations at compile time allows the compiler to
optimize more efficiently, and thus speeds up the code. The
performance gain is illustrated in the graph below.
5 Final Results
Our final results are displayed in the graph below.
Since most of our optimizations focused on the full, square BxB blocks, our optimizations have little positive effect on small matrices (less than BxB). For fringe blocks and small matrices, we use the general algorithm, which is not as fast. As the size of the matrices grows, the time spent processing square submatrices dominates the time spent processing fringe blocks, and our performance graph becomes smoother.
6 References
[1] Jeff Bilmes, Krste Asanovic, Chee-Whye Chin, and Jim Demmel,
The PHiPAC v1.0 Matrix-Multiply Distribution, International
Computer Science Institute, Berkeley, CA, October 1998.
[2] David Bindel and David Garmire, "Matrix Multiply
Optimizations,"
http://www.nersc.gov/~simon/cs267/hw1note/optimization.pdf,September
2002.
[3] Monica S. Lam, Edward E. Rothberg, and Michael E. Wolf, "The
Cache Performance and Optimizations of Blocked Algorithms,"
Proceedings of the Fourth International Conference on Architectural
Support for Programming Languages and Operating Systems, Palo
Alto, CA, April 1991.