Does Your Loop Order Matter?

Introduction

Imagine that you work in a research institution as a programmer and you are tasked with implementing a matrix multiplication function. You refer to the mathematical expression and find that, given that $A$ is an $M \times N$ matrix and $B$ is an $N \times P$ matrix, the multiplication can be expressed as in equation $(1)$. Obviously, when the matrices involved are large, the nested loops are going to take a while to execute.

\[ \mathbf{A} = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\ a_{21} & a_{22} & \cdots & a_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{M1} & a_{M2} & \cdots & a_{MN} \\ \end{pmatrix} \] \[ \mathbf{B}=\begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1P} \\ b_{21} & b_{22} & \cdots & b_{2P} \\ \vdots & \vdots & \ddots & \vdots \\ b_{N1} & b_{N2} & \cdots & b_{NP} \\ \end{pmatrix} \] \[ \tag{1} (\mathbf{A} \times \mathbf{B})_{mp} = \sum_{n=1}^{N}{a_{mn}b_{np}}, \quad m = 1, \ldots, M, \quad p = 1, \ldots, P \]

Equation $(1)$ can be expressed programmatically as

// M is the number of rows in A
// N is the number of columns in A and the number of rows in B
// P is the number of columns in B
// C is a two-dimensional array, assumed zero-initialized, that
// receives the result of the matrix multiplication

for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        for (int p = 0; p < P; p++) {
            C[m][p] += A[m][n] * B[n][p];
        }
    }
}

The question we seek to answer is, in which order should we perform our loops? Since we have $3$ items to loop over, the number of loop orders is $_3P_3=\frac{3!}{(3-3)!}=3!=6$. The permutations are listed below: \[ \begin{array}{l} (M, N, P) \\ (M, P, N) \\ (N, M, P) \\ (N, P, M) \\ (P, M, N) \\ (P, N, M) \end{array} \] In the following sections we are going to run an experiment to determine which order yields the best results and then analyze why.
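To make the difference concrete, here is the PNM ordering: the loop body is unchanged and only the nesting order differs, so the result is identical up to floating point rounding.

// PNM ordering: same body as before, only the nesting changes
for (int p = 0; p < P; p++) {
    for (int n = 0; n < N; n++) {
        for (int m = 0; m < M; m++) {
            C[m][p] += A[m][n] * B[n][p];
        }
    }
}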

Experiment

The experiment was implemented in the C programming language because of its simplicity and portability. The experiment is set up to run the different implementations of matrix multiplication on the same input, and the time each one takes is recorded. The number of floating point operations involved in calculating the matrix multiplication $A \times B$ is $M \times P \times (2N - 1)$, where $M$ is the number of rows in $A$, $N$ is the number of columns in $A$ and $P$ is the number of columns in $B$: each of the $M \times P$ output elements requires $N$ multiplications and $N - 1$ additions. We calculated the floating point operations per second (FLOPS) as $\frac{\text{number of floating point operations}}{\text{duration}}$. We report giga FLOPS (GFLOPS) in our experiment, so we divided the FLOPS value by $10^9$.
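As a minimal sketch of this measurement (the helper names and the use of CLOCK_MONOTONIC timestamps are our own assumptions, not necessarily the original implementation):

#include <stddef.h>
#include <time.h>

// Floating point operations for an (M x N) by (N x P) product: each of
// the M*P output elements needs N multiplications and N-1 additions.
double matmul_gflops(size_t M, size_t N, size_t P, double seconds)
{
    double flop = (double)M * (double)P * (2.0 * (double)N - 1.0);
    return flop / seconds / 1e9;  // FLOPS -> GFLOPS
}

// Wall-clock duration in seconds between two CLOCK_MONOTONIC timestamps.
double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}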

The source was compiled with the -O0 flag to disable compiler optimization. We set $A$ and $B$ to be $1000 \times 5000$ and $5000 \times 1000$ matrices respectively. They are initialized to random values. That meant that each loop order had to perform $1000 \times 1000 \times (2 \times 5000 - 1) = 9999000000$ floating point operations. Although the experiment was conducted on a multicore server, we decided to limit the computation to one CPU.
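One common way to pin a process to a single CPU on Linux is sched_setaffinity; the sketch below is illustrative, and the original experiment may have used a different mechanism such as the taskset utility.

#define _GNU_SOURCE
#include <sched.h>

// Restrict the calling process to CPU 0 so every run uses the same core.
static void pin_to_cpu0(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    sched_setaffinity(0, sizeof(set), &set);  // pid 0 = calling process
}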

Source Code

The source code can be found here on GitHub.

Results

We observed that MPN gave the worst performance while PNM and NPM gave the best. This result was very consistent regardless of the size of the matrices involved, unless they were too small for any performance difference to be noticeable.

Loop Order    GFLOPS
MNP           46.617
MPN           38.328
NMP           45.876
NPM           59.504
PMN           39.004
PNM           59.710

Analysis

The superior performance of PNM and NPM is a result of how the matrix is stored in and retrieved from memory. The central processing unit (CPU) is much faster than most of the peripherals attached to it, so if it had to retrieve each piece of data directly from them it would underperform. Therefore a caching system is built around it to preload data and make it available when the CPU requests it. This is the cache hierarchy.

At the top of the hierarchy are registers. Registers are the fastest and exist in the CPU core; this is where data is stored immediately before it is operated on by the processor. Between the registers and the main memory is a series of caches called the Level 1 (L1), Level 2 (L2) and Level 3 (L3) caches. The L1 cache has a larger capacity than the registers and is usually present within the CPU core. It is usually made up of Static Random Access Memory (SRAM), and it is typically faster and smaller in capacity than the L2 and L3 caches. The L2 cache is larger than L1 but smaller than L3, and is also usually located within the CPU core. L3 is usually located inside the CPU package and may be shared between multiple cores. It has a lower storage capacity than the main memory.

Figure 1. A diagram of the cache hierarchy

During program execution these caches allow data to be stored in fast memory before it is processed. Data may be cached or evicted based on temporal or spatial locality. Temporal locality suggests that when data is used by the CPU it is very likely to be used again soon, so it should be cached. Spatial locality suggests that when data at some address is accessed, the data at the neighbouring addresses is very likely to be used next. When the data needed is not present in the cache, it takes extra cycles to retrieve it from the main memory. Software that takes advantage of the cache hierarchy achieves higher performance than software that does not.
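A small illustration of spatial locality (the functions and the stride below are our own illustrative examples, not part of the experiment): a sequential traversal touches every element of each cache line it loads, while a traversal with a large stride pulls in a new cache line for almost every access.

#include <stddef.h>

// Sequential traversal: consecutive elements share cache lines,
// so most accesses are cache hits.
double sum_sequential(const float *a, size_t len)
{
    double sum = 0.0;
    for (size_t i = 0; i < len; i++)
        sum += a[i];
    return sum;
}

// Strided traversal of the same data: once the stride exceeds a cache
// line, nearly every access misses and stalls on slower memory.
double sum_strided(const float *a, size_t len, size_t stride)
{
    double sum = 0.0;
    for (size_t start = 0; start < stride; start++)
        for (size_t i = start; i < len; i += stride)
            sum += a[i];
    return sum;
}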

In our case we observe that, although the matrices involved are $2$-dimensional in nature, the data is stored in memory as a linear array. This is true in most cases because memory is logically a series of addressable storage cells. Therefore, in order to implement a matrix in this setting, we have to decide how to store rows and columns. This leads us to two popular arrangements, row major and column major. Given the $4 \times 4$ matrix below,

\[ \begin{pmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \\ \end{pmatrix} \]

a row major representation in memory will look like this \[ [a_{11},a_{12},a_{13},a_{14}, a_{21},a_{22},a_{23},a_{24}, a_{31},a_{32},a_{33},a_{34}, a_{41},a_{42},a_{43},a_{44}] \] while the column major will be \[ [a_{11},a_{21},a_{31},a_{41}, a_{12},a_{22},a_{32},a_{42}, a_{13},a_{23},a_{33},a_{43}, a_{14},a_{24},a_{34},a_{44}] \]
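The two layouts correspond to two different index calculations for the element at a zero-indexed (row, col) position; the helper names below are our own, for illustration.

#include <stddef.h>

// Row major: rows are contiguous, so moving along a row advances by 1.
size_t row_major_index(size_t row, size_t col, size_t n_cols)
{
    return (row * n_cols) + col;
}

// Column major: columns are contiguous, so moving down a column advances by 1.
size_t col_major_index(size_t row, size_t col, size_t n_rows)
{
    return (col * n_rows) + row;
}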

Our experiment implemented a column-major matrix with the _mat_get_col_maj_index function. It converts a zero-indexed row and column number to the index of the corresponding element in the array. Since data is stored along columns, in theory an innermost loop that moves down a column will find successive data already loaded in the cache and will therefore yield higher performance (GFLOPS). Given that $A$ is $M \times N$, $B$ is $N \times P$ and $C$ is $M \times P$, PNM and NPM will yield better performance because their innermost loop moves along the columns of $A$ and $C$.

size_t _mat_get_col_maj_index(matrix_t *m, size_t row, size_t col)
{
    // In a column-major layout, column `col` starts at element
    // m->n_rows * col, and `row` is the offset within that column.
    size_t index = (m->n_rows * col) + row;
    return index;
}
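To see the effect on memory access, here is a sketch (not the repository's exact code) of the NPM ordering written directly over flat column-major arrays: with m in the innermost position, the accesses to both A and C walk consecutive addresses, while the B element stays fixed for the whole inner loop.

// A, B and C are flat column-major arrays: element (i, j) of an
// R-row matrix lives at index (j * R) + i.
for (size_t n = 0; n < N; n++) {
    for (size_t p = 0; p < P; p++) {
        for (size_t m = 0; m < M; m++) {
            // A and C are walked with stride 1; the B index is constant.
            C[(p * M) + m] += A[(n * M) + m] * B[(p * N) + n];
        }
    }
}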

Conclusion

From the experiment and results we have been able to demonstrate that loop order matters, especially when large data is involved. We can also extend our result to deciding whether it is prudent to store data as an array of structs or as multiple arrays of individual fields.
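For instance, if a loop only ever reads one field, a struct-of-arrays layout keeps the values of that field contiguous in memory, just as the column-major layout kept each column contiguous. A hypothetical sketch:

// Array of structs: the fields of one element sit next to each other,
// so a loop over a single field strides past the unused fields too.
struct particle_aos { float x, y, z; };
struct particle_aos aos[1000];

// Struct of arrays: all values of one field are contiguous, which is
// friendlier to the cache for loops that touch only that field.
struct particles_soa {
    float x[1000];
    float y[1000];
    float z[1000];
};
struct particles_soa soa;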
