How matrix multiplication works, using the T-Head Matrix Extension as an example

A month ago, my colleague Valeria at YADRO launched a series of articles about matrix extensions, which speed up operations on matrices. You have already learned what they do, what types exist, and which of them are being developed for the open RISC-V architecture.

In the final article of the series, we will walk through an example of using the T-Head matrix extension for RISC-V to implement a matrix multiplication algorithm. First, we will briefly consider the naive scalar implementation and the block version of the algorithm. Then we will implement a similar version using the matrix extension — both for square matrices and for matrices of arbitrary size. The second case is interesting because it requires processing so-called “tails” — blocks of irregular shape. In conclusion, I will say a little about ideas for further optimization of matrix multiplication and share useful links.

The article does not show step-by-step optimization of matrix multiplication to achieve maximum FLOPS, nor does it teach how to write computational kernels in assembly. It demonstrates the use of the matrix extension and the main ideas behind optimizing matrix multiplication. I have tried to describe everything in simple words, with illustrations and small code snippets.

If you haven't read the previous parts, I'm sharing the links:

→ About matrix extensions in general: why they are needed, how they work and what types exist.

→ Extensions under development for RISC-V.

→ More about the matrix extension from T-Head.

Working with the emulator

T-Head's standalone matrix extension has not yet been implemented in hardware. The only way to run code with matrix instructions is to use an emulator, in our case QEMU, which allows you to run RISC-V code on the x86 architecture. In addition, you will need a RISC-V toolchain for building. You can find everything you need on T-Head's GitHub: download the GCC cross-compilation toolchain and QEMU.

For ease of working with the matrix extension, our team at YADRO has created a separate repository on GitHub with convenient environment setup, building, and running tests on QEMU. It contains a library with its own tests and CI. The only thing missing is performance tests: they are meaningless under QEMU.

To work with the repository, you need to complete several steps:

  • Set up the environment using the env.sh script.

  • Build the project using the build.sh script.

  • Run the tests on QEMU:

./tools/qemu/bin/qemu-riscv64 -cpu c907fdvm-rv64 ./_build/test/test_rvm_square

Here c907fdvm-rv64 selects emulation of a CPU with matrix extension support.

All code from this article can be found in the repositories mentioned above. The repository is also useful for independent practice: you can try to implement a more efficient version of matrix multiplication or a version for a different data type, and add implementations of other linear algebra algorithms.

After the repository and this series of articles were prepared, a new version of the intrinsics API for the matrix extension was released. The main differences are a changed prefix in the names and block sizes added to the function arguments. In this article we will use the old version, but the updated implementation can be viewed in a separate branch of our repository. To build it, you need to download a newer toolchain, version 2.10.1, from the Xuantie website (registration is required to download).

Scalar version of matrix multiplication

Unlike the BLAS GEMM (GEneral Matrix-Matrix multiplication) routine, our version of matrix multiplication omits the scalar coefficients alpha and beta, as well as the lda, ldb, and ldc arguments.

We'll start with the most naive algorithm, with three nested loops:

void gemm_ref(const float *A, const float *B, float *C, const size_t n, const size_t m, const size_t k)
{
    for (size_t i = 0; i < n; i++) {         /* Loop over the rows of C */
        for (size_t j = 0; j < k; j++) {     /* Loop over the columns of C */
            for (size_t p = 0; p < m; p++) { /* Update C(i,j) with the inner
                                                product of the ith row of A and
                                                the jth column of B */
                C[i * k + j] += A[i * m + p] * B[p * k + j];
            }
        }
    }
}

The first two loops iterate over the matrix C, visiting each element, and the third loop computes the scalar product of a row vector of A with a column vector of B. The parameters are pointers to the matrices A, B, and C, plus the three dimensions of the matrices being multiplied.

We walk over the elements of the matrix C and compute one scalar product per element.

Of course, this implementation is very slow. Faster ones are based on the block matrix multiplication algorithm.

#define BLOCK_SIZE 4
 
static inline void process_block_4x4(const size_t m, const float *A, const size_t lda, const float *B, const size_t ldb, float *C, const size_t ldc) {
    for (size_t i = 0; i < BLOCK_SIZE; i += 1) {     /* columns of the C block */
        for (size_t j = 0; j < BLOCK_SIZE; j += 1) { /* rows of the C block */
            for (size_t p = 0; p < m; p += 1) {      /* inner product of length m */
                C[(j * ldc) + i] += A[(j * lda) + p] * B[(p * ldb) + i];
            }
        }
    }
}
 
void gemm_block4x4_ref(const float *A, const float *B, float *C, const size_t n, const size_t m, const size_t k)
{
    for (size_t j = 0; j < k; j += BLOCK_SIZE) { /* Loop over the columns of C */
        for (size_t i = 0; i < n; i += BLOCK_SIZE) { /* Loop over the rows of C */
            process_block_4x4(m, &A[i*m], m, &B[j], k, &C[(i*k) + j], k);
        }
    }
}

Here we have the loops we are already familiar with, only now we iterate not over the elements of the matrix C but over its blocks. The function process_block_4x4() computes one block of the matrix C. The following are passed to the function:

  • dimension m — the length of the scalar product,

  • A, B, C — pointers to matrix blocks,

  • lda, ldb, and ldc — the leading dimensions, i.e., the number of elements to skip to reach the next row of each matrix.

A question may arise: why is such a block algorithm needed, and what is its advantage over the naive one? A rough answer lies in the number of load operations. In the naive algorithm, to calculate one row of the matrix C — consisting in our example of 8 elements — we need one row of the matrix A and the whole matrix B, which alone is 64 loads.

In the block algorithm, to calculate a block of the matrix C of size 4×2 (that is, also 8 elements), we need two rows of the matrix A (16 elements) and only half of the matrix B — 32 loads.

As a result, counting the loads of C itself as well, to calculate the same number of elements of the matrix C we get 80 loads (naive algorithm) versus 56 loads (block algorithm).

It seems that the difference is not that big, but we were looking at small matrices. For large matrices — for example, with a row of the matrix C of 1024 elements, or a 32×32 block — the same calculation gives a colossal difference: 1,050,624 loads versus 66,560.
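If you want to check these numbers yourself, here is a tiny sketch — mine, not from the original article — that counts element loads under the same assumptions: producing a rows × cols piece of C from square m×m matrices takes the C piece itself, that many full rows of A, and that many full columns of B.

#include <stdio.h>

/* Counts the element loads needed to produce a (rows x cols) piece of C,
   assuming square m x m matrices: the C piece itself, `rows` rows of A,
   and `cols` columns of B. */
static size_t count_loads(size_t m, size_t rows, size_t cols) {
    return rows * cols   /* the C piece */
         + rows * m      /* rows of A */
         + m * cols;     /* columns of B */
}

int main(void) {
    /* 8x8 matrices: one row of C vs. the 4x2 block (2 rows x 4 columns) */
    printf("%zu vs %zu\n", count_loads(8, 1, 8), count_loads(8, 2, 4));
    /* 1024x1024 matrices: one row of C vs. a 32x32 block */
    printf("%zu vs %zu\n", count_loads(1024, 1, 1024), count_loads(1024, 32, 32));
    return 0;
}

This prints 80 vs 56 and 1050624 vs 66560 — exactly the numbers above.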

Thus, by reducing the number of loads, the block algorithm gives us better cache locality. It also lends itself to efficient vectorization and parallelization. But this version of the algorithm is still far from ideal.

To practice with the matrix extension, we will take this version of the block algorithm as a basis, and at the end of the article we will discuss how to implement matrix multiplication more efficiently.

Matrix implementation of the matrix multiplication algorithm

We have dealt with the scalar implementations of matrix multiplication. Now let's implement the block algorithm using the matrix extension. The implementation gemm_block4x4_rvm() is similar to the scalar version for now: we iterate over the blocks and call the function process_block_4x4(), which fully computes one 4×4 block of the matrix C. We pass the length of the scalar product into it, and inside it all 16 scalar products of the block are computed.

#define BLOCK_SIZE 4
static inline void process_block_4x4(const size_t m,
                                     const float *A, const size_t lda,
                                     const float *B, const size_t ldb,
                                           float *C, const size_t ldc);

void gemm_block4x4_rvm(const float *A, const float *B, float *C,
                       const size_t n, const size_t m, const size_t k)
{
    size_t blocks_n = (n / BLOCK_SIZE) * BLOCK_SIZE;
    size_t blocks_k = (k / BLOCK_SIZE) * BLOCK_SIZE;
    size_t i;
    size_t j;
    for (i = 0; i < blocks_n; i += BLOCK_SIZE) {
        for (j = 0; j < blocks_k; j += BLOCK_SIZE) {
            process_block_4x4(m,
                              &A[i*m], m,
                              &B[j], k,
                              &C[(i*k) + j], k);
        }
    }
}

How are we going to implement this function? First, we need to load the matrix block C, in which we will accumulate the result.

But before using the load instruction, we need to configure the matrix register. In the function process_block_4x4() we, as in the scalar version, process fixed 4×4 blocks. The block size was not chosen by chance: it is the maximum size for 32-bit floats with a matrix register row width of 128 bits (4 floats × 32 bits = 128 bits).

Let's configure these block sizes with the functions mcfgm(), mcfgn(), and mcfgk(). After configuration, we call the block load instruction mld_f32(), passing a pointer to the matrix C and the stride ldc * sizeof(float) — the distance to the next block row in bytes.

static inline void process_block_4x4(const size_t m,
                                     const float *A, const size_t lda,
                                     const float *B, const size_t ldb,
                                           float *C, const size_t ldc) {
    const size_t blocks_m = (m / BLOCK_SIZE) * BLOCK_SIZE;
 
    // Rows of the matrix block
    mcfgm(BLOCK_SIZE);
    // Rows of matrix B when multiplying
    mcfgn(BLOCK_SIZE);
    // Cols of the matrix block in bytes
    mcfgk(BLOCK_SIZE * sizeof(float));
 
    // Load C (4 x 4) block
    mfloat32_t mc = mld_f32(C, ldc * sizeof(float));
 
    size_t block_m;
    for (block_m = 0; block_m < blocks_m; block_m += BLOCK_SIZE) { ... }
    ...
}

The next step after loading the C block is a loop that loads blocks of the matrices being multiplied and multiplies them.

To load a block of A, we just call the load instruction with stride lda * sizeof(float), which corresponds to the number of columns of the matrix A. We don't need to reconfigure the matrix register, because we've already done that for 4×4 blocks. To load the matrix B, however, we first need to transpose it into a separate 4×4 buffer, and then load that buffer into the matrix register with the instruction mld_f32() and stride 4 * sizeof(float).

Why is this transposition necessary? At first glance it looks redundant: we are doing matrix multiplication, and for some reason we also transpose — all extra load and store operations. The point is that the specification requires it. The description of this instruction on GitHub clearly states that it multiplies a block of the matrix A by the transposed block of the matrix B. We will explain why this is needed a little later; for now, we simply obey the requirement: if it says transpose, we transpose.

After loading the blocks, we can multiply them using the instruction fmmacc_mf32().

Again, we don't need to reconfigure the matrix register — everything is already configured for size 4. After the multiplication, we advance our pointers: ptr_a by 4, because we move from left to right across A, and ptr_b by ldb * 4, because we move from top to bottom through B.

...
    const float *ptr_a = A;
    const float *ptr_b = B;

    float block_b[BLOCK_SIZE * BLOCK_SIZE];
    const size_t stride_a = lda * sizeof(float);

    size_t block_m;
    for (block_m = 0; block_m < blocks_m; block_m += BLOCK_SIZE) {
        // Load A (4 x 4) block
        mfloat32_t ma = mld_f32(ptr_a, stride_a);

        // Transpose B (4 x 4) block into a temporary buffer
        for (int row = 0; row < BLOCK_SIZE; row++) {
            for (int col = 0; col < BLOCK_SIZE; col++) {
                block_b[col*BLOCK_SIZE+row] = ptr_b[(row * ldb) + col];
            }
        }

        // Load B (4 x 4) block
        mfloat32_t mb = mld_f32(block_b, BLOCK_SIZE * sizeof(float));

        // C (4 x 4) += A (4 x 4) * B (4 x 4) ^ T
        mc = fmmacc_mf32(mc, ma, mb);

        ptr_a += BLOCK_SIZE;
        ptr_b += BLOCK_SIZE * ldb;
    }

    // Store C (4 x 4) block
    mst_f32_mf32(C, ldc * sizeof(float), mc);
}

Once we have exited the loop and multiplied all the blocks, we must store the block of the matrix C in which the products of the blocks of A and B were accumulated. This is done in the same way as loading, only with the store instruction mst_f32_mf32().

Multiplication of matrices of arbitrary size

But what if the matrix dimensions are not multiples of 4? For example, the length of a row of A or of a column of B may not be a multiple of 4, which the algorithm described above does not handle.

Let's introduce a new variable tail_m, which denotes the length of the “tail” of our scalar product — the part we must calculate separately. To do this, we add one branch and, if this “tail” exists, load these smaller blocks as well.

To load the matrix A, only the number of columns in bytes needs to be reconfigured, which is tail_m * sizeof(float): we still have 4 rows, but the number of columns can now be from 1 to 3. For the matrix B, everything is similar to the main loop, except that one dimension of the block changes: we copy the tail_m x 4 block with transposition into a temporary buffer, obtaining a 4 x tail_m block. There is no need to reconfigure the matrix register, since the dimensions coincide with those of the block of A.

Now we have to multiply the two loaded blocks. There is no need to reconfigure the matrix register, because we specified the number of columns in the previous step and configured everything else in advance; the C block is still 4×4.

    const size_t blocks_m = (m / BLOCK_SIZE) * BLOCK_SIZE;
    const size_t tail_m = m - blocks_m;
    ...
    for (block_m = 0; block_m < blocks_m; block_m += BLOCK_SIZE) { ... }

    if (tail_m) {
        mcfgk(tail_m * sizeof(float));
        // Load A (4 x tail_m) block
        mfloat32_t ma = mld_f32(ptr_a, stride_a);

        // Transpose B (tail_m x 4) block
        for (int row = 0; row < tail_m; row++) {
            for (int col = 0; col < BLOCK_SIZE; col++) {
                block_b[col*BLOCK_SIZE+row] = ptr_b[(row * ldb) + col];
            }
        }

        // Load B (4 x tail_m) block
        mfloat32_t mb = mld_f32(block_b, BLOCK_SIZE * sizeof(float));

        // C (4 x 4) += A (4 x tail_m) * B (4 x tail_m) ^ T
        mc = fmmacc_mf32(mc, ma, mb);
    }

    mcfgk(BLOCK_SIZE * sizeof(float));
    // Store C (4 x 4) block
    mst_f32_mf32(C, ldc * sizeof(float), mc);
}

After this, we need to slightly change how the block of the matrix C is stored: namely, reconfigure the block size for the store, since the column dimension was changed during tail processing.

But our work is not finished here, because we have three dimensions, not one, and the dimensions N and K may also not be multiples of 4. We must take into account the cases when the processed block of the matrix C has fewer than four columns, fewer than four rows, or both.

To do this, we change the function process_block_4x4(), in which the scalar products are calculated. It will now be called process_block_nxk() and work with any number of rows and columns in a block. For the 4×4 block size we looked at earlier, nothing changes: we pass the 4×4 size to the function and everything works as before. But we also need to handle the “tails” in N and K, so for clarity we divide the function calls into four cases with different block sizes.

void gemm_block4x4_rvm(const float *A, const float *B, float *C, const size_t n, const size_t m, const size_t k)
{
    size_t blocks_n = (n / BLOCK_SIZE) * BLOCK_SIZE;
    size_t blocks_k = (k / BLOCK_SIZE) * BLOCK_SIZE;
 
    size_t tail_n = n - blocks_n;
    size_t tail_k = k - blocks_k;
 
    size_t i;
    size_t j;
    for (i = 0; i < blocks_n; i += BLOCK_SIZE) {
        for (j = 0; j < blocks_k; j += BLOCK_SIZE) {
            process_block_nxk(BLOCK_SIZE, BLOCK_SIZE, m, &A[i*m], m, &B[j], k, &C[(i*k) + j], k);
        }
        if (tail_k != 0) {
            process_block_nxk(BLOCK_SIZE, tail_k, m, &A[i*m], m, &B[j], k, &C[(i*k) + j], k);
        }
    }
    if (tail_n != 0) {
        for (j = 0; j < blocks_k; j += BLOCK_SIZE) {
            process_block_nxk(tail_n, BLOCK_SIZE, m, &A[i*m], m, &B[j], k, &C[(i*k) + j], k);
        }
        if (tail_k != 0){
            process_block_nxk(tail_n, tail_k, m, &A[i*m], m, &B[j], k, &C[(i*k) + j], k);
        }
    }
}

Now we need to change the implementation of the calculation function process_block_nxk() to work with different block sizes. It will be very similar to the previous version of the function, only the matrix register will be configured differently. Let's look at the very last case — the tail_n × tail_k block — because if we implement the function for it, the other cases follow.

Let's configure the matrix register by the size of the C block, which, as before, must be loaded first.

static inline void process_block_nxk(const size_t n, const size_t k, const size_t m,
                                     const float *A, const size_t lda,
                                     const float *B, const size_t ldb,
                                           float *C, const size_t ldc) {
    const size_t blocks_m = (m / BLOCK_SIZE) * BLOCK_SIZE;
 
    mcfgm(n);
    mcfgk(k * sizeof(float));
    // Load C (n x k) block
    mfloat32_t mc = mld_f32(C, ldc * sizeof(float));
 
    size_t block_m;
    for (block_m = 0; block_m < blocks_m; block_m += BLOCK_SIZE) { ... }
 
    if (tail_m) { ... }
 
    // Store C (n x k) block
    ...
}

Next comes the main block multiplication loop, where we have not yet reached the “tail”. That is, the number of columns in the block of A in this loop is still 4, but the number of rows can now vary with the block size. Before loading the block of A, we need to configure the matrix register to size n×4. With the matrix B it is the same as before: we transpose it and copy it into a temporary buffer. We reconfigure only one dimension, since the block has size k×4, and load it.

After loading comes the multiplication of the two loaded blocks, for which we need to reconfigure the matrix register according to the n×k block of the matrix C that we want to compute. The number of columns of A and of B was configured when loading the blocks, and that setting stays with us.

...
    for (block_m = 0; block_m < blocks_m; block_m += BLOCK_SIZE) {
        mcfgm(n);
        mcfgk(BLOCK_SIZE * sizeof(float));
        // Load A (n x 4) block
        mfloat32_t ma = mld_f32(ptr_a, stride_a);

        // Transpose B (4 x k) block
        for (int row = 0; row < BLOCK_SIZE; row++) {
            for (int col = 0; col < k; col++) {
                block_b[col*BLOCK_SIZE+row] = ptr_b[(row * ldb) + col];
            }
        }
        mcfgm(k);
        // Load B (k x 4) block
        mfloat32_t mb = mld_f32(block_b, BLOCK_SIZE * sizeof(float));

        mcfgm(n);
        mcfgn(k);
        // C (n x k) += A (n x 4) * B (k x 4) ^ T
        mc = fmmacc_mf32(mc, ma, mb);

        ptr_a += BLOCK_SIZE;
        ptr_b += BLOCK_SIZE * ldb;
    }
    if (tail_m) { ... }
    ...
}

After the blocks are multiplied in the main loop, the “tail” — the smallest blocks — is processed. For the block of the matrix A, the number of rows remains n, but the number of columns is no longer fixed at 4: it is tail_m, our “tail”. For the matrix B it is the same: the number of columns in the block after transposition is now tail_m, not 4.

Next comes the multiplication of the blocks. The number of columns of A and of B was configured when the blocks were loaded, while the number of rows and columns of the C block needs to be reconfigured. After the multiplication, we reconfigure the register by the column size of the C block — it was tail_m and must now be k — and store the block of the matrix C.

...
    for (block_m = 0; block_m < blocks_m; block_m += BLOCK_SIZE) { ... }

    if (tail_m) {
        mcfgm(n);
        mcfgk(tail_m * sizeof(float));
        // Load A (n x tail_m) block
        mfloat32_t ma = mld_f32(ptr_a, stride_a);

        // Transpose B (tail_m x k) block
        for (int row = 0; row < tail_m; row++) {
            for (int col = 0; col < k; col++) {
                block_b[col*BLOCK_SIZE+row] = ptr_b[(row * ldb) + col];
            }
        }

        mcfgm(k);
        // Load B (k x tail_m) block
        mfloat32_t mb = mld_f32(block_b, BLOCK_SIZE * sizeof(float));

        mcfgm(n);
        mcfgn(k);
        // C (n x k) += A (n x tail_m) * B (k x tail_m) ^ T
        mc = fmmacc_mf32(mc, ma, mb);
    }

    mcfgk(k * sizeof(float));
    // Store C (n x k) block
    mst_f32_mf32(C, ldc * sizeof(float), mc);
}

Next steps in optimization

With that, the implementation of matrix multiplication using the T-Head extension for matrices of arbitrary size is complete. But will it be fast? Until we run this implementation on real hardware, we won't know. In any case, optimizing matrix multiplication is not only about using vector or matrix instructions — it is also about optimizing memory accesses.

In the previous articles of the series, Valeria mentioned the work of Goto and van de Geijn more than once. The main idea of that paper is multi-level blocking relative to the caches, plus reordering of elements. What does this mean?

It is no secret that the processor has a multi-level memory hierarchy. The fastest memory is the registers, but there are the fewest of them. Next come the L1 cache, then L2, L3, and RAM. Currently, our implementation uses neither the registers nor the processor caches effectively. With the registers it is clear — we need to increase the block size and use all 8 matrix registers (we currently use 3) — but what about the caches?
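Before we get to the caches, a quick illustration of the register side. Below is a hedged sketch — my own, not code from the article's repository — of a 4×8 micro-kernel that keeps two accumulator registers, so each loaded block of A is reused for two blocks of B. It assumes the same intrinsics as above and, for brevity, that m is a multiple of 4.

static inline void process_block_4x8(const size_t m,
                                     const float *A, const size_t lda,
                                     const float *B, const size_t ldb,
                                           float *C, const size_t ldc) {
    mcfgm(BLOCK_SIZE);
    mcfgn(BLOCK_SIZE);
    mcfgk(BLOCK_SIZE * sizeof(float));

    // Two accumulators: the C blocks at columns [0..3] and [4..7]
    mfloat32_t mc0 = mld_f32(C, ldc * sizeof(float));
    mfloat32_t mc1 = mld_f32(C + BLOCK_SIZE, ldc * sizeof(float));

    float block_b[BLOCK_SIZE * BLOCK_SIZE];
    for (size_t p = 0; p < m; p += BLOCK_SIZE) {
        // One A block feeds both multiplications
        mfloat32_t ma = mld_f32(&A[p], lda * sizeof(float));

        // Left B block: transpose into the buffer, load, multiply
        for (int row = 0; row < BLOCK_SIZE; row++)
            for (int col = 0; col < BLOCK_SIZE; col++)
                block_b[col * BLOCK_SIZE + row] = B[(p + row) * ldb + col];
        mc0 = fmmacc_mf32(mc0, ma, mld_f32(block_b, BLOCK_SIZE * sizeof(float)));

        // Right B block, 4 columns further to the right
        for (int row = 0; row < BLOCK_SIZE; row++)
            for (int col = 0; col < BLOCK_SIZE; col++)
                block_b[col * BLOCK_SIZE + row] = B[(p + row) * ldb + BLOCK_SIZE + col];
        mc1 = fmmacc_mf32(mc1, ma, mld_f32(block_b, BLOCK_SIZE * sizeof(float)));
    }

    mst_f32_mf32(C, ldc * sizeof(float), mc0);
    mst_f32_mf32(C + BLOCK_SIZE, ldc * sizeof(float), mc1);
}

The same idea extends to wider kernels until all eight registers are busy; the payoff is fewer loads of A per multiply-accumulate.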

As for the caches, the idea is to cut the matrices A and B into blocks just large enough to fit into them. In the image below, we highlight a large block of the matrix B of such a size that it fits completely into the L3 cache. Next, we take a smaller block of the matrix A so that it fits into the L2 cache. After that, we return to the matrix B and select a sub-block of its L3 block that fits into the L1 cache. In this way, we utilize the cache hierarchy as much as possible and reduce the latency of loading matrix elements into registers.

But how do we determine the sizes of these blocks? They depend on the cache sizes of the particular processor and on the matrix data type.
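As a rough illustration — the constants below are assumptions, not measurements of any real core — one could derive starting block sizes from the cache capacities, leaving about half of each level for the other operands:

#include <stddef.h>

/* Assumed cache sizes -- placeholders to be tuned for a real core. */
enum {
    L1_BYTES = 32 * 1024,
    L2_BYTES = 512 * 1024,
    L3_BYTES = 8 * 1024 * 1024,
};

#define NR 4  /* micro-kernel width: 4 columns, as in our implementation */

/* B panel of KC x NR floats should stay in L1 */
static inline size_t pick_kc(void) {
    return L1_BYTES / 2 / (NR * sizeof(float));
}
/* A block of MC x KC floats should stay in L2 */
static inline size_t pick_mc(size_t kc) {
    return L2_BYTES / 2 / (kc * sizeof(float));
}
/* B block of KC x NC floats should stay in L3 */
static inline size_t pick_nc(size_t kc) {
    return L3_BYTES / 2 / (kc * sizeof(float));
}

For double precision, sizeof(float) becomes sizeof(double) and every block shrinks by half — which is what “depends on the data type” means in practice.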

How do we proceed? We multiply the two blocks — A from L2 and B from L1 — then move to the next sub-block of B to the right, which also lies in L3. Once the entire L3 block of B has been processed, we load new blocks. And inside these blocks we use the 4×4 block multiplication we have implemented (or larger, if you use all of the matrix extension registers).

But this does not solve the problem of accessing matrix elements inside a block: they are in the cache, but before loading into the matrix register we still need to transpose the block of B. This is where reordering comes in. While copying a large block of the matrix B — the one that must fit in the cache — we can at the same time arrange the elements so that they are sequential in memory when loaded into the register. In other words, the leading dimension becomes equal to the number of columns of the block. And the transposition of B can be folded into the same copy.
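A minimal sketch of such a copy, under my own assumptions (a layout matching the 4-wide micro-kernel, with kc and nc multiples of 4): pack the block of B as a sequence of already-transposed 4×4 tiles, so that mld_f32() with stride 4 * sizeof(float) then reads purely sequential memory and the per-iteration transpose disappears from the kernel.

/* Pack a kc x nc block of B (row stride ldb) into a contiguous buffer.
   Each 4x4 tile is stored transposed, tile after tile, panel by panel,
   so the compute kernel walks `packed` strictly linearly. */
static void pack_b(const float *B, size_t ldb,
                   size_t kc, size_t nc, float *packed)
{
    for (size_t j = 0; j < nc; j += 4) {         /* panel of 4 columns */
        for (size_t p = 0; p < kc; p += 4) {     /* 4 rows at a time */
            for (size_t col = 0; col < 4; col++) {
                for (size_t row = 0; row < 4; row++) {
                    /* same transposition as block_b[] in the kernel */
                    *packed++ = B[(p + row) * ldb + (j + col)];
                }
            }
        }
    }
}

The packing cost is paid once per block and is amortized over all the multiplications that reuse it.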

In addition to the aforementioned paper by Goto and van de Geijn, there are many materials on the Internet about optimizing matrix multiplication. If you don't want to delve into large scientific papers, I can recommend two articles with step-by-step optimization of matrix multiplication.

