Theoretical and real-world performance of Intel AMX

Introduction

AMX (Advanced Matrix Extensions) is a hardware matrix multiplication accelerator that has been available in Intel Xeon Scalable server processors since the 4th generation (the Sapphire Rapids architecture).

At the beginning of this year, I finally got my hands on a server with this type of processor.

Specifically, it is a Xeon(R) Gold 5412U: a 24-core processor with a base clock frequency of 2.1 GHz (8 priority cores can accelerate to 2.3 GHz, and a single core can reach 3.9 GHz in Turbo Boost). In addition, this processor supports 8-channel DDR5-4400 memory.

As someone who has spent quite a long time optimizing computer vision algorithms and running neural networks on the CPU (the Simd and Synet libraries), I was curious how much AMX can really speed up computations and how to squeeze maximum performance out of it.

Next, I will try to answer these questions in as much detail as possible. First of all, I will touch upon issues of single-threaded performance (I will consider multi-threaded performance later).

Description of AMX

First, let's describe what AMX is. It consists of a control block (Tile Config), a set of tile registers, and a matrix accelerator that performs matrix multiplication on numbers in the bfloat16 and int8 formats (processors with AMX support for multiplying numbers in the float16 format, including complex ones, are expected next year).

The current implementation contains 8 tile registers of 1024 bytes each (16 rows of 64 bytes), which corresponds to a maximum matrix size of 32×16 (for bfloat16) or 64×16 (for int8).

The control block mentioned above is driven by a 64-byte configuration register, in which the actual size of each tile can be set. Multiplication operations are only available between data residing in tile registers; in addition, there are dedicated operations for loading and storing this data.

bfloat16 format and matrix multiplication

Let's say a few words about the bfloat16 format: unlike float16, it has a mantissa of only 7 bits, but a much wider dynamic range. It is essentially a float32 with the lower half of the mantissa cut off.

[Figure: number formats]

Using the bfloat16 format leads to a final error on the order of 0.2-0.3%, which is nevertheless quite sufficient for machine learning purposes.
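
To make this more tangible, here is a minimal scalar sketch of my own (not part of the AMX toolchain) showing that bfloat16 is simply the upper half of a float32 and what the resulting error looks like:

#include <cstdint>
#include <cstring>
#include <cmath>
#include <iostream>

// Truncating conversion: bfloat16 is the upper 16 bits of a float32.
// (The hardware instructions use round-to-nearest-even, which is slightly more accurate.)
uint16_t Float32ToBfloat16(float value)
{
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    return uint16_t(bits >> 16);
}

float Bfloat16ToFloat32(uint16_t value)
{
    uint32_t bits = uint32_t(value) << 16;
    float result;
    std::memcpy(&result, &bits, sizeof(result));
    return result;
}

int main()
{
    float src = 3.14159265f;
    float dst = Bfloat16ToFloat32(Float32ToBfloat16(src));
    std::cout << "relative error: " << std::abs(dst - src) / src << std::endl; // about 3e-4
    return 0;
}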

The operation of multiplying two tiles can be described with the following pseudocode, where FP32() denotes conversion from bfloat16 to float32:

FOR m = 0 TO dst.rows - 1
    FOR k = 0 TO (a.colsb / 4) - 1
        FOR n = 0 TO (dst.colsb / 4) - 1
            dst[m][n] += FP32(a[m][2 * k + 0]) * FP32(b[k][2 * n + 0])
            dst[m][n] += FP32(a[m][2 * k + 1]) * FP32(b[k][2 * n + 1])

Essentially, pairs of products of the input bfloat16 matrices are summed into a float32 accumulator. That is, if we want to multiply two float32 matrices, we simply convert the first one to bfloat16, and for the second matrix, in addition to the conversion itself, we also need to interleave the elements of its even and odd rows.
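
To make the required layout of matrix B concrete, here is a scalar packing sketch of my own (it reuses the Float32ToBfloat16 helper from the sketch above): tile row k stores element pairs taken from the original rows 2k and 2k+1, so that the tile multiplication sees b[k][2n+0] and b[k][2n+1] as consecutive values.

// Pack a K x N float32 matrix B into the bfloat16 tile layout expected by AMX:
// tile row k holds interleaved element pairs from the original rows 2k and 2k + 1.
// K is assumed to be even here.
void PackB(int K, int N, const float* B, uint16_t* b)
{
    for (int k = 0; k < K / 2; ++k)
        for (int n = 0; n < N; ++n)
        {
            b[(k * N + n) * 2 + 0] = Float32ToBfloat16(B[(2 * k + 0) * N + n]);
            b[(k * N + n) * 2 + 1] = Float32ToBfloat16(B[(2 * k + 1) * N + n]);
        }
}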

The simplest example of using AMX

Below is a simple example of using AMX:

#include <immintrin.h>
#include <stdint.h>
#include <iostream>
#include <unistd.h>
#include <sys/syscall.h>

const int ARCH_REQ_XCOMP_PERM = 0x1023;
const int XFEATURE_XTILEDATA = 18;

void ConvertA(const float* src, uint16_t* dst)
{
    __m512 s0 = _mm512_loadu_ps(src + 0 * 16);
    __m512 s1 = _mm512_loadu_ps(src + 1 * 16);
    _mm512_storeu_si512(dst, (__m512i)_mm512_cvtne2ps_pbh(s1, s0));
}

void ConvertB(const float* src, int stride, uint16_t* dst)
{
    static const __m512i PERM_IDX = _mm512_set_epi16(
        0x1f, 0x0f, 0x1e, 0x0e, 0x1d, 0x0d, 0x1c, 0x0c, 
        0x1b, 0x0b, 0x1a, 0x0a, 0x19, 0x09, 0x18, 0x08,
        0x17, 0x07, 0x16, 0x06, 0x15, 0x05, 0x14, 0x04, 
        0x13, 0x03, 0x12, 0x02, 0x11, 0x01, 0x10, 0x00);
    __m512 s0 = _mm512_loadu_ps(src + 0 * stride);
    __m512 s1 = _mm512_loadu_ps(src + 1 * stride);
    __m512i d = (__m512i)_mm512_cvtne2ps_pbh(s1, s0);
    _mm512_storeu_si512(dst, _mm512_permutexvar_epi16(PERM_IDX, d));
} // Conversion to BF16 with interleaving of even and odd rows.

struct TileConfig
{
    uint8_t paletteId; // must be set to 1
    uint8_t startRow; // must be set to 0
    uint8_t reserved[14];
    uint16_t colsb[16]; // actual row length of each tile in bytes
    uint8_t rows[16]; // actual number of rows in each tile
};

int main()
{
    // Enable the use of AMX tile data in Linux:
    if (syscall(SYS_arch_prctl, 
        ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) != 0)
    {
        std::cout << "Can't initialize AMX!" << std::endl;
        return 1;
    }

    float A[16][32], B[32][16], C[16][16];

    uint16_t a[16][32];
    for (int i = 0; i < 16; ++i)
        ConvertA(A[i], a[i]);

    uint16_t b[16][32];
    for (int i = 0; i < 16; ++i)
        ConvertB(B[i * 2], 16, b[i]);

    TileConfig conf = {};
    conf.paletteId = 1; 
    conf.rows[0] = 16; 
    conf.colsb[0] = 16 * 4; 
    conf.rows[1] = 16; 
    conf.colsb[1] = 16 * 4;
    conf.rows[2] = 16;
    conf.colsb[2] = 16 * 4;
    _tile_loadconfig(&conf); // load the AMX tile configuration

    _tile_zero(0); // zero tile register 0 (the accumulator)

    _tile_loadd(1, a, 64); // load matrix A into tile register 1

    _tile_loadd(2, b, 64); // load matrix B into tile register 2

    _tile_dpbf16ps(0, 1, 2); // the multiplication itself: C += A * B

    _tile_stored(0, C, 64); // store the result into matrix C

    _tile_release(); // release the AMX tile configuration

    return 0;
}

Let's look at it in a little more detail:

  • First of all, you need to enable the use of AMX registers in the operating system.

  • Next, you need to convert the input values of matrices A and B from float32 to bfloat16, for which there are corresponding instructions in the AVX-512 BF16 extension. Do not forget that the values of the even and odd rows of matrix B must be interleaved in pairs.

  • The next step is to set the actual sizes of the matrices in the registers. This is done by filling in the TileConfig structure and then loading it. The actual matrix sizes are then used implicitly when loading, multiplying and storing the tiles.

  • Next, we reset the accumulator, load the converted matrices into the registers and run the multiplication operation itself.

  • Next, we unload the result into main memory.

  • Don't forget to release the tile configuration after use. (A scalar check of the computed result is sketched right after this list.)
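
To make sure the toy example computes what we expect, a scalar check along these lines can be added at the end of main() (my own sketch; the tolerance loosely reflects the 0.2-0.3% bfloat16 error discussed above and may need loosening for data with heavy cancellation):

#include <cmath>

// Compare the AMX result against a naive float32 reference.
// Call after _tile_stored(0, C, 64) in the example above,
// with A and B filled with real data before the conversions.
bool CheckResult(const float A[16][32], const float B[32][16], const float C[16][16])
{
    for (int i = 0; i < 16; ++i)
        for (int j = 0; j < 16; ++j)
        {
            float ref = 0.0f;
            for (int k = 0; k < 32; ++k)
                ref += A[i][k] * B[k][j];
            if (std::abs(C[i][j] - ref) > 0.003f * std::abs(ref) + 1e-5f)
                return false;
        }
    return true;
}

If the check fails, the usual suspects are the tile configuration (rows/colsb) and the interleaving of matrix B.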

After this toy example, let's look at the real performance of AMX.

Theoretical performance of matrix multiplication

First, let's test the pure performance of AMX. To exclude the influence of the memory subsystem, we will assume that all input data is in registers.

void PerfBf16L0(int count)
{
    for (int i = 0; i < count; i += 4)
    {
        _tile_dpbf16ps(0, 4, 6);
        _tile_dpbf16ps(1, 4, 7);
        _tile_dpbf16ps(2, 5, 6);
        _tile_dpbf16ps(3, 5, 7);
    }
}

For numbers in the bfloat16 format, multiplying two 32×16 matrices takes 16 processor cycles, which corresponds to 3.7 TFLOPS at 3.9 GHz. This is 16 times more than can be achieved using AVX-512 in float32.
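
For reference, this is roughly how such a figure can be measured (my own sketch; it assumes AMX is already enabled in the OS and all eight tiles are configured to 16 rows × 64 bytes):

#include <immintrin.h>
#include <chrono>
#include <iostream>

// Rough timing harness for PerfBf16L0 above (my own sketch).
// One tile multiplication is 16 * 16 * 32 * 2 = 16384 floating point operations.
double MeasureBf16Tflops(int count)
{
    auto start = std::chrono::high_resolution_clock::now();
    PerfBf16L0(count);
    auto stop = std::chrono::high_resolution_clock::now();
    double seconds = std::chrono::duration<double>(stop - start).count();
    return double(count) * 16384 / seconds / 1e12;
}

// Usage: a warm-up call, then something like
// std::cout << MeasureBf16Tflops(1 << 24) << " TFLOPS" << std::endl;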

void PerfInt8L0(int count)
{
    for (int i = 0; i < count; i += 4)
    {
        _tile_dpbuud(0, 4, 6);
        _tile_dpbuud(1, 4, 7);
        _tile_dpbuud(2, 5, 6);
        _tile_dpbuud(3, 5, 7);
    }
}

For int8 integers, multiplying two 64×16 matrices takes the same 16 clock cycles, which allows you to reach 7.4 TOPS at a frequency of 3.9 GHz. This is 8 times more than can be achieved using AVX-512 VNNI.

Separately, note that the multiplication speed does not depend on the actual size of the matrices: if you multiply smaller matrices, the operation still takes 16 clock cycles. In addition, tests show that there is effectively a single matrix multiplication unit per core, which executes all multiplication operations sequentially.
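
A sketch of how the first claim can be checked (my own experiment; it reuses the TileConfig structure from the example above and the timing approach just described): configure the tiles to a quarter of their maximum size and run the same loop — the time per operation should stay at about 16 cycles.

// Same loop as PerfBf16L0, but with all tiles configured to 8 rows x 32 bytes,
// i.e. a quarter of the maximum tile size.
void PerfBf16Small(int count)
{
    TileConfig conf = {};
    conf.paletteId = 1;
    for (int i = 0; i < 8; ++i)
    {
        conf.rows[i] = 8;
        conf.colsb[i] = 32;
    }
    _tile_loadconfig(&conf);
    for (int i = 0; i < count; i += 4)
    {
        _tile_dpbf16ps(0, 4, 6);
        _tile_dpbf16ps(1, 4, 7);
        _tile_dpbf16ps(2, 5, 6);
        _tile_dpbf16ps(3, 5, 7);
    }
    _tile_release();
}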

Practical Limitations

Of course, in theory AMX provides 16 × 32 × 2 = 1024 bfloat16 operations per clock cycle; in practice, however, the source data constantly has to be loaded into the registers and the results unloaded from them, which spoils the picture considerably. On top of that, do not forget about the need to convert the source data into bfloat16.

If we approach the problem head-on, each matrix multiplication operation C += A * B requires three tile loads and one store, which leads to a catastrophic drop in performance due to I/O operations.

In practice, some of the registers are usually used as accumulators, which avoids constantly loading and storing the C matrix.

This ultimately reduces the requirement to one tile load per matrix multiplication operation.

At a clock frequency of 3.9 GHz, achieving the peak 3.7 TFLOPS therefore requires a load bandwidth of at least 250 GB/s: one 1 KB tile every 16 cycles is 64 bytes per cycle.
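
The 250 GB/s figure follows from simple arithmetic (a back-of-the-envelope estimate using the numbers measured above):

// One tile multiplication (16 cycles) consumes one freshly loaded 1 KB tile:
constexpr double frequency = 3.9e9;             // Hz, maximum Turbo Boost of one core
constexpr double bytesPerCycle = 1024.0 / 16.0; // 64 bytes per cycle
constexpr double requiredBandwidth = frequency * bytesPerCycle; // ~250 GB/s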

Next, we will test the data loading speed and check whether it is enough to keep the tile registers fed.

AMX register loading speed

In the first case, the data is arranged compactly:

void LoadCompact(int count, uint8_t* buf)
{
    for (int i = 0; i < count; i++)
        _tile_loadd(0, buf + i * 1024, 64);
}

In the second, we simply load the data, row by row:

void LoadLongRows(int count, uint8_t* buf)
{
    for (int i = 0; i < count; i++)
        _tile_loadd(0, buf + i * 64, 64 * count);
}
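
A driver for the first of these functions might look roughly like this (my own sketch; LoadLongRows can be measured the same way). The buffer size is swept so that the working set lands in L1, L2, L3 or main memory, and bandwidth is simply bytes divided by elapsed time; AMX is assumed to be enabled and tile 0 configured to 16 rows × 64 bytes:

#include <chrono>
#include <cstdint>
#include <vector>

// Effective tile load bandwidth in GB/s for a working set of `size` bytes.
double MeasureLoadBandwidth(size_t size, int repeats)
{
    std::vector<uint8_t> buf(size, 1);
    int count = int(size / 1024); // number of 1 KB tiles in the buffer
    auto start = std::chrono::high_resolution_clock::now();
    for (int r = 0; r < repeats; ++r)
        LoadCompact(count, buf.data());
    auto stop = std::chrono::high_resolution_clock::now();
    double seconds = std::chrono::duration<double>(stop - start).count();
    return double(size) * repeats / seconds / 1e9;
}

// Usage: sweep sizes like 32 KB (L1), 1 MB (L2), 16 MB (L3), 256 MB (memory).
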
[Chart: tile load bandwidth depending on data location (L1/L2/L3/memory)]

As can be seen from this chart, loading data from the L1 cache (48 KB per core) achieves 380-390 GB/s, about 75% of the theoretical maximum, which is in principle enough to fully utilize the matrix accelerator. However, the L2 cache (2 MB per core) delivers only 170-180 GB/s (70% of its theoretical maximum), which is not enough to keep the accelerator fully busy. L3 cache bandwidth is just 32 GB/s, only slightly higher than the single-threaded memory bandwidth of 20-21 GB/s; moreover, at 1.9 MB per core the L3 slice is even smaller than the L2 cache.

These bandwidth limitations naturally affect overall performance, as the following test can clearly demonstrate:

void PerfBf16L1(int count, uint8_t* buf, bool update, bool save)
{
    uint8_t* A0 = buf + 4 * 1024, *A1 = A0 + count * 1024;
    uint8_t* B0 = A1 + count * 1024, *B1 = B0 + count * 1024;
    if (update)
    {
        _tile_stream_loadd(0, buf + 0 * 1024, 64);
        _tile_stream_loadd(1, buf + 1 * 1024, 64);
        _tile_stream_loadd(2, buf + 2 * 1024, 64);
        _tile_stream_loadd(3, buf + 3 * 1024, 64);
    }
    else
    {
        _tile_zero(0);
        _tile_zero(1);
        _tile_zero(2);
        _tile_zero(3);
    }
    for (int i = 0; i < count; i++)
    {
        _tile_loadd(4, A0 + i * 1024, 64);
        _tile_loadd(5, A1 + i * 1024, 64);
        _tile_loadd(6, B0 + i * 1024, 64);
        _tile_loadd(7, B1 + i * 1024, 64);
        _tile_dpbf16ps(0, 4, 6);
        _tile_dpbf16ps(1, 4, 7);
        _tile_dpbf16ps(2, 5, 6);
        _tile_dpbf16ps(3, 5, 7);
    }
    if (save)
    {
        _tile_stored(0, buf + 0 * 1024, 64);
        _tile_stored(1, buf + 1 * 1024, 64);
        _tile_stored(2, buf + 2 * 1024, 64);
        _tile_stored(3, buf + 3 * 1024, 64);
    }
}

Essentially, this test is a simplified microkernel of a real matrix multiplication algorithm: C[32][32] += A[32][K] * B[K][32], i.e. multiplying a block of rows by a block of columns. For clarity, here is a diagram:

The graphs below show actually achievable performance indicators at the microkernel level:

[Chart: microkernel performance depending on data location]

It can be seen that under ideal conditions, when all data is in the L1 cache, performance reaches 3.3 TFLOPS, or 90% of the theoretical maximum. Taking into account that the results need to be stored somewhere, a value of 2.9-3.0 TFLOPS (80% efficiency) looks more realistic.

When data is localized in the L2 cache, performance reaches 2.6-2.7 TFLOPS, about 70% of the maximum possible, which agrees well with the L2 cache throughput we measured earlier. If the data is not localized in the L1 or L2 cache, performance drops dramatically. Therefore, any efficient algorithm that uses AMX must revolve around keeping the working data within 2 MB.
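
For reference, the working set of this microkernel is easy to relate to the count parameter (my own arithmetic based on the code above), which is how the points of the chart map onto the cache levels:

// Working set of PerfBf16L1 as a function of count (each tile is 1 KB):
//   4 accumulator tiles           =  4 KB
//   2 * count tiles of matrix A   =  2 * count KB
//   2 * count tiles of matrix B   =  2 * count KB
// Total: 4 + 4 * count KB, so count <= ~10 keeps the data within the 48 KB L1,
// and count <= ~500 keeps it within the 2 MB L2 (in GEMM terms K = 32 * count).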

Matrix multiplication

The simplest such algorithm is, of course, matrix multiplication itself. Let's try to implement it using AMX. I'll say right away that the example is simplified: it assumes that all matrix dimensions are multiples of 32.

First, let's implement the microkernel:

void GemmMicro(int K, const uint16_t* A0, const uint16_t* A1,
    const uint16_t* B0, const uint16_t* B1,
    float* C, int ldc, bool update)
{
    if (update)
    {
        _tile_stream_loadd(0, C, ldc * 4);
        _tile_stream_loadd(1, C + 16, ldc * 4);
        _tile_stream_loadd(2, C + 16 * ldc, ldc * 4);
        _tile_stream_loadd(3, C + 16 * ldc + 16, ldc * 4);
    }
    else
    {
        _tile_zero(0);
        _tile_zero(1);
        _tile_zero(2);
        _tile_zero(3);
    }
    for (int k = 0; k < K; k += 32)
    {
        _tile_stream_loadd(4, A0 + k * 16, 64);
        _tile_stream_loadd(5, A1 + k * 16, 64);
        _tile_loadd(6, B0 + k * 16, 64);
        _tile_loadd(7, B1 + k * 16, 64);
        _tile_dpbf16ps(0, 4, 6);
        _tile_dpbf16ps(1, 4, 7);
        _tile_dpbf16ps(2, 5, 6);
        _tile_dpbf16ps(3, 5, 7);
    }
    _tile_stored(0, C, ldc * 4);
    _tile_stored(1, C + 16, ldc * 4);
    _tile_stored(2, C + 16 * ldc, ldc * 4);
    _tile_stored(3, C + 16 * ldc + 16, ldc * 4);
}

Here we load matrix B data from two quasi-columns (blocks of 16 columns) localized in the L1 cache, and matrix A data from quasi-rows (blocks of 16 rows), which must be localized in the L2 cache. For the latter we use the special instruction _tile_stream_loadd, which loads data directly, bypassing the upper-level caches (so as not to evict matrix B data from L1). The values of matrix C, if needed, are loaded in the same way.

Next we will create a macro kernel:

void ConvertA(int K, const float* A, int lda, uint16_t* buf)
{
    for (int k = 0; k < K; k += 32, A += 32)
        for (int i = 0; i < 16; ++i, buf += 32)
            ConvertA(A + i * lda, buf);
}

void ConvertB(int K, const float* B, int ldb, uint16_t* buf)
{
    for (int k = 0; k < K; k += 2, B += 2 * ldb, buf += 32)
        ConvertB(B, ldb, buf);
}

void GemmMacro(int M, int N, int K,
    const float* A, int lda, uint16_t* bufA,
    const float* B, int ldb, uint16_t* bufB,
    int convertB, float* C, int ldc, bool update)
{
    for (int j = 0; j < N; j += 32)
    {
        uint16_t* B0 = bufB + j * K;
        uint16_t* B1 = bufB + (j + 16) * K;
        if (convertB)
        {
            ConvertB(K, B + j + 0, ldb, B0);
            ConvertB(K, B + j + 16, ldb, B1);
        }
        for (int i = 0; i < M; i += 32)
        {
            uint16_t* A0 = bufA + i * K;
            uint16_t* A1 = bufA + (i + 16) * K;
            if (j == 0)
            {
                ConvertA(K, A + i * lda, lda, A0);
                ConvertA(K, A + (i + 16) * lda, lda, A1);
            }
            GemmMicro(K, A0, A1, B0, B1, C + i * ldc + j, ldc, update);
        }
    }
}

The data that this function uses is localized in the L2 and L3 caches. If necessary, it is loaded from memory; at the same time the conversion from float32 to bfloat16 is performed for matrix A, and the conversion plus row reordering for matrix B.

And finally, the matrix multiplication function itself:

void GemmFunc(int M, int N, int K, const float* A, const float* B, float* C)
{
    TileConfig conf = {};
    conf.paletteId = 1;
    for (size_t i = 0; i < 8; ++i)
    {
        conf.rows[i] = 16;
        conf.colsb[i] = 64;
    }
    _tile_loadconfig(&conf);

    const int L1 = 48 * 1024, L2 = 2 * 1024 * 1024, L3 = 45 * 1024 * 1024;
    int mK = std::min(L1 / 2 / 32, K) / 32 * 32;
    int mM = std::min(int(L2 * 0.5) / 2 / mK, M) / 32 * 32;
    int mN = std::min(int(L3 * 0.1) / 2 / mK, N) / 32 * 32;
    std::vector<uint16_t> bufA(mK * mM), bufB(mN * mK);
    for (int j = 0; j < N; j += mN)
    {
        int dN = std::min(N, j + mN) - j;
        for (int k = 0; k < K; k += mK)
        {
            int dK = std::min(K, k + mK) - k;
            for (int i = 0; i < M; i += mM)
            {
                int dM = std::min(M, i + mM) - i;
                GemmMacro(dM, dN, dK,
                    A + i * K + k, K, bufA.data(),
                    B + k * N + j, N, bufB.data(), i == 0,
                    C + i * N + j, N, k != 0);
            }
        }
    }
    _tile_release();
}

First, we configure all tile registers to their maximum size. Next, we allocate two buffers for storing blocks of matrices A and B with data already prepared for AMX.

Below is a diagram to better understand the order in which the data is traversed:

The block of matrix A is sized at 50% of the L2 cache, and the block of B at 10% of the total L3 cache. Why can only half of the L2 cache be used effectively? Apparently because it is actively polluted by data from matrices B and C; I have not yet found a good way to deal with this. As for the L3 cache, with the current implementation of the algorithm a larger share does not bring any speedup.
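
For the cache sizes of this particular processor, the formulas in GemmFunc work out as follows for large matrices (my own arithmetic, just to make the 50% and 10% figures tangible):

// L1 = 48 KB, L2 = 2 MB, L3 = 45 MB:
//   mK = (48 * 1024 / 2 / 32) / 32 * 32              = 768
//   mM = (2 * 1024 * 1024 / 2 / 2 / 768) / 32 * 32   = 672
//   mN = (45 * 1024 * 1024 / 10 / 2 / 768) / 32 * 32 = 3072
// bufA: 672 * 768 bfloat16 values ~ 1.0 MB (about half of L2),
// bufB: 3072 * 768 bfloat16 values ~ 4.5 MB (about 10% of the 45 MB L3).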

The result is visible in the graph below:

[Chart: matrix multiplication performance, AMX vs. AVX-512]

It can be seen that AMX performance on matrix multiplication reaches 1.4 TFLOPS, which is roughly 37% of the theoretical maximum. On the one hand, this seems quite modest; on the other hand, it is 7.5 times faster than AVX-512.
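
This is roughly how such a figure can be reproduced (my own harness with arbitrary test data; the matrix size must be a multiple of 32, AMX must be enabled via the arch_prctl call from the first example, and a warm-up call helps to stabilize the numbers):

#include <chrono>
#include <iostream>
#include <vector>

// Time GemmFunc on square matrices and report TFLOPS
// (2 * M * N * K floating point operations per multiplication).
double MeasureGemmTflops(int size)
{
    std::vector<float> A(size_t(size) * size), B(size_t(size) * size), C(size_t(size) * size, 0.0f);
    for (size_t i = 0; i < A.size(); ++i)
        A[i] = B[i] = float(i % 32) * 0.1f; // arbitrary test data
    auto start = std::chrono::high_resolution_clock::now();
    GemmFunc(size, size, size, A.data(), B.data(), C.data());
    auto stop = std::chrono::high_resolution_clock::now();
    double seconds = std::chrono::duration<double>(stop - start).count();
    return 2.0 * size * size * size / seconds / 1e12;
}

// Usage: std::cout << MeasureGemmTflops(2048) << " TFLOPS" << std::endl;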

Reasons for low efficiency

In short, AMX is too fast for the current L1/L2 cache sizes and for the L3 and main memory bandwidth. In the Xeon Max series of processors with integrated high-bandwidth HBM memory these problems are largely alleviated, but unfortunately I can't check this personally yet. Only on the Xeon Max can AMX reveal its full potential, but these processors are rare, and their price tag is, to put it mildly, not entirely humane.

What else can you do

As for increasing performance on current hardware, for the typical application of AMX – neural network inference – the following can be suggested:

  1. Convert and rearrange data in advance. At least for the weights this can always be done. This saves both the conversion itself and the load on the memory subsystem.

  2. For 2×2, 3×3, etc. convolutions, a more compact data layout is possible, which allows more computation to be done per loaded block and in theory should increase AMX utilization.

Conclusions

If you happen to run neural networks on a CPU, then AMX is just what the doctor ordered. Even if it is not possible to use all of its power, what is actually achievable is quite impressive (a 7.5× speedup is nothing to sneeze at). Yes, you will have to tinker, and of course the average speedup will be much more modest, but for a large class of models it will still be several-fold. In this article I have left aside the questions of multithreading and testing on real tasks; perhaps they will become the basis for future articles.
