In pursuit of speed. Optimization of neural network calculations on the K1967VN044 processor from Milandr

In the article

“The Second Birth of DSP, or Launching Neural Networks on K1967VN044 Processors from Milandr”

we examined, in general terms, the task of adapting neural networks to the K1967VN044 DSP. The features of the processor and the possible ways of using it effectively were briefly described. In this article we will present one of these methods in more detail, namely the use of a library of assembly functions that efficiently compute the typical operations found in neural networks.

Since assembler code examples will appear from now on, you will need at least a general understanding of it. As was rightly noted, this processor is a descendant of the TigerSHARC architecture, so programmers familiar with TigerSHARC will easily recognize this code. For those who have not dealt with it, we can recommend the Programming Guide (https://ic.milandr.ru/upload/iblock/77f/77fac90e79704374aaccc4b44f3244d6.pdf), which describes all the capabilities of the processor in detail, including the numerous improvements made by Milandr.

That said, the main ideas migrate from one DSP to another, so features such as a large register file, operations performed only on register data, “tricky” instructions and the like should not come as a surprise. Brief explanations will also be given along the way.

First, let's recall which of the processor's advantages we were going to use to speed up the calculations:
1. a wide memory-access bus;
2. the ability to execute several instructions per clock cycle;
3. vector instructions;
4. “advanced” instructions capable of performing complex operations.

Examples

Fast memory copy

Let's look at the simplest example – copying a block of memory. We will analyze it in some detail, since later on the techniques described here will only be mentioned in passing.

So, we have the classic memcpy function, which takes the source and destination addresses and the number of bytes. The obvious implementation reads a byte into a register, writes it to the destination, advances the pointers and decrements the loop counter. In the general case there is no other way, but the first idea is that if, for example, the addresses are word-aligned and the number of bytes is a multiple of four, we can operate on whole words, which gives a big gain in speed. Naturally, we want to take this idea to its logical conclusion: use the largest possible portions of data while ensuring the necessary alignment. And in our case we can afford it, because our objects are large enough that we can accept a small memory loss for alignment where necessary.
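Before moving to assembler, here is a minimal C sketch of the two scalar variants just described: a byte-by-byte copy and a word-by-word copy behind an alignment/size check. The names and the dispatch condition are purely illustrative and are not taken from the real code.

#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch only: byte copy vs. word copy behind a check. */
static void copy_bytes(uint8_t *dst, const uint8_t *src, size_t n)
{
    while (n--)
        *dst++ = *src++;               /* one byte per iteration */
}

static void copy_maybe_words(void *dst, const void *src, size_t n)
{
    if ((((uintptr_t)dst | (uintptr_t)src) & 3u) == 0 && (n & 3u) == 0) {
        uint32_t *d = dst;
        const uint32_t *s = src;
        for (size_t i = 0; i < n / 4; ++i)
            d[i] = s[i];               /* four bytes per iteration */
    } else {
        copy_bytes(dst, src, n);       /* fall back to the general case */
    }
}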

As a result, the function code looks like this:

_copy_i32v8:  // dst in j4, src in j5, loop count in j6
    k5:4 = j5:4;  j4 = j4 + 4;;        // copy both pointers into the K block; advance dst (j4) by 4
    j5 = j5 + 4;  lc0 = j6;;           // advance src (j5) by 4; set the loop counter
loop1:
    xr3:0 = q[j5 += 8];  yr3:0 = q[k5 += 8];;                         // read 4 + 4 words into X and Y, addresses += 8
    if nlc0e, jump loop1;  q[j4 += 8] = xr3:0;  q[k4 += 8] = yr3:0;;  // write 8 words and loop while lc0 != 0

Let's take a closer look at it.

First, the arguments arrive in registers j4, j5 and j6. As will become clear shortly, we need both integer modules, J and K, for addressing, so we copy the source and destination addresses from J to K, and this is done in a single instruction. Moreover, on the same line (that is, in the same clock cycle) we advance the destination pointer (j4) by 4. The processor allows a register to be read and written on the same line, with the read returning the previous value.

On the next line we advance the source pointer (j5) and set the loop counter.
Why these pointer shifts? The next line makes it clear.

There we read 4 words at once into module X from address j5 and 4 words into module Y from address k5, while both addresses are auto-incremented by 8.

Now it is clear why the initial shift by 4 was needed, and also that the loop counter should be set to the number of words divided by 8. This way we achieve the maximum possible throughput of the 256-bit bus.

Finally, the last line writes 8 words to the addresses in j4 and k4, decrements the loop counter and, if it is not zero, jumps back to the start of the loop. The jump is marked as predicted, so the pipeline is not flushed and the jump itself comes “for free”.

Let's pause for a moment and ask: could the compiler generate such code?

Well, perhaps, if it has the information that the addresses are aligned (this can be conveyed through an attribute on the function arguments) and that the size is a multiple of 32…

The problem with that last condition is that only the programmer knows which sizes actually occur. The compiler will most likely guard the “fast” variant with a check and keep a “slow” variant as well, to stay general.

But that's not all!

The processor has a peculiarity: after a load into modules X and Y, the data becomes available not immediately, but only one clock cycle later. So in reality there is a “bubble” inside the loop and one clock cycle is lost.

It's easy to get around it like this:

loop8:
    xr3:0 = q[j5 += 8];  yr3:0 = q[k5 += 8];;   // read the first 8 words
    xr7:4 = q[j5 += 8];  yr7:4 = q[k5 += 8];;   // read the next 8 words; the first portion becomes ready
    q[j4 += 8] = xr3:0;  q[k4 += 8] = yr3:0;;   // write the first portion
    if nlc0e, jump loop8;  q[j4 += 8] = xr7:4;  q[k4 += 8] = yr7:4;;  // write the second portion and loop

Everything is the same, except that now we read 8 words twice in a row, so that by the time we write, the first portion is already available. The compiler knows about “bubbles”, and in the original version it would try to squeeze some other instruction between the read and the write that could be executed there without breaking the algorithm; since there is none, it would put up with the “bubble”. Counting on it to perform this kind of “unroll” on its own would be too optimistic.
And note that this is the simplest example – it only gets harder from here…

However, our task was not to write a specialized version of memcpy! It's time to take a look at the C code from TVM that we are going to optimize.

It looks roughly like this: a main function, which allocates memory for temporary objects (this will be discussed in the next article) and calls the layer functions, of which there can be many.

A layer function, in turn, allocates memory for its own temporary objects, and then loop follows loop. Here is one of them:

  for (int32_t i1 = 0; i1 < 25; ++i1) {
    for (int32_t i2 = 0; i2 < 5; ++i2) {
      for (int32_t i3 = 0; i3 < 64; ++i3) {
        int32_t cse_var_1 = (((i1 * 320) + (i2 * 64)) + i3);
        ((int8_t*)pad_temp_let)[cse_var_1] = p0[cse_var_1];
      }
    }
  }

Again, since this is the first one, we will dwell on it a little longer.

We see three nested loops, the outer two being the simplest form of “for”, where the variable runs from zero to a constant bound in steps of 1. This is no accident; on the contrary, the vast majority of the code looks exactly like this. There is nothing surprising here: a neural network works with tensors, so most of the work is traversing a tensor along its axes (in this case over two of them) and evaluating some formula in the inner loop. And if you look closely, you will see that in the inner loop cse_var_1 is an invariant base to which the loop variable i3 is simply added. It is then used to index a byte array, with the base offset a multiple of 64 and the inner size exactly 64.

So we can simply replace the inner loop with a call to our function:

      copy_i32v8((int32_t *)((int8_t*)pad_temp_let + i1 * 320 + i2 * 64),
                 (int32_t *)(p0 + i1 * 320 + i2 * 64),
                 64 / (2 * 8 * 4));

That is, we compute the two addresses, make sure the alignment is not violated, adjust the size, and instead of 64 byte-sized iterations we get just a few instructions in the assembly implementation.
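For reference, here is a rough scalar model of what copy_i32v8 is expected to do, written only to pin down the semantics of the call above. The count convention (one unit per 64-byte block, i.e. bytes / (2 * 8 * 4)) is inferred from the argument in the call and from the unrolled loop, so treat it as an assumption rather than the documented interface.

#include <stdint.h>

/* Assumed reference model of copy_i32v8: each count unit moves
   2 * 8 = 16 32-bit words (64 bytes), matching the unrolled kernel. */
static void copy_i32v8_ref(int32_t *dst, const int32_t *src, unsigned qty)
{
    while (qty--)
        for (int i = 0; i < 16; ++i)
            *dst++ = *src++;
}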

Now, I hope, the main idea of the optimization is clear.
We expect the TVM code to contain a large number of constructs that differ only in certain parameters, while the meaning of the computation repeats. We will search for them by matching a set of templates with a dedicated converter, and replace inner loops with a kind of intrinsics for which an efficient implementation has been written.

Adding a constant to a vector

Here's a slightly more complex example:

  for (int32_t ax1_2 = 0; ax1_2 < 25; ++ax1_2) {
    for (int32_t ax2_2 = 0; ax2_2 < 5; ++ax2_2) {
      for (int32_t ax3_2 = 0; ax3_2 < 64; ++ax3_2) {
        int32_t cse_var_7 = (((ax1_2 * 320) + (ax2_2 * 64)) + ax3_2);
        ((int32_t*)conv2d_nhwc_let)[cse_var_7] = (((int32_t*)conv2d_nhwc_let)[cse_var_7] - 128);
      }
    }
  }

Anyone who remembers the previous example will instantly recognize the same tensor, except that the data is now 32-bit and instead of copying we subtract 128 from each element. This is again a typical vector operation, whose kernel may look like this (not the most optimal variant):

loop6:
    xr3:0 = q[j4 + 0];  yr3:0 = q[k4 + 0];;                           // read 8 words
    r1:0 = r1:0 + r5:4;;                                              // four 32-bit additions (X and Y)
    r3:2 = r3:2 + r7:6;;                                              // four more additions
    if nlc0e, jump loop6;  q[j4 += 8] = xr3:0;  q[k4 += 8] = yr3:0;;  // write back, advance, loop

Here, in the prologue (not shown), the number to be added to every element of the vector is replicated into registers xyr7:4, and then we again read the largest possible portion of data.

True, the ALUs are only 64 bits wide, so two add instructions per portion are needed, but each of them performs four 32-bit additions per clock cycle. And if we refine the implementation further to eliminate the bubbles, we can again expect a significant speed-up. The call for this case looks like this:

 add_n_i32v8((int32_t *)conv2d_nhwc_let + ax1_2 * 320 + ax2_2 * 64,
                  -128,
                  64 / (8 * 1));
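As before, a scalar reference model helps pin down the intent; the count convention (groups of 8 elements, matching the 64 / (8 * 1) in the call) is my assumption, not the documented interface.

#include <stdint.h>

/* Assumed reference model of add_n_i32v8: add n to qty * 8 consecutive
   32-bit elements in place. */
static void add_n_i32v8_ref(int32_t *v, int32_t n, unsigned qty)
{
    for (unsigned i = 0; i < qty * 8; ++i)
        v[i] += n;
}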

Multiplication with shift and rounding

Let's skip the plain vector additions and subtractions, since it should already be clear how they are handled, and look at a more complex example (from here on I omit the outer loops):

for (int32_t i3_1 = 0; i3_1 < 64; ++i3_1) {
        int32_t cse_var_6 = (((i1_1 * 320) + (i2_1 * 64)) + i3_1);
        ((int32_t*)conv2d_nhwc_let)[cse_var_6] = ((int32_t)(((((int64_t)((int32_t*)conv2d_nhwc_let)[cse_var_6]) * ((int64_t)((int32_t*)fused_nn_conv2d_subtract_add_constant_6_let)[i3_1])) + ((int64_t)1 << ((int64_t)((((int32_t*)fused_nn_conv2d_subtract_add_constant_8_let)[i3_1] + 31) - 1)))) >> ((int64_t)(((int32_t*)fused_nn_conv2d_subtract_add_constant_8_let)[i3_1] + 31))));
      }

It looks quite confusing, so it's better to write the expression with short names:

*arg1 = ((int64_t)*arg1 * (int64_t)*arg2 + (1LL << (*arg3 + 30))) >> (*arg3 + 31);

Now it becomes clear that we compute a 64-bit product of two vectors element by element, keep only the most significant bits of each result (importantly, a different number of bits for each element), and also round.
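Written as a per-element C function (hypothetical name, same formula as the short-name expression above), the operation is:

#include <stdint.h>

/* One element of the "multiply, round, shift right" operation. */
static int32_t mul_shift_round(int32_t a, int32_t b, int32_t sh)
{
    int64_t p = (int64_t)a * (int64_t)b;
    p += (int64_t)1 << (sh + 30);       /* rounding term */
    return (int32_t)(p >> (sh + 31));   /* keep the high bits */
}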

Can the compiler efficiently vectorize such a loop? Actually, it should, since it has all the information, but let's look at the code written by hand:

_op5_v8:  // void op5_v8(int32_t *arg1, const int32_t *arg2, const int32_t *arg3, unsigned qty);
    k7:4 = j7:4;  j4 = j4 + 4; r13 = r13 - r13;;
    j5 = j5 + 4;  lc0 = j7;;
    j6 = j6 + 4;  r12 = (1 << 30);;
    r14 = -31;;
loop5:
    xr3:0 = q[k4 + 0]; yr3:0 = q[j4 + 0];;
    xr7:4 = q[k5 += 8]; yr7:4 = q[j5 += 8];;
    xr11:8 = q[k6 += 8]; yr11:8 = q[j6 += 8];;

    r17:16 = r0 * r4 (i);;
    lr19:18 = ashift r13:12 by r8;;
    r8 = r14 - r8;;
    lr17:16 = r17:16 + r19:18;;
    lr17:16 = ashift r17:16 by r8;;
    r0 = r16;;

    r17:16 = r1 * r5 (i);;
    lr19:18 = ashift r13:12 by r9;;
    r9 = r14 - r9;;
    lr17:16 = r17:16 + r19:18;;
    lr17:16 = ashift r17:16 by r9;;
    r1 = r16;;

    r17:16 = r2 * r6 (i);;
    lr19:18 = ashift r13:12 by r10;;
    r10 = r14 - r10;;
    lr17:16 = r17:16 + r19:18;;
    lr17:16 = ashift r17:16 by r10;;
    r2 = r16;;

    r17:16 = r3 * r7 (i);;
    lr19:18 = ashift r13:12 by r11;;
    r11 = r14 - r11;;
    lr17:16 = r17:16 + r19:18;;
    lr17:16 = ashift r17:16 by r11;;
    r3 = r16;;

    if nlc0e, jump loop5;  q[k4 += 8] = xr3:0; q[j4 += 8] = yr3:0;;

In principle, there is nothing fundamentally new here, so there is not much to comment on.

Once again, the prologue sets up the pointers and prepares the constants, and again we fetch the largest possible portions of data from memory. But note that the computation is arranged so that “bubbles” do not arise (where possible), and, most importantly, the results end up placed so that at the end of the loop they can be written to memory in one go.

Will the compiler be able to find such code? In any case, this is no longer a completely trivial task…

Multiplication with a stride

Now let's go back to the list given at the beginning. Points 1–3 have been covered; point 4 remains – the specialized instructions. Let's try those as well.

Let's again start with as simple an example as possible. We have (the three outer loops are omitted; they supply xx, yy and ff):

        ((int32_t*)conv2d_nhwc_let)[(((yy * 320) + (xx * 64)) + ff)] = 0;
        for (int32_t rc = 0; rc < 64; ++rc) {
          int32_t cse_var_3 = ((yy * 320) + (xx * 64));
          int32_t cse_var_2 = (cse_var_3 + ff);
          ((int32_t*)conv2d_nhwc_let)[cse_var_2] = (((int32_t*)conv2d_nhwc_let)[cse_var_2] + (((int32_t)((int8_t*)pad_temp_let)[(cse_var_3 + rc)]) * ((int32_t)((int8_t*)fused_constant_2_let)[((rc * 64) + ff)])));
        }

It looks confusing, but on closer inspection we see that a certain 32-bit cell is zeroed, and then products of 8-bit numbers are accumulated into it, with one of the factors read with a constant stride.
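In scalar C the inner loop boils down to a strided dot product; the sketch below (hypothetical name, written element by element) only fixes the semantics that the assembly function implements 8 elements at a time.

#include <stdint.h>

/* Strided dot product: sum over i of a[i] * b[i * step],
   with the 8-bit inputs widened to 32 bits before multiplying. */
static int32_t mul_i8_step_sum_ref(const int8_t *a, const int8_t *b,
                                   unsigned step, unsigned n)
{
    int32_t acc = 0;
    for (unsigned i = 0; i < n; ++i)
        acc += (int32_t)a[i] * (int32_t)b[i * step];
    return acc;
}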

For the assembler version it is easier to make a function that returns the value of the sum, while the store itself stays in C.

So:

_mul_i8_step_sum_i32_v8:  // int32_t mul_i8_step_sum_i32_v8(int8_t *arg1, int8_t *arg2, unsigned step, unsigned qty);
    k7:4 = j7:4;  j4 = j4 + 4;;
    j0 = lshiftl j6 by 2;  lc0 = j7;;
    j5 = j5 + j0;  r1:0 = r1:0 - r1:0;  k0 = j0;;
loop2:
    xr2 = pb[k4 += 8]; yr2 = pb[j4 += 8];;

    xr4 = b[k5 += k6] (se);  yr4 = b[j5 += j6] (se);;
    xr5 = b[k5 += k6] (se);  yr5 = b[j5 += j6] (se);;
    xr6 = b[k5 += k6] (se);  yr6 = b[j5 += j6] (se);;
    xr7 = b[k5 += k6] (se);  yr7 = b[j5 += j6] (se);;

    sr3:2 = expand br2 (i);;

    sr4 = compact r5:4 (i);;
    sr5 = compact r7:6 (i);;

    r7:4 = r3:2 * r5:4 (i);;
    r1:0 = r1:0 + r5:4;;

    if nlc0e, jump loop2;  r1:0 = r1:0 + r7:6;  j5 = j5 + j0;  k5 = k5 + k0;;

    r0 = r0 + r1;;
    xr1 = yr0;;
    xr0 = r0 + r1;;

A full analysis would take up too much space, especially since the reader will already recognize the familiar parts. What is new here is the following.

First of all, there is no point in reading large blocks here, and it is not even possible, since in the second array the bytes are spread out with a stride.

Instead, we read eight bytes into the xyr2 registers, and the eight bytes of the second factor with individual instructions into xyr7:4, where they arrive sign-extended to 32 bits.

Now the main question is: how do we multiply? The processor can do a lot, but not everything. What suits our case is multiplication of 16-bit numbers. But neither of our operands is 16-bit! And while narrowing a 32-bit two's complement number to 16 bits is easy, sign-extending without hardware support is noticeably harder.

Fortunately, everything is provided for: the expand instruction turns four 8-bit numbers into four 16-bit numbers spread across two registers, and two compact instructions produce the same four 16-bit numbers in a register pair from the 32-bit data.
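A rough C model of these two conversions, per compute module, based on the description above (the functions are illustrative, not actual intrinsics):

#include <stdint.h>

/* expand: four signed bytes -> four 16-bit values (sign extension). */
static void expand_i8_to_i16(const int8_t in[4], int16_t out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = in[i];
}

/* compact: two 32-bit values -> two 16-bit values (narrowing). */
static void compact_i32_to_i16(const int32_t in[2], int16_t out[2])
{
    for (int i = 0; i < 2; ++i)
        out[i] = (int16_t)in[i];
}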

Then follows an instruction that is easy to overlook in the code, although it performs 8 (!) 16-bit multiplications (remember that we have two modules, X and Y), producing eight 32-bit results. To reduce them, we accumulate a sum in xyr1:0, and in the epilogue the four partial sums are folded into a single total.

And again the perennial question: could the compiler find such a solution?

Truncation with saturation

Finally, one last example that shows how wide the space of possibilities that opens up really is.

Let's take two consecutive loops at once:

      for (int32_t i3_2 = 0; i3_2 < 64; ++i3_2) {
        int32_t cse_var_8 = (((i1_2 * 320) + (i2_2 * 64)) + i3_2);
        int32_t v_ = ((int32_t*)conv2d_nhwc_let)[cse_var_8];
        int32_t v__1 = (v_) < (127) ? (v_) : (127);
        ((int32_t*)conv2d_nhwc_let)[cse_var_8] = ((v__1) > (-128) ? (v__1) : (-128));
      }
      for (int32_t ax3_3 = 0; ax3_3 < 64; ++ax3_3) {
        int32_t cse_var_9 = (((ax1_3 * 320) + (ax2_3 * 64)) + ax3_3);
        T_cast[cse_var_9] = ((int8_t)((int32_t*)conv2d_nhwc_let)[cse_var_9]);
      }

What is this code trying to say? It is quite obvious: the 32-bit numbers are saturated to the 8-bit range, and then the result is written to an array of 8-bit numbers.
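Folded into a single scalar helper (hypothetical name; the real kernel processes 8 elements at a time), the two loops compute:

#include <stdint.h>

/* Clamp a 32-bit value to the int8_t range and narrow it. */
static inline int8_t sat_i32_to_i8(int32_t v)
{
    if (v > 127)  v = 127;
    if (v < -128) v = -128;
    return (int8_t)v;
}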

Now I won’t even give the entire function, the key part will suffice:

    xr3:0 = q[k5 += 8]; yr3:0 = q[j5 += 8];;

    sr4 = compact r1:0 (si);;
    sr5 = compact r3:2 (si);;

    br6 = compact sr5:4 (si);;

Just three instructions turn 8 words into 8 bytes, with saturation, and without any explicit range checks. I will not even ask the question about the compiler this time…

Someone might ask: so are you going to rewrite all the C code in assembler?
No!

Let me repeat the main idea once more: this is not arbitrary code, it is an implementation of neural network operations, of which there are few, and it is auto-generated by TVM, which guarantees a rigidly regular structure. That is, the constructs we have considered occur again and again; only the sizes, and perhaps the types, change, so a support library written once can be reused many times.

Well, we should not forget the “20/80” rule either: most of the time is spent in a small portion of the code, so it is enough to optimize the heaviest operations.

Results

The reader probably wants to know: was the game worth the candle? Let's take a look…

We will take the neural network layer from which the examples described in the article were taken and measure its execution time (in processor cycles). And then we will add assembly inserts one at a time, noting the improvements that arise at each step.

The results are as follows (the effect of each insertion is the difference from the previous value):

5813245 — source C code, without inserts, that is, a starting point.

5783619 — fast memory copying (copy_i32v8). Not much, since the block size is very small, but still.

1978783 — multiplication with a stride (mul_i8_step_sum_i32_v8). A very significant gain – here the assembler really proved its worth.

1913388 — subtraction of vectors (not described in the article as it is trivial). Not much, but every little helps…

1864005 — addition of vectors (similar).

1614949 — multiplication with shift and rounding (op5_v8). Noticeable improvement.

1557134 — adding a constant to the vector (add_n_i32v8). The result is comparable to adding/subtracting vectors.

1444328 — truncation with saturation (the last example in the article; the function name was not given). A noticeable gain.


What can I say?

On the one hand, one should not throw too many stones at the compiler. We see that although the assembler won everywhere (it would be surprising if it had not), only in one case – and, note, a genuinely difficult one – was the win crushing (almost threefold). The rest look tolerable at the very least.

On the other hand, the total speed-up came to four times! Needless to say, such a result is worth fighting for.

Future plans

As for plans for the future, they are quite clear.

First, we should look not for individual loops but for sequences of them (as in the last example) and combine operations to avoid intermediate copies. We should also take a closer look at the outer loops – perhaps including them in the optimization will bring further gains.

Secondly, in the future, move all this machinery directly into TVM so that its output requires no further transformation.
