Sorting fixed-length arrays using SIMD
Simply sorting an array is a very simple task, while efficient sorting is very difficult, largely because of the simplicity of the task.
The meaning of this logical paradox is that a complex problem can be solved in many ways, among which you can always try to find an even more effective one. While solving a simple problem is usually possible in only a few ways, it is not so easy to improve, since over decades of work by many people, optimization has achieved a truly impressive result.
The most obvious way to optimize the sorting of an array is to sort its different parts in parallel, but this method faces the problem of high overhead costs for running parallel processes and the disproportionate complexity of solving the problem and the result obtained. As a result, it seems very interesting to use the capabilities of SIMD instructions for sorting, which, even for the “standard” SSE2, can theoretically allow sorting the 4th stream of 32-bit numbers and theoretically achieve a fourfold increase in speed.
Below is a variant of sorting an array of 16, 32-bit unsigned numbers, implemented using the SIMD instructions of the SSE4.1 set. The presented code is essentially a demonstrator of the algorithm and does not imply practical application.
The objections that arrays are rarely of a clearly fixed length, I would like to counter with the remark that large arrays can be divided into sub-arrays of the required length, the algorithm can be easily extended to 32 and 64 members, and arrays of non-multiple lengths can be easily padded” redundant” elements having a maximum/minimum value, which, as a result, during sorting will focus on any of the ends of the array where they can be easily ignored.
The objection that this is a special case only for unsigned 32-bit numbers, I would like to counter with the remark that SIMD instructions allow you to write absolutely similar algorithms for signed integers and single-precision floating-point numbers. The efficiency of working with double precision numbers is no longer so significant, but continues to be preserved.
We will assume that the method takes one single parameter, this is the beginning of the array, which, in accordance with the VECTORCALL calling convention, will be located in the ECX register.
We load all 16 values to be sorted into SIMD registers. This is the first and last access to memory for reading, the next access will be used to write the sorted array. Thus, having received the data once, the algorithm will return an already fully sorted array, minimizing the number of memory accesses.
movdqa xmm1, [rcx + xmmword * 0]
movdqa xmm2, [rcx + xmmword * 1]
movdqa xmm4, [rcx + xmmword * 2]
movdqa xmm5, [rcx + xmmword * 3]
We sort the array with literally 6 commands, combining the number series into eight ordered pairs.
movdqa xmm0, xmm2
pmaxud xmm2, xmm1
pminud xmm1, xmm0
movdqa xmm3, xmm5
pmaxud xmm5, xmm4
pminud xmm4, xmm3
The next step is sorting the resulting pairs into tetrads, and if finding the smallest / largest is not difficult, you just need to compare the smallest / largest number from each pair with the same number from the other pair, then finding “intermediate” numbers is not so obvious, but also quite simply. It is necessary to find the smallest of the largest and the largest of the smallest, and then compare and order them. This requires only 9 instructions.
movdqa xmm0, xmm1
pminud xmm0, xmm4
movdqa xmm3, xmm2
pmaxud xmm3, xmm5
pminud xmm2, xmm5
pmaxud xmm1, xmm4
movdqa xmm4, xmm1
pminud xmm1, xmm2
pmaxud xmm2, xmm4
We transpose the resulting matrix for further sorting. I do not presume to assert that the presented transposition code is the most efficient, perhaps it can be shortened by a couple of cycles, but this is not critical for demonstrating the algorithm.
movdqa xmm4, xmm0
punpckhdq xmm4, xmm1
movdqa xmm5, xmm2
punpckldq xmm5, xmm3
punpckldq xmm0, xmm1
movdqa xmm1, xmm0
punpcklqdq xmm0, xmm5
punpckhqdq xmm1, xmm5
punpckhdq xmm2, xmm3
movdqa xmm3, xmm4
punpckhqdq xmm3, xmm2
punpcklqdq xmm4, xmm2
Now, after transposition, in each register there are four numbers ordered in ascending order. We will make 4 comparisons and exchanges, during which we will unite 4 tetrads into 2 octaves. It is worth noting that if in the first comparison there is an exchange of 4 pairs of numbers, then in subsequent comparisons the number of exchanges decreases by one each time and in the last comparison there is only one exchange.
mov eax, 4
@@:
movdqa xmm2, xmm1
pmaxud xmm1, xmm0
pminud xmm0, xmm2
pshufd xmm1, xmm1, 10010011b
movdqa xmm5, xmm4
pmaxud xmm4, xmm3
pminud xmm3, xmm5
pshufd xmm4, xmm4, 10010011b
dec eax
jnz @b
We perform the last series of comparisons, during which we finally arrange the numerical sequence by comparing two octaves.
mov eax, 8
@@:
movdqa xmm2, xmm3
movdqa xmm5, xmm4
pmaxud xmm3, xmm0
pmaxud xmm4, xmm1
pminud xmm0, xmm2
pminud xmm1, xmm5
pshufd xmm3, xmm3, 10010011b
pshufd xmm4, xmm4, 10010011b
movss xmm5, xmm3
movss xmm3, xmm4
movss xmm4, xmm5
dec eax
jnz @b
Formally, in order for the algorithm to be considered complete, it is necessary to write the received data to memory.
movdqa [rcx + xmmword * 0], xmm0
movdqa [rcx + xmmword * 1], xmm1
movdqa [rcx + xmmword * 2], xmm3
movdqa [rcx + xmmword * 3], xmm4
I hope you were interested, I will be glad to hear interesting, constructive criticism.
PS The efficiency of the algorithm increases significantly if the target machine has AVX.