Sorting an array is a very simple task, while sorting it efficiently is very difficult, largely because of that very simplicity.
The point of this apparent paradox is that a complex problem can be solved in many ways, among which one can always try to find a more effective one. A simple problem, by contrast, can usually be solved in only a few ways, and improving on them is far from easy: over decades of work by many people, optimization has reached a truly impressive level.
The most obvious way to speed up array sorting is to sort different parts of it in parallel, but this approach runs into the high overhead of launching parallel processes, and the complexity of the solution is out of proportion to the result obtained. It therefore seems very interesting to use SIMD instructions for sorting: even the "standard" SSE2 can process four streams of 32-bit numbers at once and thus theoretically achieve a fourfold increase in speed.
Below is a variant of sorting an array of sixteen 32-bit unsigned numbers, implemented using SIMD instructions from the SSE4.1 set. The presented code is essentially a demonstration of the algorithm and is not intended for practical use.
To the objection that arrays rarely have exactly this length, I would counter that large arrays can be divided into sub-arrays of the required length, that the algorithm is easily extended to 32 or 64 elements, and that arrays of other lengths can simply be padded with "redundant" elements holding the maximum (or minimum) value; during sorting these collect at one end of the array, where they can easily be ignored.
To the objection that this is a special case covering only unsigned 32-bit numbers, I would counter that SIMD instructions allow entirely similar algorithms to be written for signed integers and single-precision floating-point numbers. For double-precision numbers the gain is no longer as large, but it is still there.
We will assume that the routine takes a single parameter, a pointer to the beginning of the array, which, in accordance with the VECTORCALL calling convention, arrives in the RCX register.
We load all 16 values to be sorted into SIMD registers. This is the first and last read from memory; the only remaining access will be the write of the sorted array. Thus, having fetched the data once, the algorithm returns a fully sorted array, minimizing the number of memory accesses.
movdqa xmm1, [rcx + xmmword * 0]
movdqa xmm2, [rcx + xmmword * 1]
movdqa xmm4, [rcx + xmmword * 2]
movdqa xmm5, [rcx + xmmword * 3]
With literally 6 instructions we combine the number series into eight ordered pairs.
movdqa xmm0, xmm2
pmaxud xmm2, xmm1
pminud xmm1, xmm0
movdqa xmm3, xmm5
pmaxud xmm5, xmm4
pminud xmm4, xmm3
The next step is merging the resulting pairs into tetrads. Finding the smallest and largest elements is easy: just compare the smaller/larger number of each pair with its counterpart from the other pair. Finding the "intermediate" numbers is less obvious, but still quite simple: take the largest of the two smaller elements and the smallest of the two larger elements, then compare and order them. This requires only 9 instructions.
movdqa xmm0, xmm1
pminud xmm0, xmm4
movdqa xmm3, xmm2
pmaxud xmm3, xmm5
pminud xmm2, xmm5
pmaxud xmm1, xmm4
movdqa xmm4, xmm1
pminud xmm1, xmm2
pmaxud xmm2, xmm4
We transpose the resulting matrix for further sorting. I do not claim that the transposition code presented here is the most efficient; perhaps it can be shortened by a couple of cycles, but that is not critical for demonstrating the algorithm.
movdqa xmm4, xmm0
punpckhdq xmm4, xmm1
movdqa xmm5, xmm2
punpckldq xmm5, xmm3
punpckldq xmm0, xmm1
movdqa xmm1, xmm0
punpcklqdq xmm0, xmm5
punpckhqdq xmm1, xmm5
punpckhdq xmm2, xmm3
movdqa xmm3, xmm4
punpckhqdq xmm3, xmm2
punpcklqdq xmm4, xmm2
Now, after the transposition, each register holds four numbers ordered in ascending order. We perform 4 rounds of comparison and exchange, merging the 4 tetrads into 2 octets. It is worth noting that while the first comparison may exchange 4 pairs of numbers, each subsequent comparison performs one exchange fewer, and the last performs only one.
mov eax, 4
@@:
movdqa xmm2, xmm1
pmaxud xmm1, xmm0
pminud xmm0, xmm2
pshufd xmm1, xmm1, 10010011b
movdqa xmm5, xmm4
pmaxud xmm4, xmm3
pminud xmm3, xmm5
pshufd xmm4, xmm4, 10010011b
dec eax
jnz @b
We perform the last series of comparisons, merging the two octets and thereby completing the ordering of the sequence.
mov eax, 8
@@:
movdqa xmm2, xmm3
movdqa xmm5, xmm4
pmaxud xmm3, xmm0
pmaxud xmm4, xmm1
pminud xmm0, xmm2
pminud xmm1, xmm5
pshufd xmm3, xmm3, 10010011b
pshufd xmm4, xmm4, 10010011b
movss xmm5, xmm3
movss xmm3, xmm4
movss xmm4, xmm5
dec eax
jnz @b
Formally, for the algorithm to be considered complete, the result must be written back to memory.
movdqa [rcx + xmmword * 0], xmm0
movdqa [rcx + xmmword * 1], xmm1
movdqa [rcx + xmmword * 2], xmm3
movdqa [rcx + xmmword * 3], xmm4
I hope you found this interesting; I will be glad to hear constructive criticism.
P.S. The efficiency of the algorithm increases significantly if the target machine supports AVX.