Sorting fixed-length arrays using SIMD

Simply sorting an array is a very simple task, while efficient sorting is very difficult, largely because of the simplicity of the task.

The meaning of this logical paradox is that a complex problem can be solved in many ways, among which you can always try to find an even more effective one. While solving a simple problem is usually possible in only a few ways, it is not so easy to improve, since over decades of work by many people, optimization has achieved a truly impressive result.

The most obvious way to optimize the sorting of an array is to sort its different parts in parallel, but this method faces the problem of high overhead costs for running parallel processes and the disproportionate complexity of solving the problem and the result obtained. As a result, it seems very interesting to use the capabilities of SIMD instructions for sorting, which, even for the “standard” SSE2, can theoretically allow sorting the 4th stream of 32-bit numbers and theoretically achieve a fourfold increase in speed.

Below is a variant of sorting an array of 16, 32-bit unsigned numbers, implemented using the SIMD instructions of the SSE4.1 set. The presented code is essentially a demonstrator of the algorithm and does not imply practical application.

The objections that arrays are rarely of a clearly fixed length, I would like to counter with the remark that large arrays can be divided into sub-arrays of the required length, the algorithm can be easily extended to 32 and 64 members, and arrays of non-multiple lengths can be easily padded” redundant” elements having a maximum/minimum value, which, as a result, during sorting will focus on any of the ends of the array where they can be easily ignored.

The objection that this is a special case only for unsigned 32-bit numbers, I would like to counter with the remark that SIMD instructions allow you to write absolutely similar algorithms for signed integers and single-precision floating-point numbers. The efficiency of working with double precision numbers is no longer so significant, but continues to be preserved.

We will assume that the method takes one single parameter, this is the beginning of the array, which, in accordance with the VECTORCALL calling convention, will be located in the ECX register.

We load all 16 values ​​to be sorted into SIMD registers. This is the first and last access to memory for reading, the next access will be used to write the sorted array. Thus, having received the data once, the algorithm will return an already fully sorted array, minimizing the number of memory accesses.

	movdqa xmm1, [rcx + xmmword * 0]
	movdqa xmm2, [rcx + xmmword * 1]
	movdqa xmm4, [rcx + xmmword * 2]
	movdqa xmm5, [rcx + xmmword * 3]

We sort the array with literally 6 commands, combining the number series into eight ordered pairs.

	movdqa xmm0, xmm2
	pmaxud xmm2, xmm1
	pminud xmm1, xmm0
	movdqa xmm3, xmm5
	pmaxud xmm5, xmm4
	pminud xmm4, xmm3

The next step is sorting the resulting pairs into tetrads, and if finding the smallest / largest is not difficult, you just need to compare the smallest / largest number from each pair with the same number from the other pair, then finding “intermediate” numbers is not so obvious, but also quite simply. It is necessary to find the smallest of the largest and the largest of the smallest, and then compare and order them. This requires only 9 instructions.

	movdqa xmm0, xmm1
	pminud xmm0, xmm4
	movdqa xmm3, xmm2
	pmaxud xmm3, xmm5

	pminud xmm2, xmm5
	pmaxud xmm1, xmm4
	movdqa xmm4, xmm1
	pminud xmm1, xmm2
	pmaxud xmm2, xmm4

We transpose the resulting matrix for further sorting. I do not presume to assert that the presented transposition code is the most efficient, perhaps it can be shortened by a couple of cycles, but this is not critical for demonstrating the algorithm.

	movdqa xmm4, xmm0
	punpckhdq xmm4, xmm1
	movdqa xmm5, xmm2
	punpckldq xmm5, xmm3
	punpckldq  xmm0, xmm1
	movdqa     xmm1, xmm0
	punpcklqdq xmm0, xmm5
	punpckhqdq xmm1, xmm5
	punpckhdq  xmm2, xmm3
	movdqa     xmm3, xmm4
	punpckhqdq xmm3, xmm2
	punpcklqdq xmm4, xmm2

Now, after transposition, in each register there are four numbers ordered in ascending order. We will make 4 comparisons and exchanges, during which we will unite 4 tetrads into 2 octaves. It is worth noting that if in the first comparison there is an exchange of 4 pairs of numbers, then in subsequent comparisons the number of exchanges decreases by one each time and in the last comparison there is only one exchange.

	mov eax, 4
	movdqa xmm2, xmm1
	pmaxud xmm1, xmm0
	pminud xmm0, xmm2
	pshufd xmm1, xmm1, 10010011b
	movdqa xmm5, xmm4
	pmaxud xmm4, xmm3
	pminud xmm3, xmm5
	pshufd xmm4, xmm4, 10010011b
	dec eax
jnz @b

We perform the last series of comparisons, during which we finally arrange the numerical sequence by comparing two octaves.

	mov eax, 8
	movdqa xmm2, xmm3
	movdqa xmm5, xmm4
	pmaxud xmm3, xmm0
	pmaxud xmm4, xmm1
	pminud xmm0, xmm2
	pminud xmm1, xmm5
	pshufd xmm3, xmm3, 10010011b
	pshufd xmm4, xmm4, 10010011b
	movss  xmm5, xmm3
	movss  xmm3, xmm4
	movss  xmm4, xmm5
	dec eax
jnz @b

Formally, in order for the algorithm to be considered complete, it is necessary to write the received data to memory.

	movdqa [rcx + xmmword * 0], xmm0
	movdqa [rcx + xmmword * 1], xmm1
	movdqa [rcx + xmmword * 2], xmm3
	movdqa [rcx + xmmword * 3], xmm4

I hope you were interested, I will be glad to hear interesting, constructive criticism.

PS The efficiency of the algorithm increases significantly if the target machine has AVX.

Similar Posts

Leave a Reply