Optimizing access to slice elements in Go

Task

Initially, the task was to build a collage from several 512x512px pictures, which boiled down to copying the blue component of an image.RGBA picture into the palette color indexes of an image.Paletted. For research purposes I simplified it further: copy the blue component of a 512x512px image.RGBA into an image.Paletted of the same size.
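
The benchmarks below share a few package-level fixtures that are not shown in the article: srcRGBA, dstPaletted, imageWidth and imageHeight. A minimal setup could look like this (the 256-entry grayscale palette is my assumption; any 256-color palette works, since the benchmarks store raw byte values as palette indices):

import (
	"image"
	"image/color"
)

const (
	imageWidth  = 512
	imageHeight = 512
)

var (
	srcRGBA     = image.NewRGBA(image.Rect(0, 0, imageWidth, imageHeight))
	dstPaletted *image.Paletted
)

func init() {
	// A 256-entry grayscale palette, so that every uint8 value is a valid index.
	palette := make(color.Palette, 256)
	for i := range palette {
		palette[i] = color.Gray{Y: uint8(i)}
	}
	dstPaletted = image.NewPaletted(image.Rect(0, 0, imageWidth, imageHeight), palette)
}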

The solution using the methods of the image package looks like this:

func BenchmarkCopySTD(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for y := 0; y < imageHeight; y++ {
			dstPixOffset := dstPaletted.PixOffset(0, y)
			for x := 0; x < imageWidth; x++ {
				// RGBA() returns 16-bit channels; for an image.RGBA the low byte
				// equals the stored 8-bit value, so uint8(b) keeps the blue exactly.
				// (This b shadows the *testing.B parameter, which is harmless here.)
				_, _, b, _ := srcRGBA.At(x, y).RGBA()
				dstPaletted.Pix[dstPixOffset+x] = uint8(b)
			}
		}
	}
}

It is slow: most of the time goes into the At() calls, and every call allocates because the returned color is boxed into the color.Color interface. We know the concrete image types, so the obvious solution is to work with the pixels directly:

func BenchmarkCopySliceElements(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for y := 0; y < imageHeight; y++ {
			srcPixOffset := srcRGBA.PixOffset(0, y)
			dstPixOffset := dstPaletted.PixOffset(0, y)
			for x := 0; x < imageWidth; x++ {
				dstPaletted.Pix[dstPixOffset+x] = srcRGBA.Pix[srcPixOffset+x*4+2] // +2 selects the blue byte (R, G, B, A layout)
			}
		}
	}
}

Immediately we get a big boost; note that the STD version also did 262144 allocations per op, exactly one boxed color per pixel of the 512x512 image, while the direct version allocates nothing:

BenchmarkCopySTD-16                	    1590	   3730930 ns/op	 1048600 B/op	  262144 allocs/op
BenchmarkCopySliceElements-16      	   22778	    260128 ns/op	       0 B/op	       0 allocs/op

Then comes the black magic of unsafe.

Getting rid of the bounds check

If you look at the generated assembly, you may be surprised by the number of instructions needed to copy a single slice element to another.

In fact, most of them implement the slice bounds check. It can be disabled globally at compile time (the compiler flag -B, as in go build -gcflags=-B), but here we want to disable it in only one place in the program. To do that, we access the desired element's address directly, computing it from the address of element zero. The rewritten code looks like this:

func BenchmarkCopyUnsafe(b *testing.B) {
	for i := 0; i < b.N; i++ {
		dstAddr := uintptr(unsafe.Pointer(&dstPaletted.Pix[0]))
		srcAddr := uintptr(unsafe.Pointer(&srcRGBA.Pix[0]))
		for y := 0; y < imageHeight; y++ {
			srcPixOffset := srcRGBA.PixOffset(0, y)
			dstPixOffset := dstPaletted.PixOffset(0, y)
			for x := 0; x < imageWidth; x++ {
				// equivalent to: dstPaletted.Pix[dstPixOffset+x] = srcRGBA.Pix[srcPixOffset+x*4+2]
				*(*uint8)((unsafe.Pointer)(dstAddr + uintptr(dstPixOffset+x))) = *(*uint8)((unsafe.Pointer)(srcAddr + uintptr(srcPixOffset+x*4+2)))
			}
		}
	}
}

The dstAddr and srcAddr lines take the addresses of the zero elements of the two Pix slices, and the inner loop then dereferences plain offsets from those addresses. The copying time is roughly cut in half:

BenchmarkCopySTD-16                	    1590	   3730930 ns/op	 1048600 B/op	  262144 allocs/op
BenchmarkCopySliceElements-16      	   22778	    260128 ns/op	       0 B/op	       0 allocs/op
BenchmarkCopyUnsafe-16             	   44539	    131881 ns/op	       0 B/op	       0 allocs/op

The number of CPU instructions is greatly reduced.
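
A side note that is not in the original article: since Go 1.17, the same offset arithmetic can be written with unsafe.Add, which keeps the value as an unsafe.Pointer instead of parking the address in a uintptr variable, so it stays within the unsafe.Pointer rules. A sketch of that variant (the benchmark name is mine):

func BenchmarkCopyUnsafeAdd(b *testing.B) {
	for i := 0; i < b.N; i++ {
		dstBase := unsafe.Pointer(&dstPaletted.Pix[0])
		srcBase := unsafe.Pointer(&srcRGBA.Pix[0])
		for y := 0; y < imageHeight; y++ {
			srcPixOffset := srcRGBA.PixOffset(0, y)
			dstPixOffset := dstPaletted.PixOffset(0, y)
			for x := 0; x < imageWidth; x++ {
				// unsafe.Add(p, n) advances p by n bytes without ever
				// exposing the address as a plain uintptr.
				*(*uint8)(unsafe.Add(dstBase, dstPixOffset+x)) = *(*uint8)(unsafe.Add(srcBase, srcPixOffset+x*4+2))
			}
		}
	}
}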

This unsafe version was submitted to the competition, but the number of instructions still bothered me, and after the competition I started looking for ways to speed it up further.

Getting rid of the offset math

The first thing that catches the eye is the multiplication. Besides, when working with memory directly there is no need to compute the offset at all: we can simply advance the pointers:

func BenchmarkCopyUnsafeV2(b *testing.B) {
	for i := 0; i < b.N; i++ {
		dstAddr := uintptr(unsafe.Pointer(&dstPaletted.Pix[0]))
		srcAddr := uintptr(unsafe.Pointer(&srcRGBA.Pix[0])) + 2
		for y := 0; y < imageHeight; y++ {
			for x := 0; x < imageWidth; x++ {
				*(*uint8)((unsafe.Pointer)(dstAddr)) = *(*uint8)((unsafe.Pointer)(srcAddr))
				dstAddr++
				srcAddr += 4
			}
		}
	}
}

The time was further reduced by 1.5 times:

BenchmarkCopySTD-16                	    1590	   3730930 ns/op	 1048600 B/op	  262144 allocs/op
BenchmarkCopySliceElements-16      	   22778	    260128 ns/op	       0 B/op	       0 allocs/op
BenchmarkCopyUnsafe-16             	   44539	    131881 ns/op	       0 B/op	       0 allocs/op
BenchmarkCopyUnsafeV2-16           	   65779	     88408 ns/op	       0 B/op	       0 allocs/op

Only a handful of CPU instructions remain.

It seems it cannot get any faster, but…

Loop unrolling

At some point I wondered what a C++ compiler would do here with -O3, and that raised the question: why doesn't Go apply loop unrolling on its own? Let's try doing it by hand:

func BenchmarkCopyUnrolledLoop4(b *testing.B) {
	for i := 0; i < b.N; i++ {
		dstAddr := uintptr(unsafe.Pointer(&dstPaletted.Pix[0]))
		srcAddr := uintptr(unsafe.Pointer(&srcRGBA.Pix[0])) + 2 // +2: start at the blue byte
		for y := 0; y < imageHeight; y++ {
			for x := 0; x < imageWidth; x += 4 {
				*(*uint8)((unsafe.Pointer)(dstAddr + 0)) = *(*uint8)((unsafe.Pointer)(srcAddr + 0))
				*(*uint8)((unsafe.Pointer)(dstAddr + 1)) = *(*uint8)((unsafe.Pointer)(srcAddr + 4))
				*(*uint8)((unsafe.Pointer)(dstAddr + 2)) = *(*uint8)((unsafe.Pointer)(srcAddr + 8))
				*(*uint8)((unsafe.Pointer)(dstAddr + 3)) = *(*uint8)((unsafe.Pointer)(srcAddr + 12))

				dstAddr += 4
				srcAddr += 16
			}
		}
	}
}

And it got faster by about another third. I tried different numbers of copies per iteration, including a whole 512px row at once, but the best result, at least on my CPU, came from the variant with 4 copies:

BenchmarkCopySTD-16                	    1590	   3730930 ns/op	 1048600 B/op	  262144 allocs/op
BenchmarkCopySliceElements-16      	   22778	    260128 ns/op	       0 B/op	       0 allocs/op
BenchmarkCopyUnsafe-16             	   44539	    131881 ns/op	       0 B/op	       0 allocs/op
BenchmarkCopyUnsafeV2-16           	   65779	     88408 ns/op	       0 B/op	       0 allocs/op
BenchmarkCopyUnrolledLoop2-16      	   88812	     66479 ns/op	       0 B/op	       0 allocs/op
BenchmarkCopyUnrolledLoop4-16      	   95296	     58995 ns/op	       0 B/op	       0 allocs/op
BenchmarkCopyUnrolledLoop8-16      	   95563	     62937 ns/op	       0 B/op	       0 allocs/op
BenchmarkCopyUnrolledLoop512-16    	   95064	     62808 ns/op	       0 B/op	       0 allocs/op
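
One caveat that is not from the original article: the V2 and unrolled variants drop PixOffset entirely and step through both buffers at a fixed rate, which quietly assumes the rows are contiguous, i.e. Stride equals the row width in bytes. That holds for images allocated with image.NewRGBA / image.NewPaletted over their full rectangle, but not for views into larger buffers. A sketch that re-derives each row's base address from PixOffset keeps the pointer-walking inner loop while tolerating arbitrary strides (again using unsafe.Add, Go 1.17+; like the rest of the code it assumes the image rectangle starts at the origin, and the benchmark name is mine):

func BenchmarkCopyUnsafeV2Stride(b *testing.B) {
	for i := 0; i < b.N; i++ {
		dstBase := unsafe.Pointer(&dstPaletted.Pix[0])
		srcBase := unsafe.Pointer(&srcRGBA.Pix[0])
		for y := 0; y < imageHeight; y++ {
			// Recompute each row's start via the stride-aware PixOffset,
			// then walk the row with plain pointer increments.
			dstP := unsafe.Add(dstBase, dstPaletted.PixOffset(0, y))
			srcP := unsafe.Add(srcBase, srcRGBA.PixOffset(0, y)+2) // +2 selects the blue byte
			for x := 0; x < imageWidth; x++ {
				*(*uint8)(dstP) = *(*uint8)(srcP)
				dstP = unsafe.Add(dstP, 1)
				srcP = unsafe.Add(srcP, 4)
			}
		}
	}
}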

Conclusion

It seems like a good place to stop: a 63x speedup over the version with method calls and almost 4.5x over copying elements by index looks good. The next step would be to play with vector (SIMD) instructions, maybe next time. For those who want to try, I made a task on HighLoad.Fun where you can try your hand at speeding up this code, not only in Go, but also in C++ or Rust.

PS Out of curiosity, I asked ChatGPT how to speed up BenchmarkCopySliceElements. It suggested:

  1. Using goroutines: irrelevant in my case, because the rest of the CPU cores are already busy doing the same work for other pictures; still, a sketch of this route is shown after the list.

  2. Using SIMD, but it did not offer a working solution.

  3. Using a faster image library, but it did not say which one.

  4. After some clarifying questions, it suggested unrolling the loop, which I had already done.
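
For completeness, here is a rough sketch of the goroutine route from suggestion 1, for the case where spare cores actually are available; the helper name is hypothetical, it is not code from the article, and it needs the sync package in addition to image:

func copyBluePar(dst *image.Paletted, src *image.RGBA, workers int) {
	var wg sync.WaitGroup
	rowsPerWorker := (imageHeight + workers - 1) / workers
	for w := 0; w < workers; w++ {
		yFrom := w * rowsPerWorker
		yTo := yFrom + rowsPerWorker
		if yTo > imageHeight {
			yTo = imageHeight
		}
		wg.Add(1)
		go func(yFrom, yTo int) {
			defer wg.Done()
			// Each worker copies its own band of rows; the bands do not
			// overlap, so no synchronization is needed beyond the WaitGroup.
			for y := yFrom; y < yTo; y++ {
				srcOff := src.PixOffset(0, y)
				dstOff := dst.PixOffset(0, y)
				for x := 0; x < imageWidth; x++ {
					dst.Pix[dstOff+x] = src.Pix[srcOff+x*4+2]
				}
			}
		}(yFrom, yTo)
	}
	wg.Wait()
}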
