features of internal implementation

Atomics in Go are one of the ways to synchronize goroutines. They live in the standard library package sync/atomic. Some articles compare atomics with mutexes, since both are low-level synchronization primitives, and provide benchmarks and speed comparisons, for example, Go: How to Reduce Lock Contention with the Atomic Package.

However, it is important to understand that, while both are low-level synchronization primitives, they are inherently different. First of all, atomics are “low-level atomic memory primitives”, as the documentation puts it, that is, low-level primitives that implement atomic memory operations. In this article I will talk about some features of their internal implementation and how they differ from mutexes.
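To make the comparison concrete, here is a minimal sketch (the counter names and goroutine count are arbitrary, my own example) where the same kind of shared counter is incremented once with an atomic operation and once under a mutex; both variants are correct, which is why such articles compare their speed:

package main

import (
	"sync"
	"sync/atomic"
)

func main() {
	var wg sync.WaitGroup

	// Counter updated with an atomic operation.
	var atomicCounter int64
	// Counter protected by a mutex.
	var mu sync.Mutex
	var mutexCounter int64

	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			atomic.AddInt64(&atomicCounter, 1)

			mu.Lock()
			mutexCounter++
			mu.Unlock()
		}()
	}

	wg.Wait()
	println(atomicCounter, mutexCounter) // both print 100
}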

internals of atomics

Let’s first take an example from the documentation and look at the Swap operation:

The swap operation, implemented by the SwapT functions, is the atomic equivalent of:

old = *addr
*addr = new
return old

SwapT refers to all Swap operations over the various data types. Let’s take SwapInt64 as an example. The function is declared in sync/atomic/doc.go:

func SwapInt64(addr *int64, new int64) (old int64)
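As a small illustration of the Swap semantics (my own sketch, not from the documentation), SwapInt64 can be used to read and reset a counter in a single atomic step, for example when a reporter goroutine drains a metric:

package main

import (
	"sync/atomic"
)

// drain atomically reads the accumulated value and resets it to zero
// in one step, so no increments are lost between the read and the reset.
// (Illustrative sketch; the counter name is made up for the example.)
func drain(counter *int64) int64 {
	return atomic.SwapInt64(counter, 0)
}

func main() {
	var requests int64
	atomic.AddInt64(&requests, 3)

	total := drain(&requests)
	println(total, atomic.LoadInt64(&requests)) // 3 0
}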

The implementation of SwapInt64 itself, however, is not in Go but in assembly, and lives in sync/atomic/asm.s:

TEXT ·SwapInt64(SB),NOSPLIT,$0
	JMP	runtime∕internal∕atomic·Xchg64(SB)

Here we see just a jump (a simple JMP) to another function, Xchg64, and that function lives in the Go runtime. At this level the code is already split by processor architecture.

Here is the code for AMD64 (64-bit x86):

// uint64 Xchg64(ptr *uint64, new uint64)
// Atomically:
//	old := *ptr;
//	*ptr = new;
//	return old;
TEXT ·Xchg64(SB), NOSPLIT, $0-24
	MOVQ	ptr+0(FP), BX
	MOVQ	new+8(FP), AX
	XCHGQ	AX, 0(BX)
	MOVQ	AX, ret+16(FP)
	RET

And this one is for ARM64:

// uint64 Xchg64(ptr *uint64, new uint64)
// Atomically:
//	old := *ptr;
//	*ptr = new;
//	return old;
TEXT ·Xchg64(SB), NOSPLIT, $0-24
	MOVD	ptr+0(FP), R0
	MOVD	new+8(FP), R1
	MOVBU	internal∕cpu·ARM64+const_offsetARM64HasATOMICS(SB), R4
	CBZ 	R4, load_store_loop
	SWPALD	R1, (R0), R2
	MOVD	R2, ret+16(FP)
	RET
load_store_loop:
	LDAXR	(R0), R2
	STLXR	R1, (R0), R3
	CBNZ	R3, load_store_loop
	MOVD	R2, ret+16(FP)
	RET

It’s worth noting here that Go uses its own assembly language. This is done to support compilation for various platforms; you can read more about it, for example, in A Quick Guide to Go’s Assembler. It is important to note that the compiler operates on a semi-abstract instruction set, and instruction selection happens in part after code generation. For example, a MOV can end up as a single machine instruction or be converted into a set of instructions, depending on the processor architecture. The language itself is based on the Plan 9 assembler.

Thus, looking only at the standard library source, we cannot always be sure what the compiled code for our architecture will look like. Let’s see what code is actually compiled for the SwapInt64 operation in question:

package main

import (
	"sync/atomic"
)

func main() {
	var old, new int64 = 1, 10
	println(old, new)
	new = atomic.SwapInt64(&old, new)
	println(old, new)
}

I used IDA64 to disassemble the binary, but I advise you to look at the disassembled code in Compiler Explorer (you can choose different architectures, Go versions, etc.). It is also interesting to look at the compilation steps, AST representations, and applied optimizations in the Go SSA Playground. Now let’s find the main function in the disassembled code:

The code for new = atomic.SwapInt64(&old, new) is located right after the first call to runtime_printunlock and before the next call to runtime_printlock:

mov     ecx, 0Ah
mov     rdx, [rsp+28h+var_10]
xchg    rcx, [rdx]
mov     [rsp+28h+var_18], rcx

We have only four instructions: three mov and one xchg. Further analysis is difficult, because the number of cycles a particular instruction takes may depend on several factors, such as the processor model and architecture, the types of operands (registers, memory), and other conditions (a cache miss, for example). If you are interested in more detailed calculations and details, you can refer to this manual or to the Intel® 64 and IA-32 Architectures Optimization Reference Manual (Appendix D, Instruction Latency and Throughput).

Despite the difficulty of estimating the execution speed of processor instructions, we can see that the assembly code is minimal and in most cases will most likely execute faster than the mutex implementation. Next, we will look at how the mutex is arranged in order to confirm or refute this assumption.
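If you want to check this assumption on your own hardware, a benchmark along these lines can be used (a rough sketch of my own; the names are arbitrary and the absolute numbers will depend on the CPU):

// Save as, e.g., swap_bench_test.go and run: go test -bench=.
package bench

import (
	"sync"
	"sync/atomic"
	"testing"
)

// BenchmarkAtomicSwap measures the bare atomic exchange.
func BenchmarkAtomicSwap(b *testing.B) {
	var v int64
	for i := 0; i < b.N; i++ {
		atomic.SwapInt64(&v, int64(i))
	}
}

// BenchmarkMutexSwap measures the same exchange guarded by a sync.Mutex.
func BenchmarkMutexSwap(b *testing.B) {
	var mu sync.Mutex
	var v int64
	for i := 0; i < b.N; i++ {
		mu.Lock()
		v = int64(i)
		mu.Unlock()
	}
	_ = v
}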

mutex internals

A mutex is a much more complex structure than it might seem at first glance, and much more complex than an atomic. First of all, there are two kinds of mutexes: sync.Mutex and sync.RWMutex. Each of them has Lock and Unlock methods; RWMutex also has an additional RLock method (a read lock). In both types, the Lock method is quite long compared to atomics.

In Mutex, the code of the Lock method covers two locking paths. The first path is taken when we manage to grab an unlocked mutex; the second path is quite long and runs if the mutex is already locked:

// Lock locks m.
// If the lock is already in use, the calling goroutine
// blocks until the mutex is available.
func (m *Mutex) Lock() {
	// Fast path: grab unlocked mutex.
	if atomic.CompareAndSwapInt32(&m.state, 0, mutexLocked) {
		if race.Enabled {
			race.Acquire(unsafe.Pointer(m))
		}
		return
	}
	// Slow path (outlined so that the fast path can be inlined)
	m.lockSlow()
}

As you can see, the first path is quite fast and consists of a single atomic operation. The second path includes quite a lot of code and will not be analyzed in detail here: it also uses atomics in the locking process, but it obviously takes even longer than the first one (the fast path).
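To illustrate just the idea of the fast path (and only the idea — the real Mutex parks waiting goroutines on a semaphore instead of spinning), a CAS-based lock can be sketched like this; the spinLock type here is made up for the example:

package main

import (
	"runtime"
	"sync/atomic"
)

// spinLock is a toy lock that only demonstrates the CAS fast path;
// unlike sync.Mutex it never parks the goroutine, it just spins.
type spinLock struct {
	state int32
}

func (l *spinLock) Lock() {
	// Fast path: one CAS, just like the first branch of Mutex.Lock.
	for !atomic.CompareAndSwapInt32(&l.state, 0, 1) {
		// "Slow path": in the real Mutex this is lockSlow with queueing
		// and semaphores; here we simply yield and retry.
		runtime.Gosched()
	}
}

func (l *spinLock) Unlock() {
	atomic.StoreInt32(&l.state, 0)
}

func main() {
	var l spinLock
	l.Lock()
	l.Unlock()
}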

In RWMutex, the code of the Lock method includes a call to the Lock method of the Mutex structure:

// Lock locks rw for writing.
// If the lock is already locked for reading or writing,
// Lock blocks until the lock is available.
func (rw *RWMutex) Lock() {
	if race.Enabled {
		_ = rw.w.state
		race.Disable()
	}
	// First, resolve competition with other writers.
	rw.w.Lock()
	// Part of the code is omitted here ...
}

And the RWMutex structure itself includes a Mutex as one of its fields:

type RWMutex struct {
	w           Mutex  // held if there are pending writers
	// Part of the code is omitted here ...
}

The RLock method of RWMutex is much faster and contains less code, but it is still no faster than the atomics it relies on:

func (rw *RWMutex) RLock() {
	// Part of the code is omitted here ...
	if atomic.AddInt32(&rw.readerCount, 1) < 0 {
		// A writer is pending, wait for it.
		runtime_SemacquireMutex(&rw.readerSem, false, 0)
	}
	// Part of the code is omitted here ...
}
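For completeness, here is a small usage sketch (my own example) showing where these methods come into play: several readers hold RLock concurrently, while a writer takes the full Lock, which goes through the embedded Mutex shown above:

package main

import (
	"sync"
)

func main() {
	var rw sync.RWMutex
	data := map[string]int{"x": 1}
	var wg sync.WaitGroup

	// Several readers can hold RLock at the same time;
	// each RLock costs roughly the atomic.AddInt32 shown above.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			rw.RLock()
			_ = data["x"]
			rw.RUnlock()
		}()
	}

	// A writer takes the full Lock, which first resolves competition
	// with other writers via the embedded Mutex and then waits for readers.
	wg.Add(1)
	go func() {
		defer wg.Done()
		rw.Lock()
		data["x"] = 2
		rw.Unlock()
	}()

	wg.Wait()
}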

Conclusion

In conclusion, I would like to say once again that performance comparisons of atomics with mutexes in Go will not be in favor of the latter. In this article, we analyzed the internal structure of atomics using SwapInt64 as an example and looked inside Mutex and RWMutex. Knowing the details of their implementation, we can say even without measurements that atomics are faster than mutexes. However, it is worth mentioning that the use of atomics is limited to certain cases (they may not always suit us).
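As one illustration of such a limitation (my own example, not from any of the sources above): a single counter can be updated atomically, but an invariant that spans several fields still needs a mutex, because the two writes below must become visible together:

package main

import "sync"

// Account keeps two fields that must change together;
// updating each with a separate atomic would let readers
// observe an intermediate, inconsistent state.
type Account struct {
	mu      sync.Mutex
	balance int64
	ops     int64
}

func (a *Account) Deposit(amount int64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.balance += amount
	a.ops++
}

func main() {
	var a Account
	a.Deposit(100)
	println(a.balance, a.ops) // 100 1
}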
