The fastest mutexes

Cosmopolitan Libc is well known for its “polyglot fat binary” hack, which lets a single executable run on six operating systems for AMD64/ARM64. You may be surprised to learn that it may also be a better C library for your production workloads. To demonstrate this, let's compare Cosmo's mutex library with what other platforms provide.

We'll write a simple test that spawns 30 threads which each increment the same counter 100,000 times. This stresses how well a mutex implementation holds up under heavy contention. In essence it looks like this (see the section at the bottom of the article for the full source code):

int g_chores;
pthread_mutex_t g_locker = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
  for (int i = 0; i < ITERATIONS; ++i) {
    pthread_mutex_lock(&g_locker);
    ++g_chores;
    pthread_mutex_unlock(&g_locker);
  }
  return 0;
}

Now let's move on to the most interesting part: the results of my tests.

Benchmarks

All times are in microseconds. Wall time is how long the test program takes to run, including the overhead of creating and joining threads. User time is the time the CPU spent in user space, and system time is the time the CPU spent in the kernel. User and system time can exceed wall time because multiple threads run in parallel: for example, 30 threads that each burn one second of CPU across 30 cores produce about one second of wall time but thirty seconds of user time.

The first results I'll show are for Windows, because Mark Waterman published a great mutex benchmark three months ago in which he concluded: “in a highly contended environment, Windows wins with its SRWLOCK.” High-contention workloads are exactly where the differences between mutex implementations show up. Mark was so impressed by Microsoft's SRWLOCK that he recommended Linux and FreeBSD users consider switching to Windows if mutex contention is a problem.

As we can see, Cosmopolitan Libc mutexes are 2.75 times faster than Microsoft's SRWLOCK (previously considered the best of the best) while consuming 18 times less CPU. Cosmopolitan's mutexes are also 65 times faster than Cygwin's, which, like Cosmopolitan, provides a POSIX implementation on Windows. Cygwin's mutexes are so bad that it would be better off using spinlocks for this case.

Now let's move on to Linux, the ruler of all operating systems.

Here we see that Cosmopolitan mutexes:

  • Are 3 times faster than glibc

  • Are 11 times faster than musl libc

  • Consume 42 times less CPU time than glibc

  • Consume 178 times less CPU time than musl libc

Here's what this means in practice. Imagine a task where all threads must serialize on one operation. If you are using Cosmopolitan and look at htop, only one CPU core will appear active, whereas with glibc and musl libc every core on the machine will light up. That's bad news if you run many tasks on the same server: if even one of them hits a mutex bloodbath, it will exhaust all your resources unless you're using Cosmopolitan.

Cosmo is still a new C library, and it has some minor rough edges. But it is improving so quickly that I'm beginning to consider not using it in production an abdication of professional responsibility. The C library is so deeply embedded in the software supply chain, and so much depends on it, that you really don't want it to be a planet killer. If tools this essential and unquestioned are this wasteful, it's no wonder Amazon Cloud makes so much money.

And finally, we have macOS.

On macOS with an ARM64-based M2 microprocessor, Apple Libc edges out Cosmopolitan's mutexes slightly. For reasons I don't yet fully understand, Cosmopolitan's standard mutex implementation doesn't perform as well on this platform; it may have something to do with how the M2 and the XNU kernel interact. So on macOS ARM, Cosmopolitan uses a simpler algorithm based on Ulrich Drepper's paper “Futexes Are Tricky”, which essentially hands all the hard work off to XNU's ulock system calls. That's why the performance is nearly identical to Apple's.
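For the curious, here's a minimal sketch of the algorithm from that paper, written with C11 atomics. The futex_wait() and futex_wake() functions are assumed wrappers over whatever the platform provides (the futex syscall on Linux, ulock on XNU); this illustrates the technique, it is not Cosmopolitan's actual code.

#include <stdatomic.h>

// assumed wrappers over the OS primitive (futex on Linux, ulock on XNU):
void futex_wait(atomic_int *addr, int expect);  // sleep while *addr == expect
void futex_wake(atomic_int *addr, int count);   // wake up to count sleepers

// lock states: 0 = unlocked, 1 = locked, 2 = locked with waiters
void mutex_lock(atomic_int *m) {
  int c = 0;
  // fast path: a single CAS takes an uncontended lock
  if (atomic_compare_exchange_strong_explicit(
          m, &c, 1, memory_order_acquire, memory_order_relaxed))
    return;
  // slow path: advertise that waiters exist by setting state 2, then sleep
  if (c != 2)
    c = atomic_exchange_explicit(m, 2, memory_order_acquire);
  while (c != 0) {
    futex_wait(m, 2);  // the kernel puts us to sleep; no CPU is burned
    c = atomic_exchange_explicit(m, 2, memory_order_acquire);
  }
}

void mutex_unlock(atomic_int *m) {
  // going 1 -> 0 means nobody was waiting; otherwise reset and wake one
  if (atomic_fetch_sub_explicit(m, 1, memory_order_release) != 1) {
    atomic_store_explicit(m, 0, memory_order_release);
    futex_wake(m, 1);
  }
}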

In summary, the benchmarks show that in the best case Cosmopolitan mutexes significantly outperform the alternatives under high contention with small critical sections, and in the worst case they are about as effective.

How I did it

The reason Cosmopolitan mutexes are so efficient is that I used the nsync library. It has only 371 stars on GitHub, but it was written by a brilliant engineer at Google named Mike Burrows. If you don't know who that is, he's the man who built AltaVista, Google's main early competitor. If you're not old enough to remember AltaVista, it was the first really good search engine, and it ran on a single computer.

I really enjoyed integrating nsync into Cosmopolitan, and I even had the opportunity to send changes upstream. For example, I found and fixed a bug in the mutex unlock function that had gone undetected for years. I also sped up nsync mutexes on AARCH64 by 30% over the original version by porting them to standard C11 atomics. I wrote new system integrations for things like futexes, which allow for runtime portability. And finally, I got the library working flawlessly with POSIX thread cancellation.
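To illustrate the flavor of that atomics change (a simplified example I've made up for this article, not the actual upstream diff): C11 operations carry explicit memory orders, which on AARCH64 lets the compiler emit load-acquire/store-release instructions rather than the full barriers implied by older builtin styles.

#include <stdatomic.h>
#include <stdbool.h>

// simplified illustration, not the actual upstream diff: an explicit
// acquire/relaxed CAS lets AARCH64 use ldaxr/stlxr pairs instead of
// full memory barriers
static bool try_acquire(atomic_uint *word, unsigned expected,
                        unsigned desired) {
  return atomic_compare_exchange_weak_explicit(
      word, &expected, desired,
      memory_order_acquire,   // on success, acquire the prior owner's writes
      memory_order_relaxed);  // on failure, no ordering is needed
}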

So how does nsync pull this off? What tricks does it use? Here are my thoughts and analysis (a rough code sketch follows the list):

  • nsync starts with an optimistic CAS (compare-and-swap) so that locking is fast when there is no contention for the resource.

  • When the lock can't be acquired, nsync adds the calling thread to a doubly-linked list of waiters. Each waiting thread gets its own semaphore on its own independent cache line. This matters because once a thread enters the wait state, it no longer touches the memory underlying the lock.

    To understand why this is important, it's worth reading Ulrich Drepper's paper What Every Programmer Should Know About Memory. In it, he describes in detail the cache coherence protocols used by modern microprocessors. These protocols let cores “talk” to each other at a low level by tracking which cache lines each one is using. When multiple cores touch the same lines, significant communication overhead arises inside the processor.

  • nsync enlists the operating system's help via futexes. This is a great abstraction that was invented by Linux and quickly spread to other operating systems: on macOS futexes are called ulocks, and on Windows there is WaitOnAddress(). The only OS supported by Cosmo without futexes is NetBSD, which instead implements POSIX semaphores at the kernel level; unfortunately, each semaphore requires creating a new file descriptor, which is not very convenient.

    The main advantage of futexes and semaphores is that they allow the operating system to put a thread to sleep. This gives nsync the ability to avoid using CPU time when there is no work to be done.

  • nsync avoids starvation using the concept of a “long wait”. If a waiting thread has been woken 30 times and failed to acquire the lock each time, nsync sets a bit on the lock that blocks threads that haven't yet waited their turn from acquiring it. This makes the initial CAS fail for all other threads until the queue drains.

  • nsync speeds up the case we benchmarked (a contended lock with a small critical section) with the concept of a “designated waker”. This bit on the lock word is set when a thread has been woken and is attempting to acquire the lock. In nsync, the unlock function is responsible for waking the next thread in the wait queue; this bit lets the unlocking thread see that a woken thread is already on its way, so there is no need to wake a second one.
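To make those ideas concrete, here is a rough sketch of what the lock word and waiter records might look like. This is my own simplified illustration of the concepts above, not nsync's real data structures:

#include <semaphore.h>
#include <stdatomic.h>
#include <stdbool.h>

// bits packed into the lock word (names and layout are mine):
#define MU_LOCKED      1u  // somebody holds the lock
#define MU_DESIG_WAKER 2u  // a woken thread is retrying; don't wake another
#define MU_LONG_WAIT   4u  // a starved waiter goes next; fast paths must fail

struct waiter {
  struct waiter *prev, *next;  // doubly-linked queue of sleeping threads
  // each waiter sleeps on its own semaphore; _Alignas(64) gives it its
  // own cache line, so sleepers don't bounce the lock's line between cores
  _Alignas(64) sem_t sem;
};

struct mu {
  atomic_uint word;        // the bits above, CASed on the fast path
  struct waiter *waiters;  // wait queue, touched only on the slow path
};

// optimistic fast path: a single CAS takes an uncontended lock; any
// nonzero word (held, or MU_LONG_WAIT set) fails it toward the slow path
static bool mu_try_fast_path(struct mu *mu) {
  unsigned old = 0;
  return atomic_compare_exchange_weak_explicit(
      &mu->word, &old, MU_LOCKED,
      memory_order_acquire, memory_order_relaxed);
}

// the slow path (not shown) enqueues a struct waiter and sem_wait()s on
// it; the unlocker sem_post()s the queue head and sets MU_DESIG_WAKER so
// it won't wake a second thread while the first is still retrying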

To learn more of nsync's secrets, check out the source code: https://github.com/jart/cosmopolitan/blob/master/third_party/nsync/mu.c and https://github.com/jart/cosmopolitan/blob/master/libc/intrin/pthread_mutex_lock.c

Online Proof

If you want a live demo of software built with Cosmopolitan mutexes, try hammering the web server at http://ipv4.games/ with the worst DDoS attack you can muster. It's a game for hackers competing for dominance of the Internet, and you're already playing it, because your IP address has just been claimed for jart. The service runs on a two-core GCE virtual machine and has so far withstood DDoS attacks from botnets of up to 49,131,669 IP addresses. That was possible in large part thanks to the nsync library, which let me move SQL queries to background threads that pass messages to each other. There is still room for improvement, but overall the system works well. You can even watch its status metrics on the /statusz page.

Source code

#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

#define ITERATIONS 100000
#define THREADS    30

int g_chores;
pthread_mutex_t g_locker = PTHREAD_MUTEX_INITIALIZER;

// each thread hammers the shared counter under the global mutex
void *worker(void *arg) {
  for (int i = 0; i < ITERATIONS; ++i) {
    pthread_mutex_lock(&g_locker);
    ++g_chores;
    pthread_mutex_unlock(&g_locker);
  }
  return 0;
}

// subtracts timeval b from timeval a, borrowing microseconds as needed
struct timeval tub(struct timeval a, struct timeval b) {
  a.tv_sec -= b.tv_sec;
  if (a.tv_usec < b.tv_usec) {
    a.tv_usec += 1000000;
    a.tv_sec--;
  }
  a.tv_usec -= b.tv_usec;
  return a;
}

// converts a timeval into a scalar number of microseconds
long tomicros(struct timeval x) {
  return x.tv_sec * 1000000ul + x.tv_usec;
}

int main() {
  struct timeval start;
  gettimeofday(&start, 0);

  // spawn the contending threads and wait for them all to finish
  pthread_t th[THREADS];
  for (int i = 0; i < THREADS; ++i)
    pthread_create(&th[i], 0, worker, 0);
  for (int i = 0; i < THREADS; ++i)
    pthread_join(th[i], 0);
  assert(g_chores == THREADS * ITERATIONS);

  // capture the wall clock, then ask the kernel for our CPU usage
  struct rusage ru;
  struct timeval end;
  gettimeofday(&end, 0);
  getrusage(RUSAGE_SELF, &ru);
  printf("%16ld us real\n"
         "%16ld us user\n"
         "%16ld us sys\n",
         tomicros(tub(end, start)),
         tomicros(ru.ru_utime),
         tomicros(ru.ru_stime));
}
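With the headers included above, the program builds with any POSIX toolchain; for example (the file name is my choice):

cc -O2 -pthread lock_test.c -o lock_test && ./lock_test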

The reason we're interested in high-contention scenarios is that these are the situations where differences between mutex implementations emerge. Uncontended mutexes perform about the same in every implementation, and in such cases you might even prefer a spinlock, which takes only a few lines of code:

#include <stdatomic.h>

void spin_lock(atomic_int *lock) {
  // try to grab the lock with a single atomic swap
  if (atomic_exchange_explicit(lock, 1, memory_order_acquire)) {
    for (;;) {
      // spin on a plain load first, so waiting cores don't keep
      // writing to the cache line and fighting over ownership of it
      for (;;)
        if (!atomic_load_explicit(lock, memory_order_relaxed))
          break;
      // the lock looked free; try to actually take it
      if (!atomic_exchange_explicit(lock, 1, memory_order_acquire))
        break;
    }
  }
}

void spin_unlock(atomic_int *lock) {
  // the release store publishes our critical section to the next owner
  atomic_store_explicit(lock, 0, memory_order_release);
}

Note that a spinlock should only be used when you have no other choice. They are useful in operating system kernels, where strict low-level constraints rule out more elaborate solutions, and they are a useful implementation detail inside nsync's locks. But overall they are bad. My guess is that developers who believe spinlocks are good only ever benchmarked wall time. When evaluating locks you must also consider CPU time, which is why the test program calls getrusage().
