How to implement fast reentrant locking in Python and why it works
The Python standard library includes a basic synchronization primitive called a reentrant lock (threading.RLock). It allows the same thread to acquire the lock multiple times. A typical implementation uses a mutex or a semaphore underneath, and acquiring those always means a call into the OS kernel.
Using the GIL (Global Interpreter Lock) and an implementation detail of threading.Lock.release, it is possible to create a faster variant that acquires the actual lock object only when there are competing threads. If the threads run sequentially, the variant described in this article has a performance advantage.
The original idea is described on the Python recipes site; later the author published its implementation in a repository on GitHub. h5py, which I have worked with closely, uses its own version. Both fastrlock and h5py are implemented with Cython (an optimizing compiler that translates Python and Cython code into C/C++ and compiles it), but a similar object can be built using only the Python C API, and I will show how below.
If this interests you, let's try to figure it out.
In retrospect it looks pretty simple, but when I first came across it, I was delighted by the elegance of the solution.
GIL as critical section
When I said above that we are working at the Python C API level, I meant that one way or another we are dealing with an extension module; whether it is written in plain C or with Cython does not matter. Any call into an extension module from Python is protected by the GIL until the call completes or we explicitly release it. And this can be used for our purposes.
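To make this concrete, here is a minimal sketch (my own, not from the article) of an extension-module function: its body executes while the GIL is held, and the Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS pair is what "explicitly releases" it:
#include <Python.h>

static PyObject *
demo_method(PyObject *Py_UNUSED(self), PyObject *Py_UNUSED(args))
{
    /* Everything here runs while holding the GIL: no other Python thread
       executes concurrently with this C code, so it acts as a critical
       section for free. */

    Py_BEGIN_ALLOW_THREADS;
    /* The GIL is released here; long-running C code that does not touch
       Python objects may run while other Python threads proceed. */
    Py_END_ALLOW_THREADS;
    /* The GIL is held again from this point until we return. */

    Py_RETURN_NONE;
}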
To implement fast locking, we need an object with the following state:
typedef struct {
    PyObject_HEAD
    PyThread_type_lock lock;     // our object for the real lock
    PyThread_ident_t owner_tid;  // ID of the thread that "acquired" the fast lock
    unsigned long count;         // number of reentrant acquisitions of the fast lock
    unsigned long pending_count; // number of threads waiting to acquire the fast lock (more on this below)
    uint8_t is_locked;           // whether the real lock is acquired
} fastrlockobject;
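The article does not show how such an object is constructed; a plausible sketch, assuming the usual tp_new pattern (PyThread_allocate_lock and the other thread primitives come with Python.h), might look like this:
static PyObject *
fastrlock_new(PyTypeObject *type, PyObject *args, PyObject *kwds)
{
    fastrlockobject *self = (fastrlockobject *) type->tp_alloc(type, 0);
    if (self == NULL) {
        return NULL;
    }
    self->lock = PyThread_allocate_lock();  /* the real, non-reentrant lock */
    if (self->lock == NULL) {
        Py_DECREF(self);
        return PyErr_NoMemory();
    }
    /* no owner, no reentrant acquisitions, no waiters, real lock not held */
    self->owner_tid = 0;
    self->count = 0;
    self->pending_count = 0;
    self->is_locked = 0;
    return (PyObject *) self;
}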
We will spend most of our time under the protection of the GIL, so no additional primitives are needed to synchronize access to this state.
Acquiring a lock in the absence of competing threads
As stated above, if there are no competing threads, we can avoid acquiring the actual lock object. To do this, we store the ID of the thread that acquired our object in the owner_tid field, and if a repeated call comes from the same thread, we simply increment the counter.
if (0 == self->count && 0 == self->pending_count)
{
    self->owner_tid = PyThread_get_thread_ident();
    self->count = 1;
    Py_RETURN_TRUE;
}
if (self->count > 0)
{
    PyThread_ident_t id = PyThread_get_thread_ident();
    if (id == self->owner_tid)
    {
        self->count += 1;
        Py_RETURN_TRUE;
    }
}
If our counter is initially 0, we need to check whether there are any waiting threads that want to grab our object. If there are no such threads, we can safely perform the acquisition. If there are, it means we are in a multi-threaded scenario and someone is waiting on the lock.
Acquiring the lock when multiple threads appear
Since we have a new thread, it will try to grab the real lock object. What does that mean? If the counter count is non-zero, i.e. our object has an "owner" that has not acquired the real lock, then the new thread will try to acquire it. If it succeeds, the real "owner" continues to own the lock (and if it fails, we raise an exception, as indicated by return NULL), but later the new thread will be able to take over the lock when the "owner" releases it.
if (0 == self->is_locked && 0 == self->pending_count)
{
    if (0 == PyThread_acquire_lock(self->lock, WAIT_LOCK))
    {
        return NULL;
    }
    self->is_locked = 1;
}
Note that we set a flag indicating that the real lock object has been acquired.
So at this moment either the new thread has acquired the real lock or, if it was already acquired, the thread simply continues executing. This is the first interesting point in our implementation:
self->pending_count += 1;
int32_t acquired = 1;
Py_BEGIN_ALLOW_THREADS;
acquired = PyThread_acquire_lock(self->lock, WAIT_LOCK);
Py_END_ALLOW_THREADS;
self->pending_count -= 1;
if (0 == acquired)
{
    return NULL;
}
self->is_locked = 1;
self->count = 1;
self->owner_tid = PyThread_get_thread_ident();
Py_RETURN_TRUE;
We increment the wait counter, leave the protection of the GIL, and once again try to acquire the real lock object. Because our lock is not reentrant, this makes the thread block on itself. Releasing the GIL is what gets us out of this situation: other threads can continue doing their work while our new thread stands blocked.
As for what the other threads will do, there are two options:
- If a thread is not the "owner", it will pass all the checks above and block trying to acquire the real lock.
- If the thread is the "owner", it will either continue to successfully acquire our object (since it passes the condition count > 0 and thread_id == owner_tid), or it will release the lock, which sooner or later leads to the real lock being released.
Releasing the lock
Above I said that, in addition to the GIL, this algorithm relies on a feature of threading.Lock.release, which the documentation describes as follows:
Release a lock. This can be called from any thread, not only the thread which has acquired the lock.
When the lock is locked, reset it to unlocked, and return. If any other threads are blocked waiting for the lock to become unlocked, allow exactly one of them to proceed.
As you might have already guessed, the real lock object acquired by a new candidate thread can only be released by the "owner" of the lock, once its counter reaches zero.
if (--self->count == 0)
{
    self->owner_tid = 0;
    if (1 == self->is_locked)
    {
        PyThread_release_lock(self->lock);
        self->is_locked = 0;
    }
}
Py_RETURN_NONE;
At that point, one of the new candidate threads waiting in PyThread_acquire_lock will be unblocked, acquire the lock, and become its new owner.
Below is the full text of these two functions:
static PyObject *
fastrlock_acquire(fastrlockobject *self, PyObject *args, PyObject *kwds)
{
    if (0 == self->count && 0 == self->pending_count)
    {
        self->owner_tid = PyThread_get_thread_ident();
        self->count = 1;
        Py_RETURN_TRUE;
    }
    if (self->count > 0)
    {
        PyThread_ident_t id = PyThread_get_thread_ident();
        if (id == self->owner_tid)
        {
            self->count += 1;
            Py_RETURN_TRUE;
        }
    }
    if (0 == self->is_locked && 0 == self->pending_count)
    {
        if (0 == PyThread_acquire_lock(self->lock, WAIT_LOCK))
        {
            return NULL;
        }
        self->is_locked = 1;
    }
    self->pending_count += 1;
    int32_t acquired = 1;
    Py_BEGIN_ALLOW_THREADS;
    acquired = PyThread_acquire_lock(self->lock, WAIT_LOCK);
    Py_END_ALLOW_THREADS;
    self->pending_count -= 1;
    if (0 == acquired)
    {
        return NULL;
    }
    self->is_locked = 1;
    self->count = 1;
    self->owner_tid = PyThread_get_thread_ident();
    Py_RETURN_TRUE;
}
static PyObject *
fastrlock_release(fastrlockobject *self, PyObject *Py_UNUSED(ignored))
{
    if (--self->count == 0)
    {
        self->owner_tid = 0;
        if (1 == self->is_locked)
        {
            PyThread_release_lock(self->lock);
            self->is_locked = 0;
        }
    }
    Py_RETURN_NONE;
}
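These two functions alone do not yet make a usable Python object; they still have to be registered on a type. A hedged sketch of the boilerplate follows (fastrlock_new is the hypothetical constructor sketched earlier; a real type would also need a tp_dealloc that frees self->lock):
static PyMethodDef fastrlock_methods[] = {
    {"acquire", (PyCFunction) fastrlock_acquire,
     METH_VARARGS | METH_KEYWORDS, "Acquire the fast reentrant lock."},
    {"release", (PyCFunction) fastrlock_release,
     METH_NOARGS, "Release the fast reentrant lock."},
    {NULL, NULL, 0, NULL}  /* sentinel */
};

static PyTypeObject fastrlock_type = {
    PyVarObject_HEAD_INIT(NULL, 0)
    .tp_name = "fastrlock.FastRLock",
    .tp_basicsize = sizeof(fastrlockobject),
    .tp_flags = Py_TPFLAGS_DEFAULT,
    .tp_new = fastrlock_new,          /* hypothetical constructor from above */
    .tp_methods = fastrlock_methods,
};
Context-manager use (the context_manager scenario below) would additionally require __enter__ and __exit__ wrappers around these two functions.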
That's basically it – a fairly simple and elegant solution that takes advantage of the features of Python. But does this all make sense? Let’s see.
Performance Comparison
The author offers the following test scenarios:
- lock_unlock – a sequence of five lock acquisitions and releases.
- reentrant_lock_unlock – five consecutive lock acquisitions followed by the same number of releases.
- mixed_lock_unlock – a mixed order of acquisitions and releases, including reentrant acquisitions.
- lock_unlock_nonblocking – the same as lock_unlock, but in non-blocking mode (in the Python C API version described here I implemented only the blocking mode, to keep things simple; a sketch of the non-blocking path follows this list).
- context_manager – the same as mixed_lock_unlock, but implemented using a context manager.
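As mentioned in the lock_unlock_nonblocking item above, my C API version implements only the blocking mode. For reference, a non-blocking attempt would differ mainly in the flag passed to PyThread_acquire_lock; this fragment is my own sketch, not part of the original implementation:
/* Hypothetical non-blocking path: NOWAIT_LOCK makes PyThread_acquire_lock
   return immediately instead of waiting (returns 1 on success, 0 if busy). */
if (0 == PyThread_acquire_lock(self->lock, NOWAIT_LOCK))
{
    Py_RETURN_FALSE;  /* the lock is busy; report failure instead of blocking */
}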
And two groups of tests:
- sequential – 1 thread is launched.
- threaded 10T – 10 threads are launched.
Testing under Windows 11 x64 (11th Gen Intel® Core™ i5-11600K @ 3.90GHz) with Python 3.8.13, I got the following results:
Test scenario | RLock (3.8.13), msec | FastRLock (0.8.2), msec | FastRLock (h5py 2.10.0), msec
---|---|---|---
sequential (x100000) | | |
lock_unlock | 56.2 | 30.93 | 35.22
reentrant_lock_unlock | 47.09 | 30.43 | 36.11
mixed_lock_unlock | 52.48 | 32.85 | 38.61
lock_unlock_nonblocking | 67.67 | 30.86 | 42.59
context_manager | 215.37 | 137.26 | 179.33
threaded 10T (x1000) | | |
lock_unlock | 1008.61 | 1014.97 | 1019.77
reentrant_lock_unlock | 1003.81 | 1032.95 | 1007.53
mixed_lock_unlock | 1002.43 | 1000.82 | 1002
lock_unlock_nonblocking | 1007.93 | 1015.65 | 1000.99
context_manager | 1030.02 | 1074.24 | 1026.42
As you can see, in the single-threaded test scenarios fastrlock outperforms the standard RLock (by about 15 percent at worst and about 40 percent at best).
According to the author, results may vary depending on the Python version and build optimizations. On versions up to 3.2 this implementation was consistently an order of magnitude faster than the standard one (but I suspect few people care about those versions in 2024).
When might this be needed?
The current trend in software development is toward multi-threaded applications, to increase processing throughput or reduce latency in the user interface. But there are scenarios where we need to protect a resource that cannot operate in a multi-threaded environment. One example is h5py, a Python interface to data files in the HDF5 format. To protect files from corruption and to simplify the library itself, h5py uses fastrlock when implementing its API.
Recommendations
- If you want to use fastrlock, be sure to test it against your Python version and your environment.
- Choose fastrlock if your production scenarios mostly involve single-threaded access (while access from multiple threads remains possible).
Comparison with the standard implementation
Let's look at version 3.8.13 (though you can take any version on the 3.x branch, from 3.2 to 3.13).
We can see that the standard RLock uses approximately the same state as ours; only the count of waiting threads is missing:
typedef struct {
    PyObject_HEAD
    PyThread_type_lock rlock_lock;
    unsigned long rlock_owner;
    unsigned long rlock_count;
    PyObject *in_weakreflist;
} rlockobject;
In the case where the lock is already held by us, the implementation also matches ours:
tid = PyThread_get_thread_ident();
if (self->rlock_count > 0 && tid == self->rlock_owner) {
    unsigned long count = self->rlock_count + 1;
    if (count <= self->rlock_count) {
        PyErr_SetString(PyExc_OverflowError,
                        "Internal lock count overflowed");
        return NULL;
    }
    self->rlock_count = count;
    Py_RETURN_TRUE;
}
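Note that the fast variant above omits this overflow guard; if desired, the same check could be added to its reentrant branch (a sketch of my own, not part of the original code):
/* Hypothetical hardening of the reentrant branch in fastrlock_acquire:
   detect wrap-around of the unsigned counter the way CPython's RLock does. */
if (self->count + 1 <= self->count) {
    PyErr_SetString(PyExc_OverflowError, "Internal lock count overflowed");
    return NULL;
}
self->count += 1;
Py_RETURN_TRUE;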
But if the lock is not held, or is held by another thread, then the real lock object is acquired (or an attempt is made to acquire it), which explains the difference in performance:
r = acquire_timed(self->rlock_lock, timeout);
if (r == PY_LOCK_ACQUIRED) {
    assert(self->rlock_count == 0);
    self->rlock_owner = tid;
    self->rlock_count = 1;
}
else if (r == PY_LOCK_INTR) {
    return NULL;
}
return PyBool_FromLong(r == PY_LOCK_ACQUIRED);
acquire_timed attempts to acquire the real lock using the same GIL-release technique described above.
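The shape of that technique is roughly the following (a paraphrase rather than the exact CPython source; timeout_us stands for the timeout converted to microseconds):
/* Release the GIL around the potentially blocking call, then retake it. */
PyLockStatus r;
Py_BEGIN_ALLOW_THREADS;
r = PyThread_acquire_lock_timed(lock, timeout_us, /* intr_flag = */ 1);
Py_END_ALLOW_THREADS;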
Python 3.14+
What kind of future awaits us?
New versions of Python plan to abandon the GIL; PEP 703 is devoted to the rationale for this decision and also describes examples of scenarios where the GIL really gets in the way. To this end, a commit was made last year – gh-108724: Add PyMutex and _PyParkingLot APIs (gh-109344) · python/cpython@0c89056 – which formed the basis for a more lightweight implementation of locks.
Quite recently (October 14, 2024) another commit landed – gh-125139: use _PyRecursiveMutex in _thread.RLock (#125144) · python/cpython@67f6e08 – which switches RLock to an implementation based on this lightweight mutex. In theory it should be faster than variants built on a kernel object (mutex or semaphore), and then using fastrlock will only make sense on Python versions prior to 3.14.
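For a taste of the new primitive: PyMutex is part of the public C API starting with Python 3.13. It is a one-byte, zero-initialized lock whose uncontended acquire and release never enter the kernel (a minimal sketch):
#include <Python.h>

static PyMutex m = {0};   /* zero-initialized: unlocked */

static void
critical_section(void)
{
    PyMutex_Lock(&m);
    /* ... protected code ... */
    PyMutex_Unlock(&m);
}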
But for some projects this may be a reality for the next five years…
Thank you for your attention.