
Critical Section Performance in r2.04

K.S. Bhaskar

TL;DR

Critical section management is perhaps the single most important factor in determining database performance in high-contention environments such as large production systems. In developing r2.04, we invested heavily in the management of critical sections. This is a summary of our work and the results.

This comes with the caveat that critical sections are only one determinant of application throughput. Other factors, such as compute time or user input time, may well be more important to your application. Even within an application, workloads vary: for example, the workload of interest posting in a bank is very different from the workload of processing customer transactions.

What are Critical Sections?

Imagine a bus that just one person can board at a time. How fast the bus can be loaded depends on how long each person takes to board, and how long passes between one person boarding and the next. The space taken up by people waiting to board can also be a consideration where space is limited, such as inside a bus terminal.

If there are only a few people waiting to board, no special organization is needed: they can cluster around the door, and as soon as one person boards, another can follow. If there are more than a few, it is more efficient overall for them to queue, as shown in the picture above, rather than cluster around the door. If there are many people to board, it is best to have barriers to organize the queue. Barriers are especially important for keeping queues organized when space is limited.

Contrast the case of people boarding a bus with that of patients consulting a doctor. Although both are “one at a time” activities, the strategies for managing them are different. Since boarding a bus takes very little time, it makes sense for those waiting to stand outside the bus. Since a doctor spends longer with each patient, one usually makes an appointment, or takes a ticket and waits for one’s number to be called. You don’t make an appointment to board a bus, and in most countries you don’t line up outside the doctor’s office.

A critical section in software is like the door of the bus. It is a section of code that only one process at a time can execute. Contention occurs when multiple processes all need to execute a critical section; it is like many people wanting to board the bus. In the course of developing r2.04, we invested heavily in analyzing and optimizing code, especially code that executes in critical sections. That work is not discussed here; this blog post is about managing critical section contention.

Managing Critical Sections

Managing software critical sections is conceptually like people boarding a bus.

If there are a small number of processes contending for access to a critical section, like a small number of people clustering around the bus door, it is probably most efficient for them to just “spin” and keep trying to get the critical section. This is efficient because there is no organizational overhead, and as soon as the critical section becomes available, a waiting process will get it. The disadvantage of this approach is that a spinning process consumes a CPU. Since a computer has a limited number of CPUs, having more than a few spinning processes prevents processes that don’t need the critical section at that time from doing useful work.
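To make “spin” concrete, here is a minimal, illustrative spinlock in C (not YottaDB code): a process that calls spin_lock() keeps a CPU fully busy until the lock becomes available.

    #include <stdatomic.h>

    /* A minimal, illustrative spinlock; initialize with ATOMIC_FLAG_INIT. */
    typedef struct {
        atomic_flag locked;
    } spinlock_t;

    static void spin_lock(spinlock_t *l)
    {
        /* Each failed test-and-set is one iteration of the "spin": the
           process does no useful work, but acquires the lock the moment
           it becomes available. */
        while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
            ;   /* busy-wait, consuming a CPU */
    }

    static void spin_unlock(spinlock_t *l)
    {
        atomic_flag_clear_explicit(&l->locked, memory_order_release);
    }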

If there are many processes contending for access to a critical section, it makes sense for them to queue. As with people queuing for a bus, this can be a simple queue where the processes organize themselves into a queue, or a queue with barriers that requires the action of an external agent (e.g., someone to erect barriers).

While this spin and queue approach is conceptually simple, the devil is in the details.

Implementing Critical Sections

GT.M / YottaDB

For many years, the upstream GT.M included a builtin critical section management implementation that YottaDB used unchanged. That implementation works as follows; a simplified sketch in C appears after the list.

  • Spinning is like people clustering around the bus door.
    • When a process is unable to get a critical section, it “hard spins” for a certain number of iterations, called the hard spin count, attempting in a tight loop to get the critical section.
    • After the hard spin limit is reached, it “soft spins” for a certain number of iterations, called the soft spin count. The difference between a hard spin and a soft spin is that in the latter the process executes a sched_yield() that relinquishes the CPU, moving it to the back of the run queue. When it reaches the front and gets the CPU, it again tries to acquire the critical section. Because each soft spin iteration’s sched_yield() causes a context switch, soft spins consume operating system resources, and may involve critical sections within the operating system.
  • Queuing is like people queuing for the bus. When a process fails to get the critical section after its hard spins and soft spins, it adds itself to a queue in shared memory. Adding itself to this queue also requires a critical section, albeit a tiny one that does not need separate management. When a process releases a critical section, it wakes up the first process in the queue, which then attempts to get the critical section, in contention with any processes that may have just begun their own attempts; if it does not get the critical section, it goes through its hard spins and soft spins before putting itself back in the queue.
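
Below is a simplified sketch in C of that spin-and-queue pattern. The constants, the lock word convention, and the stand-in queue_wait() are our own illustrative assumptions, not YottaDB’s actual data structures or tunables.

    #include <sched.h>       /* sched_yield() */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <time.h>        /* nanosleep(), used only by the stand-in queue_wait() */

    /* Illustrative values, not YottaDB's actual hard and soft spin counts. */
    #define HARD_SPIN_COUNT 128
    #define SOFT_SPIN_COUNT 128

    /* One atomic attempt to take the lock word (0 = free, 1 = held). */
    static bool try_acquire(atomic_int *lock)
    {
        int expected = 0;
        return atomic_compare_exchange_strong(lock, &expected, 1);
    }

    /* Stand-in for queuing: the real implementation adds the process to a
       queue in shared memory and sleeps until the releasing process wakes it. */
    static void queue_wait(void)
    {
        struct timespec ts = {0, 1000000};   /* 1 ms */
        nanosleep(&ts, NULL);
    }

    void acquire_crit(atomic_int *lock)
    {
        for (;;) {
            /* Hard spin: a tight loop of acquisition attempts. */
            for (int i = 0; i < HARD_SPIN_COUNT; i++)
                if (try_acquire(lock))
                    return;

            /* Soft spin: each iteration relinquishes the CPU with sched_yield(),
               sending the process to the back of the run queue; it retries when
               it is scheduled again.  Every yield is a context switch. */
            for (int i = 0; i < SOFT_SPIN_COUNT; i++) {
                if (try_acquire(lock))
                    return;
                sched_yield();
            }

            /* Queue: wait to be woken, then compete again from the top,
               alongside processes that have just begun their own hard spins. */
            queue_wait();
        }
    }

    void release_crit(atomic_int *lock)
    {
        atomic_store(lock, 0);   /* the real code also wakes the first queued process */
    }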

In practice, this spin-and-queue method works well across most applications and most loads. But it has a theoretical unfairness: a process that reaches the front of the queue can again be pushed behind by processes that have just started their attempts to acquire the critical section (not unlike a queue-jumper who walks straight up to the bus door, pretending not to see the queue). This unfairness can be exacerbated when a system is pushed to its limits, with large numbers of processes contending for the same critical section. In practice, this can mean that while some processes randomly execute faster, others equally randomly execute slower, i.e., there is less consistency.

The r2.04 code base includes a PAUSE opcode in its spin loops to reduce their impact.
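
On x86, the PAUSE opcode hints to the processor that the code is in a spin-wait loop, reducing power consumption and pipeline penalties while spinning. For illustration, a hard spin loop might issue it between acquisition attempts along these lines; the helper names and the lock word convention are ours, not the r2.04 source.

    #include <stdatomic.h>
    #include <stdbool.h>
    #if defined(__x86_64__) || defined(__i386__)
    #include <immintrin.h>   /* _mm_pause() */
    #endif

    /* Issue a spin-wait hint to the processor; a no-op on other architectures. */
    static inline void cpu_relax(void)
    {
    #if defined(__x86_64__) || defined(__i386__)
        _mm_pause();          /* emits the PAUSE opcode */
    #endif
    }

    /* A hard spin loop that pauses between attempts on the lock word. */
    bool hard_spin(atomic_int *lock, int count)
    {
        for (int i = 0; i < count; i++) {
            int expected = 0;
            if (atomic_compare_exchange_strong(lock, &expected, 1))
                return true;   /* acquired */
            cpu_relax();       /* lighten the busy-wait between attempts */
        }
        return false;          /* caller falls back to soft spins / queuing */
    }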

Linux

Linux provides pthread_mutex functions which an application can use to implement critical sections. While pthread_mutex is inherently unfair, it has an option (pthread_mutexattr_setrobust()) whose implementation can provide some degree of fairness, albeit not a guarantee.

Fairness isn’t free. Implementing guaranteed first-in / first-out fairness would be computationally expensive to the point of being unacceptable for a high-throughput application. Practical code for handling contention balances fairness against throughput. While potentially fairer than the traditional YottaDB/GT.M implementation, the pthread_mutex functions are computationally more expensive than the builtin critical section management.
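
For context, this is roughly how a process-shared, robust pthread mutex is set up and used as a critical section. It is a generic sketch of the POSIX API with error handling omitted, not the GT.M/YottaDB code.

    #include <errno.h>       /* EOWNERDEAD */
    #include <pthread.h>

    /* Initialize a mutex that lives in shared memory so that multiple
       processes can use it as a critical section. */
    void init_shared_mutex(pthread_mutex_t *m)
    {
        pthread_mutexattr_t attr;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED); /* usable across processes */
        pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);    /* recoverable if the owner dies */
        pthread_mutex_init(m, &attr);
        pthread_mutexattr_destroy(&attr);
    }

    void enter_crit(pthread_mutex_t *m)
    {
        if (pthread_mutex_lock(m) == EOWNERDEAD) {
            /* The previous owner died holding the lock; the robust option lets
               the new owner clean up and mark the mutex consistent again. */
            pthread_mutex_consistent(m);
        }
    }

    void leave_crit(pthread_mutex_t *m)
    {
        pthread_mutex_unlock(m);
    }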

From r1.24 (V6.3-006) to r1.26 (V6.3-007)

GT.M versions through V6.3-006 used the builtin implementation of critical sections, as did YottaDB releases through r1.24. Starting with V6.3-007 (merged into YottaDB r1.26), GT.M switched to pthread_mutex. While we do not know the motivation for the change, Occam’s Razor suggests that it was to improve fairness.

The r2.04 Critical Section Journey

While no customer or user of YottaDB had expressed any concerns about performance, and although each YottaDB release was slightly faster than the GT.M versions merged into its code base (owing to small performance enhancements we have made that the upstream developers have not adopted), we decided to make performance and scalability a key feature of r2.04.

The Journey Begins

The twin axioms of performance improvement are:

  • Be careful what you measure, because that is what you will improve.
  • If you choose well, you will improve performance in general.

We decided to focus on an M program that computes lengths of 3n+1 sequences, since that is a database-update-intensive workload that can easily be scaled up and down in both problem size and the number of concurrent processes. Running the program with twice as many concurrent processes as CPUs (i.e., with some contention, but not a heavy load), we observed the data in the table below, through the row labeled r2.03-pre.
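
For concreteness, this is what computing the length of a 3n+1 sequence looks like, sketched here in C; the actual benchmark is an M program whose concurrent processes also perform the database updates that create the critical section contention.

    /* Length of the 3n+1 sequence starting at n: the number of steps until
       n reaches 1.  Illustrative only; the benchmark itself is written in M. */
    unsigned long collatz_length(unsigned long long n)
    {
        unsigned long steps = 0;
        while (n > 1) {
            n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
            steps++;
        }
        return steps;
    }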

  • The first column is a GT.M version or YottaDB release. Each YottaDB release is preceded by a GT.M version, the one through which source code is merged into the YottaDB release. r2.03-pre is the YottaDB code under development for r2.04 with V7.1-002 code and our enhancements and fixes merged prior to the critical section contention work.
  • The second column shows an average elapsed time (i.e., less is better) for benchmark runs.
  • The third column compares the elapsed times with GT.M V6.3-006, the last GT.M version released before the switch to pthread_mutexes (i.e., larger – positive – is better); a worked example of the computation appears after the table.
  • The fourth column compares the performance of each YottaDB release with the GT.M version it includes (larger is better).
Build        Elapsed time (milliseconds)   vs. V6.3-006   YottaDB vs. GT.M
V6.3-006     12,576
r1.24        12,279                          2.4%           2.4%
V6.3-007     14,257                        -13.4%
r1.26        14,069                        -11.9%           1.32%
V7.0-005     14,868                        -18.2%
r2.02        16,570                        -31.8%         -11.45%
V7.1-002     15,519                        -23.4%
r2.03-pre    14,956                        -18.9%           3.6%
r2.03-pre+   11,243                         10.6%          24.8%
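
As a worked example of how the comparison columns are computed (the formula is inferred from the figures rather than stated explicitly):

    r1.24 vs. V6.3-006:     (12,576 - 12,279) / 12,576 ≈  2.4%
    V6.3-007 vs. V6.3-006:  (12,576 - 14,257) / 12,576 ≈ -13.4%
    r1.26 vs. V6.3-007:     (14,257 - 14,069) / 14,257 ≈  1.32%

Positive percentages mean the build ran faster (lower elapsed time) than the baseline.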

The last row, labeled r2.03-pre+, is the build after we included changes to code that executes inside critical sections (analogous to reducing the time each person takes to board the bus), but before our changes to manage critical section contention.

(We were somewhat surprised by the r2.02 number, because neither we nor our users have experienced such a slowdown, and no such slowdown was apparent in our testing prior to the release. While it is perhaps an artifact of a specific benchmark on a specific machine, or of some concurrent activity by the operating system – we ran the benchmarks overnight when no one was using the machine – it is nevertheless the number we saw, and as we are objective developers, it is the number we used.)

The “Culprit”

It didn’t take long to discover that the change from the builtin mutex to pthread_mutex from V6.3-006 to V6.3-007 was the cause of the apparent slowdown. When we reverted to the builtin mutex (whose code we carefully examined for speed-up opportunities), we found that performance had been restored!

Credits

  1. Photo of People Boarding a bus at Davenport and Oakwood, Toronto, in 1927 is in the public domain.
  2. Photo of Mango Bus stand Jamshedpur bus terminal by Shahbaz26 released under the Creative Commons Attribution-Share Alike 4.0 International license.


Published on October 09, 2025