According to the article: “Resolving PAGELATCH Contention on Highly Concurrent INSERT Workloads“.
Authors: Thomas Kejser, Lindsey Allen, Arvind Rao and Michael Thomassy
Featuring and reviewed by: Mike Ruthruff, Lubor Kollar, Prem Mehra, Burzin Patel, Michael Thomassy, Mark Souza, Sanjay Mishra, Peter Scharlock, Stuart Ozer, Kun Cheng and Howard Yin
We recently ran lab tests at the Microsoft Enterprise Engineering Center that used the heavy workload that is typical of OLTP systems. The goal of this lab was to determine what would happen when increasing the number of processors from 64 to 128 while serving a Microsoft SQL Server intensive workload (Note: This configuration was based on the Microsoft SQL Server 2008 R2 release). The workload was highly parallelized inserts directed to several large tables.
The workload scaled to 128 processor cores, but there were a lot of PAGELATCH_UP and PAGELATCH_EX brief locks in the wait statistics. The average wait duration was tens of milliseconds, and there were a lot of such waits. Such a number of them turned out to be a surprise for us, it was expected that the duration would not exceed a few milliseconds.
This tech note will first describe how to diagnose this problem and how to use table partitioning to solve this problem.
When a large number of PAGELATCHs are observed in sys.dm_os_wait_stats, you can use sys.dm_os_waiting_tasks to determine the session and resource that the task is waiting for, for example, using this script:
SELECT session_id, wait_type, resource_description FROM sys.dm_os_waiting_tasks WHERE wait_type LIKE 'PAGELATCH%'
The resource_description column indicates the locations of the pages that the session is waiting to access, the location is in this format:
Based on the values in the resource_description column, you can write a rather complex script that will provide a selection of all the pages that are on the waiting list:
SELECT wt.session_id, wt.wait_type, wt.wait_duration_ms , s.name AS schema_name , o.name AS object_name , i.name AS index_name FROM sys.dm_os_buffer_descriptors bd JOIN ( SELECT * , CHARINDEX(':', resource_description) AS file_index , CHARINDEX(':', resource_description , CHARINDEX(':', resource_description)) AS page_index , resource_description AS rd FROM sys.dm_os_waiting_tasks wt WHERE wait_type LIKE 'PAGELATCH%' ) AS wt ON bd.database_id = SUBSTRING(wt.rd, 0, wt.file_index) AND bd.file_id = SUBSTRING(wt.rd, wt.file_index, wt.page_index) AND bd.page_id = SUBSTRING(wt.rd, wt.page_index, LEN(wt.rd)) JOIN sys.allocation_units au ON bd.allocation_unit_id = au.allocation_unit_id JOIN sys.partitions p ON au.container_id = p.partition_id JOIN sys.indexes i ON p.index_id = i.index_id AND p.object_id = i.object_id JOIN sys.objects o ON i.object_id = o.object_id JOIN sys.schemas s ON o.schema_id = s.schema_id
The script showed that the expected pages refer to a clustered index defined by the primary key of a table with the following structure:
CREATE TABLE HeavyInsert ( ID INT PRIMARY KEY CLUSTERED , col1 VARCHAR(50) ) ON [PRIMARY]
What happens, why there is a queue of waits for index data pages – all this will be discussed in this technical note.
To determine what’s going on with our large OLTP load, it’s important to understand how SQL Server inserts a new row into an index. When a new row needs to be inserted into the index, SQL Server will follow the following modification algorithm:
An entry is made in the transaction log that the row has changed.
The B-tree is searched for the location of the page where the new record should go.
A PAGELATCH_EX latch is placed on this page to prevent changes from other threads.
A line is added to the page and, if necessary, this page is marked as “dirty”.
A brief lock is being released from the page.
Eventually, the page will need to be flushed to disk by a checkpoint or write-back process.
If all row insertions are directed to the same page, you can observe the growth of the queue to this page. Even though the latch is very short-lived, it can cause contention when the workload is highly concurrent. For our client, the first and only column in the index was a monotonically increasing key. Because of this, each new insert went to the same page at the end of the B-tree until that page was full. Workloads that use IDENTITY or other columns with incremental values as the primary key can also run into a similar problem if the parallelism is high enough.
Whenever multiple threads access the same resource synchronously, the problem described above can occur. The standard solution is to create more contention resources. In our case, the last page of the B-tree is such a competitive resource.
One way to reduce contention per page is to choose a different non-monotonically increasing column as the first column of the index. However, for our client, this would require changes to be made at the application layer in the client systems. We had to find another solution that could be limited to database changes only.
Remember that the contention is one page in a B-tree, and there would be less contention if it were possible to use multiple B-trees for this, and at the same time work with only one table. Fortunately, there is such an opportunity, it is: Partitioned tables and indexes. A table can be partitioned in such a way that new rows are placed in multiple partitions.
First you need to create a function and a partitioning scheme:
CREATE PARTITION FUNCTION pf_hash (INT) AS RANGE LEFT FOR VALUES (0,1,2) CREATE PARTITION SCHEME ps_hash AS PARTITION pf_hash ALL TO ([PRIMARY])
The example above uses four sections. The number of partitions required depends on the number of active processes that are performing INSERT operations on the table described above. There is some difficulty in partitioning a table with a hash column, for example, that whenever rows are fetched from the table, all partitions will be affected. This means that more than one B-tree will have to be accessed, i.e. there will be no unnecessary sections discarded by the optimizer as superfluous. There will be an additional load on the processors and some increase in processor wait times, which encourages minimizing the number of scheduled sections (there should be a minimum number at which PAGELATCH is not observed). In our case, our client’s system had enough headroom for processor utilization that it was quite possible to allow a small loss of time for SELECT instructions, and at the same time increase the rate of INSERT instructions to the required volumes.
Another difficulty is that you need an additional column on which partitioning will be performed, i.e. based on the value of which the inserts will be distributed among the four sections. There was no such column originally in the Microsoft Enterprise Engineering Center scenario. However, it was easy enough to create. This used the fact that the ID column increases monotonically in increments of one, and a fairly simple hash function can be easily applied here:
CREATE TABLE HeavyInsert_Hash( ID INT NOT NULL , col1 VARCHAR(50) , HashID AS ID % 4 PERSISTED NOT NULL)
With the help of the HashID column, inserts into the four sections will be performed in a loop. We create a clustered index as follows:
CREATE UNIQUE CLUSTERED INDEX CIX_Hash ON HeavyInsert_Hash (ID, HashID) ON ps_hash(HashID)
By using the new partitioned table schema instead of the original table, we were able to get rid of PAGELATCH queues and improve insert speed. This was achieved by balancing the insert between several sections and high parallelism. Insertion occurs over multiple pages, and each page is placed in its own B-tree structure. Our client was able to improve insert performance by 15 percent, and handle a large PAGELATCH queue to a single table index hot page. But at the same time, it was also possible to leave a sufficiently large reserve of processor cycles, which made it possible to make a similar optimization for another table, also with a high insertion rate.
Strictly speaking, the essence of this trick is to optimize the logical scheme of the table’s primary key. However, because the key simply became longer by the hash value of the original key, duplicates for the ID column were avoided.
Unique indexes on a single table column are often the cause of problems with PAGELATCH queues. But even if this problem can be fixed, the table may still have a different, non-clustered index that is subject to the same problem. Typically, the problem is for unique keys on a single column, where each insert hits the same page. If other tables have indexes that are affected by the PAGELATCH problem, you can apply the same partitioning trick to indexes on those tables, using the same hash key as the primary key.
It is not always possible to make changes to an application, especially if it is the product of third party efforts. But if changing queries is possible, their optimization becomes available by adding filtering conditions to them by primary key predicates.
Example: To discard unnecessary sections, you can make the following changes to the script:
SELECT * FROM HeavyInsert_Hash WHERE ID = 42
Which after changes will look like this:
SELECT * FROM HeavyInsert_Hash WHERE ID = 42 AND HashID = 42 % 4
The optimizer’s exclusion of unneeded partitions by hash value will cost you nothing, unless you consider a large price to pay for this increase in one byte per row of the clustered index.