How to deal with PAGELATCH contention in highly parallel INSERT workloads

Based on the article “Resolving PAGELATCH Contention on Highly Concurrent INSERT Workloads”.

Authors: Thomas Kejser, Lindsey Allen, Arvind Rao and Michael Thomassy

Contributors and reviewers: Mike Ruthruff, Lubor Kollar, Prem Mehra, Burzin Patel, Michael Thomassy, Mark Souza, Sanjay Mishra, Peter Scharlock, Stuart Ozer, Kun Cheng and Howard Yin

Introduction

We recently ran lab tests at the Microsoft Enterprise Engineering Center using a heavy workload typical of OLTP systems. The goal of the lab was to determine what would happen when the number of processor cores was increased from 64 to 128 while serving an intensive Microsoft SQL Server workload (note: this configuration was based on the Microsoft SQL Server 2008 R2 release). The workload consisted of highly parallel inserts directed at several large tables.

The workload scaled to 128 processor cores, but the wait statistics showed a large number of PAGELATCH_UP and PAGELATCH_EX waits. The average wait duration was tens of milliseconds, and there were a great many such waits. This came as a surprise to us; we had expected durations of no more than a few milliseconds.

This technical note first describes how to diagnose this problem and then shows how table partitioning can be used to solve it.

Problem Diagnosis

When a large number of PAGELATCH waits are observed in sys.dm_os_wait_stats, you can use sys.dm_os_waiting_tasks to determine which session and resource a task is waiting for, for example with this script:

SELECT session_id, wait_type, resource_description 
FROM sys.dm_os_waiting_tasks
WHERE wait_type LIKE 'PAGELATCH%'
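
To quantify the aggregate symptom mentioned above, sys.dm_os_wait_stats can also be queried directly; a minimal sketch:

-- Cumulative PAGELATCH waits since the last restart (or statistics clear);
-- a high count combined with a high average wait time is the symptom
-- described in the introduction.
SELECT wait_type, waiting_tasks_count, wait_time_ms
     , wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'PAGELATCH%'
ORDER BY wait_time_ms DESC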

In the output of sys.dm_os_waiting_tasks, the resource_description column indicates the location of the page that the session is waiting on, in the following format:

<database_id>:<file_id>:<page_id>

For example, the value 5:1:4144 refers to page 4144 in file 1 of database 5.

Based on the values in the resource_description column, you can write a more elaborate script that resolves every page on the wait list to the schema, object, and index it belongs to:

SELECT wt.session_id, wt.wait_type, wt.wait_duration_ms
    , s.name AS schema_name
    , o.name AS object_name
    , i.name AS index_name
FROM sys.dm_os_buffer_descriptors bd
JOIN (
    SELECT *
        -- Positions of the two colons in <database_id>:<file_id>:<page_id>
        , CHARINDEX(':', resource_description) AS file_index
        , CHARINDEX(':', resource_description,
              CHARINDEX(':', resource_description) + 1) AS page_index
        , resource_description AS rd
    FROM sys.dm_os_waiting_tasks wt
    WHERE wait_type LIKE 'PAGELATCH%'
) AS wt
    ON bd.database_id = SUBSTRING(wt.rd, 1, wt.file_index - 1)
    AND bd.file_id = SUBSTRING(wt.rd, wt.file_index + 1,
            wt.page_index - wt.file_index - 1)
    AND bd.page_id = SUBSTRING(wt.rd, wt.page_index + 1,
            LEN(wt.rd) - wt.page_index)
JOIN sys.allocation_units au ON bd.allocation_unit_id = au.allocation_unit_id
JOIN sys.partitions p ON au.container_id = p.partition_id
JOIN sys.indexes i ON p.index_id = i.index_id AND p.object_id = i.object_id
JOIN sys.objects o ON i.object_id = o.object_id
JOIN sys.schemas s ON o.schema_id = s.schema_id

Running this script in the context of the affected database (the catalog views it joins are database-scoped) showed that the pages being waited on belong to the clustered index defined by the primary key of a table with the following structure:

CREATE TABLE HeavyInsert (  
  ID INT PRIMARY KEY CLUSTERED   
  , col1 VARCHAR(50) 
) ON [PRIMARY]
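
Another way to confirm that the latch waits concentrate on one index is sys.dm_db_index_operational_stats, which accumulates page-latch wait counts per index; a minimal sketch, run in the affected database:

-- Page latch waits accumulated per index of the HeavyInsert table;
-- NULL for the index and partition arguments returns all of them.
SELECT i.name AS index_name
     , ios.page_latch_wait_count
     , ios.page_latch_wait_in_ms
FROM sys.dm_db_index_operational_stats(DB_ID(), OBJECT_ID('HeavyInsert'), NULL, NULL) AS ios
JOIN sys.indexes i
    ON ios.object_id = i.object_id AND ios.index_id = i.index_id
ORDER BY ios.page_latch_wait_in_ms DESC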

What is happening, and why waits queue up on the index data pages, is discussed in the remainder of this technical note.

Background

To understand what is going on under a heavy OLTP load like ours, it is important to know how SQL Server inserts a new row into an index. When a new row needs to be inserted, SQL Server follows this modification algorithm:

  1. An entry describing the modification is written to the transaction log.

  2. The B-tree is traversed to find the page where the new row should go.

  3. A PAGELATCH_EX latch is taken on that page to prevent concurrent modification by other threads.

  4. The row is added to the page and, if necessary, the page is marked as “dirty”.

  5. The latch on the page is released.

Eventually, the page will be flushed to disk by the checkpoint or lazy writer process.

If all row insertions are directed at the same page, a queue builds up for that page. Even though a latch is very short-lived, it can cause contention when the workload is highly concurrent. For our client, the first and only column of the index was a monotonically increasing key, so every new insert went to the same page at the end of the B-tree until that page filled up. Workloads that use IDENTITY or other monotonically increasing columns as the primary key can run into the same problem when concurrency is high enough. A quick way to observe the effect is sketched below.
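
The following sketch makes the hot page visible on the HeavyInsert table defined above. It relies on the %%physloc%% virtual column (available since SQL Server 2008) and the undocumented helper sys.fn_PhysLocFormatter, which formats a row's physical location as (file:page:slot), so treat it as a diagnostic aid rather than a supported API:

INSERT INTO HeavyInsert (ID, col1) VALUES (1, 'a'), (2, 'b'), (3, 'c')

-- Each row's physical location, formatted as (file:page:slot).
-- With a monotonically increasing key, all newly inserted rows
-- report the same page: the rightmost page of the B-tree.
SELECT ID, sys.fn_PhysLocFormatter(%%physloc%%) AS row_location
FROM HeavyInsert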

Solution

The problem described above can occur whenever many threads compete for the same resource. The standard solution is to create more of the contended resource; in our case, the contended resource is the last page of the B-tree.

One way to reduce the page contention is to choose a non-monotonically increasing column as the first column of the index. However, for our client this would have required changes at the application layer in the client systems, so we had to find a solution limited to database changes only.

Recall that the contention is on a single page of a single B-tree. There would be less contention if the inserts could be spread over several B-trees while still working with only one table. Fortunately, there is a facility for exactly that: partitioned tables and indexes. A table can be partitioned so that new rows land in multiple partitions.

First, create a partition function and a partition scheme:

CREATE PARTITION FUNCTION pf_hash (INT) AS RANGE LEFT FOR VALUES (0,1,2) 
CREATE PARTITION SCHEME ps_hash AS PARTITION pf_hash ALL TO ([PRIMARY])
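
As a quick sanity check, the built-in $PARTITION function shows which partition each hash value maps to; with RANGE LEFT and boundary values 0, 1 and 2, the hash values 0 through 3 land in partitions 1 through 4:

-- Map each possible hash value (ID % 4) to its partition number
SELECT v.HashID, $PARTITION.pf_hash(v.HashID) AS partition_number
FROM (VALUES (0), (1), (2), (3)) AS v(HashID)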

The example above uses four partitions. The number of partitions required depends on how many processes are actively performing INSERT operations against the table. Partitioning a table over a hash column has a drawback: whenever rows are fetched from the table, all partitions may have to be touched. This means accessing more than one B-tree, i.e. the optimizer cannot discard the other partitions as superfluous unless the query gives it a reason to (more on this below). The extra processor load and the increase in query time argue for keeping the partition count at the minimum at which PAGELATCH contention is no longer observed. In our case, the client's system had enough processor headroom that a small loss of time on SELECT statements was acceptable in exchange for raising INSERT throughput to the required level.

Another difficulty is that an additional column is needed to partition on, i.e. a column whose value distributes the inserts across the four partitions. No such column existed in the original Microsoft Enterprise Engineering Center scenario, but it was easy to create one: since the ID column increases monotonically in increments of one, a very simple hash function can be applied to it:

CREATE TABLE HeavyInsert_Hash(  
  ID INT NOT NULL  
  , col1 VARCHAR(50)  
  , HashID AS ID % 4 PERSISTED NOT NULL)
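
Because ID increases by one with every insert, consecutive rows rotate round-robin through the four hash buckets; a quick illustration:

-- Consecutive IDs hash to 1, 2, 3, 0, 1: each new row lands in a
-- different bucket than its predecessor
SELECT v.ID, v.ID % 4 AS HashID
FROM (VALUES (1), (2), (3), (4), (5)) AS v(ID)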

With the HashID column spreading the inserts round-robin across the four partitions, we create the clustered index as follows:

CREATE UNIQUE CLUSTERED INDEX CIX_Hash 
ON HeavyInsert_Hash (ID, HashID) ON ps_hash(HashID)
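
After the workload has run, it is easy to verify that the inserts really are balanced across the four B-trees:

-- Row count per partition; with the round-robin hash the four
-- counts should stay close to each other
SELECT $PARTITION.pf_hash(HashID) AS partition_number, COUNT(*) AS row_count
FROM HeavyInsert_Hash
GROUP BY $PARTITION.pf_hash(HashID)
ORDER BY partition_number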

By using this partitioned table instead of the original one, we got rid of the PAGELATCH queues and increased the insert speed: the highly concurrent inserts are now balanced across several partitions, landing on multiple pages, each at the end of its own B-tree. Our client improved insert performance by 15 percent and eliminated the large PAGELATCH queue on the single hot page of the table's index, while retaining enough processor headroom to apply the same optimization to another table with a similarly high insert rate.

Strictly speaking, this trick changes the logical design of the table's primary key. However, since the key is merely extended with a hash value computed deterministically from the original key, uniqueness of the composite key still guarantees that no duplicate ID values can occur.

Unique indexes on a single column are a frequent cause of PAGELATCH queues. Even after the clustered index has been fixed, the table may still have a nonclustered index subject to the same problem: typically a unique key on a single column where every insert hits the same page. If other indexes, or indexes on other tables, suffer from the same PAGELATCH problem, the same partitioning trick can be applied to them using the same hash key; a sketch follows.
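
For instance, a nonclustered index on col1 could be placed on the same partition scheme; HashID is added to the index key so the index aligns with the scheme (a hypothetical example, not taken from the original lab):

-- Hypothetical: split this index's insert point across the same
-- four B-trees used by the clustered index
CREATE NONCLUSTERED INDEX IX_Hash_col1
ON HeavyInsert_Hash (col1, HashID) ON ps_hash(HashID)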

It is not always possible to change an application, especially if it is a third-party product. But where the queries can be changed, they can be optimized by adding a filter on the hash column next to the primary key predicate, which lets the optimizer eliminate partitions.

Example: to allow the optimizer to discard the unneeded partitions, a query like this:

 SELECT * FROM HeavyInsert_Hash 
      WHERE ID = 42

can be rewritten like this:

SELECT * FROM HeavyInsert_Hash 
      WHERE ID = 42 AND HashID = 42 % 4
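
The same trick works for parameterized queries, because the hash predicate can be derived from the parameter itself; a minimal sketch:

-- The optimizer can still eliminate partitions: the HashID predicate
-- is computed from the same parameter as the ID predicate
DECLARE @id INT = 42
SELECT * FROM HeavyInsert_Hash
WHERE ID = @id AND HashID = @id % 4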

Partition elimination by hash value costs you nothing, unless you count the few extra bytes that the persisted hash column adds to each clustered index row as a high price to pay.

Additional materials

  1. SQL Server Performance Tuning: Resolving Last Page Insert Contention

  2. Resolve last-page insert PAGELATCH_EX contention in SQL Server
