How not to lose performance when partitioning in MS SQL

Formulation of the problem

Let’s consider a fairly common situation. There is a huge table with approximately the following structure:

CREATE TABLE SomeOperations (
  OperationId bigint NOT NULL,
  OperDate datetime NOT NULL,
  SomeDataMultiplyColumns nvarchar(max)
)

We are only interested in the fact that the unique key of the records in the table is OperationId and each operation has a date and possibly a time. I conditionally combined a bunch of other fields in the table into one field, SomeDataMultiplyColumns. We will also assume that data does not necessarily enter the table in ascending order of OperDate and may well be added retroactively.

When such a table grows, there is a completely reasonable desire to split it into sections (sections, partitions). It is possible to split by OperationId, but it is very inconvenient. Much more often, such a table is divided by date of operation, for example, by year. That is, partitioning looks something like this (creation of file groups and files is omitted):

CREATE PARTITION FUNCTION PF_SomeOperations_OperDate (datetime)  
AS RANGE RIGHT FOR VALUES (
  '2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01',  
  '2023-05-01', '2023-06-01', '2023-07-01', '2023-08-01',   
  '2023-09-01', '2023-10-01', '2023-11-01', '2023-12-01')
GO
CREATE PARTITION SCHEME PS_SomeOperations
AS PARTITION PF_SomeOperations_OperDate TO (
  SomeOperations_FG_ARH,
  SomeOperations_FG01, SomeOperations_FG02, SomeOperations_FG03,
  SomeOperations_FG04, SomeOperations_FG05, SomeOperations_FG06,
  SomeOperations_FG07, SomeOperations_FG08, SomeOperations_FG09,
  SomeOperations_FG10, SomeOperations_FG11, SomeOperations_FG12)
GO
ALTER TABLE SomeOperations
  ADD CONSTRAINT PK_SomeOperations PRIMARY KEY CLUSTERED (
	OperationId ASC, OperDate ASC
) ON PS_SomeOperations(OperDate)
GO

Now, we can make many indexes on the fields we need, but they must contain OperDate so that the server can understand which sections should be searched.

Description of the problem

If the database architecture initially provided for partitioning the SomeOperations table, then all links to this table would contain not only OperationId, but also OperDate. But since they didn’t think about it in time, it wasn’t done. As a result, we have many references to OperationId without OperDate in a number of tables.

You can perform global database refactoring. And perhaps even necessary. But this takes a long time and does not always make sense.

There is nothing more permanent than temporary

When joining (JOIN) with the SomeOperations table by OperationId, a million of the searched rows were selected and filled into a temporary table, approximately 100 seconds on 8 cores (OPTION(MAXDOP 8)).

When joining (JOIN) with the SomeOperations table by OperationId and OperDate, a million of the searched rows were selected and filled into a temporary table in approximately 5 seconds on 8 cores. The difference is very significant. But where do you get OperDate?

And here we remember that MS SQL supports COLUMNSTORE indexes. Strictly speaking, this is not exactly an index, but a column store that is automatically updated when records are added or modified in a related table. Let’s create the following index:

CREATE NONCLUSTERED COLUMNSTORE INDEX
  IXCS_SomeOperations_OperationId_OperDate
  ON SomeOperations(OperationId, OperDate)
  ON PS_SomeOperations(OperDate)

Now, first we search for the OperDate for each OperationId and save the result in the temporary table #OperationIds. The example assumes that the #OpIds temporary table already contains a million OperationIds that need to be found.

CREATE TABLE #OperationIds(
  OperationID bigint,
  OperDate    datetime)
INSERT #OperationIds (CarOperationID, OperDate)
SELECT I.OperationId, O.OperDate
FROM #OpIds AS I
JOIN SomeOperations AS O ON O.OperationId=I.OperationId
OPTION(MAXDOP 8)

This search using the COLUMNSTORE index took only 5 seconds. Despite the fact that there are one and a half billion rows in the SomeOperations table.

So, instead of 100 seconds, splitting the query into two (searching COLUMNSTORE in 5 seconds and sampling based on the results of this search in 5 seconds), we got an order of magnitude performance gain.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *