GENERATE_SERIES in SQL Server 2022 and earlier

I often encounter the need to generate strings on the fly. And not only for simply obtaining a sequence of numbers or dates, but also, for example, to fill a database with test data, create pivot tables with an arbitrary number of columns, extrapolate data, fill in gaps in date or time ranges.

When working with SQL Server 2022 or Azure SQL Database, you can take advantage of the newly introduced T-SQL built-in function GENERATE_SERIES. It's pretty easy to use: you pass the required boundary values start And stopand an optional step step:

SELECT value FROM GENERATE_SERIES(<start>, <stop> [, <step>]);

Some small examples:

/* count to 6 */ 
 SELECT [1-6] = value 
   FROM GENERATE_SERIES(1, 6);
 
 /* count by 5s to 30 */
 SELECT [step 5] = value 
   FROM GENERATE_SERIES(5, 30, 5);
 
 /* count from 10 to 0, backwards, by 2 */
 SELECT [backward] = value
   FROM GENERATE_SERIES(10, 0, -2);
 
 /* get all the days in a range, inclusive */
 DECLARE @start date="20230401",
         @end   date="20230406";
 
 SELECT [days in range]  = DATEADD(DAY, value, @start)
   FROM GENERATE_SERIES(0, DATEDIFF(DAY, @start, @end));

Results:

1-6

(first request)

step 5

(second request)

backward

(third request)

days in range

(fourth request)

1

5

10

2023-04-01

2

10

8

2023-04-02

3

15

6

2023-04-03

4

20

4

2023-04-04

5

25

2

2023-04-05

6

30

0

2023-04-06

The syntax is quite convenient and simple, that's what I'm talking about wrote back in SQL Server 2022 beta, but…

What about older versions of SQL Server?

Of course, this problem had to be solved before, even before SQL Server became SQL Server, so there were always some options. Some of them were strange, mysterious and incomprehensible, with performance issues, and others were even worse. For myself, I chose two methods: one that works starting with SQL Server 2016, the other – with SQL Server 2008. There are solutions for SQL Server 2000, but that's not what we're talking about today.

We implement both options as table functions with the ability to generate up to 4000 values. Of course, you can go beyond this, but exceeding 8001 values ​​entails LOB support in the first solution, which can have unpredictable impact on performance. The second option will be limited to 4096 values, since this is the highest power of 4 less than 8001 (you will see later why this is important).

2016+ STRING_SPLIT + REPLICATE

This technique has appeared in my arsenal relatively recently. I don't remember where I first saw it, but I like its brevity.

CREATE FUNCTION dbo.GenerateSeries_Split
 (
   @start int,
   @stop  int
 )
 RETURNS TABLE WITH SCHEMABINDING
 AS
   RETURN
   (
     SELECT TOP (@stop - @start + 1) 
       value = ROW_NUMBER() OVER (ORDER BY @@SPID) + @start - 1
     FROM STRING_SPLIT(REPLICATE(',', @stop - @start), ',')
     ORDER BY value
   );

The number of values ​​we need to get will be (stop – start + 1).

Using REPLICATE, we generate a string of (stop – start) commas. Then we split the resulting string using STRING_SPLIT and get the required number of (stop – start + 1) empty strings. After that, using ROW_NUMBER(), we number the strings and get a sequence of numbers from 1 to (stop – start + 1). To start the sequence from the required value, we add start to the string number and subtract 1.

To support more than 8001 values, you can change the FROM line as follows:

FROM STRING_SPLIT(REPLICATE(CONVERT(varchar(max),','), @stop - @start), ',')

But we will not talk about this option today, but will consider the original one.

2008+ Cross Join CTE

This method can be used in earlier versions of SQL Server, but it is more arcane. I remember the first time I used this trick in this decision after discovering a really efficient implementation of STRING_SPLIT for SQL Server 2012 by Jonathan Roberts.

CREATE FUNCTION dbo.GenerateSeries_CTEs
 (
   @start int,
   @stop  int
 )
 RETURNS TABLE WITH SCHEMABINDING 
 AS 
   RETURN
   (
     /* could work in 2005 by changing VALUES to a UNION ALL */
     WITH n(n) AS (SELECT 0 FROM (VALUES (0),(0),(0),(0)) n(n)),
      i4096(n) AS (SELECT 0 FROM n a, n b, n c, n d, n e, n f)      
     SELECT TOP (@stop - @start + 1) 
       value = ROW_NUMBER() OVER (ORDER BY @@TRANCOUNT) + @start - 1 
     FROM i4096
     ORDER BY value
   );

There are two CTEs used here: the first generates four rows using the VALUES clause, and the second performs the Cartesian product of these rows with each other as many times as necessary to cover the required range of values ​​(in our case, 4000).

Each Cartesian product gives 4^n rows, where n is the number of times table n is mentioned in FROM. If you specify only one table (SELECT 0 FROM na), there will be 4^1 rows. For two tables, there will be 4^2, or 16. And so on: 4^3 = 64, 4^4 = 256, 4^5 = 1024, and 4^6 = 4096. I'll try to show this in a picture:

Explaining cross join powers of 4

If you only need to support 256 values, you can leave the second CTE with only four tables:

i256(n) AS (SELECT 0 FROM n a, n b, n c, n d)

And if you need more than 4096 values, for example, 16384, add the table again:

 i16K(n) AS (SELECT 0 FROM n a, n b, n c, n d, n e, n f, n g)

Of course, you can use the alternative syntax, but in my opinion it is more complicated and looks less clear.

 i4096(n) AS 
        (
           SELECT 0 FROM n AS n4    CROSS JOIN n AS n16 
              CROSS JOIN n AS n64   CROSS JOIN n AS n256
              CROSS JOIN n AS n1024 CROSS JOIN n AS n4096
           /* ... */
        )

You can use defensive programming by changing the parameter types to smallint or tinyint to avoid surprises from using very large int values. If you pass a large range, there will be no error, but only 4096 values ​​will be generated. Keep in mind that you may need to generate 100 rows in the range 2,000,000,000 (start) to 2,000,000,100 (stop), so limiting the parameters rather than controlling the number of rows generated may be overkill.

A common use case is to use recursive CTEs. I love recursive CTEs and often suggest them, but they are not useful in this context unless you only need 100 values ​​(e.g. for monthly report days). The problem is that to get more than 100 values ​​you would need to specify a MAXRECURSION hint, but you can't put that inside the function, so you have to write it in every query that uses the function. Horrible! So much for encapsulation.

What about performance?

I came up with a simple test – pagination.

Note: This example is contrived and should not be considered as a good solution for data pagination.

Let's create a simple table with 4,000 rows:

 SELECT TOP (4000) rn = IDENTITY(int,1,1),*
 INTO dbo.things FROM sys.all_columns;
 
 CREATE UNIQUE CLUSTERED INDEX cix_things ON dbo.things(rn);

Then three stored procedures with different pagination options.

The first one will use the GenerateSeries_Split function with STRING_SPLIT:

 CREATE OR ALTER PROCEDURE dbo.PaginateCols_Split
   @PageSize int = 100,
   @PageNum  int = 1
 AS
 BEGIN
   SET NOCOUNT ON;
 
   DECLARE @s int = (@PageNum-1) * @PageSize + 1;
   DECLARE @e int = @s + @PageSize - 1;
 
   WITH r(rn) AS
   (
     SELECT TOP (@PageSize) rn = value
     FROM dbo.GenerateSeries_Split(@s, @e)
   )
   SELECT t.* FROM dbo.things AS t 
   INNER JOIN r ON t.rn = r.rn;
 END

The second is the GenerateSeries_CTEs function with the Cartesian product of CTEs:

CREATE OR ALTER PROCEDURE dbo.PaginateCols_CTEs
   @PageSize int = 100,
   @PageNum  int = 1
 AS
 BEGIN
   SET NOCOUNT ON;
 
   DECLARE @s int = (@PageNum-1) * @PageSize + 1;
   DECLARE @e int = @s + @PageSize - 1;
 
   WITH r(rn) AS
   (
     SELECT TOP (@PageSize) rn = value
     FROM dbo.GenerateSeries_CTEs(@s, @e)
   )
   SELECT t.* FROM dbo.things AS t 
   INNER JOIN r ON t.rn = r.rn;
 END

And the third one is the built-in function GENERATE_SERIES:

CREATE OR ALTER PROCEDURE dbo.PaginateCols_GenSeries
   @PageSize int = 100,
   @PageNum  int = 1
 AS
 BEGIN
   SET NOCOUNT ON;
 
   DECLARE @s int = (@PageNum-1) * @PageSize + 1;
   DECLARE @e int = @s + @PageSize - 1;
 
   WITH r(rn) AS
   (
     SELECT TOP (@PageSize) rn = value
     FROM GENERATE_SERIES(@s, @e)
   )
   SELECT t.* FROM dbo.things AS t 
   INNER JOIN r ON t.rn = r.rn;
 END

To run these procedures, we'll create a helper stored procedure that we'll pass the page number @pagenum to. This way, we can test getting a page at the beginning, middle, and end of the data set (pagination performance often degrades as the page number increases). It's unlikely that we'll notice a performance difference if we call this stored procedure once, but if we run it multiple times, we'll hopefully see a difference.

CREATE OR ALTER PROCEDURE dbo.PaginateCols_Wrapper
   @PageNum int = 1
 AS
 BEGIN
   SET NOCOUNT ON;
 
   EXEC dbo.PaginateCols_Split     @PageNum = @PageNum;
   EXEC dbo.PaginateCols_CTEs      @PageNum = @PageNum;
   EXEC dbo.PaginateCols_GenSeries @PageNum = @PageNum;
 END

To collect information about execution time we will use Query Store.

 ALTER DATABASE GenSeries SET QUERY_STORE 
 (
   OPERATION_MODE              = READ_WRITE,
   QUERY_CAPTURE_MODE          = ALL /* Do not do this in production! */
 );

I would like to remind you that you should not use QUERY_CAPTURE_MODE = ALL in production, but in development it is quite acceptable to use it to ensure that all requests are captured.

For automatic launch I used sqlstresscmdspecifying the launch parameters in the GenSeries.json file:

{
   "CollectIoStats": true,
   "CollectTimeStats": true,
   "MainDbConnectionInfo": 
   {
     "Database": "GenSeries",
     "Login": "sa",
     "Password": "$tr0ng_P@$$w0rd",
     "Server": "127.0.0.1,2022"
   },
   "MainQuery": "EXEC dbo.PaginateCols_Wrapper @PageNum = 1;",
   "NumIterations": 10000,
   "NumThreads": 16,
   "ShareDbSettings": true
 }

Here dbo.PaginateCols_Wrapper is set to run 10,000 times in 16 threads. On average it took me about 5 minutes to run.

We launch it using the following command:

sqlstresscmd -s ~/Documents/GenSeries.json

And we get the average execution time from Query Store:

SELECT qt.query_sql_text,
        avg_duration       = AVG(rs.avg_duration/1000.0)
   FROM sys.query_store_query_text AS qt
   INNER JOIN sys.query_store_query AS q 
     ON qt.query_text_id = q.query_text_id
   INNER JOIN sys.query_store_plan  AS p 
     ON q.query_id = p.query_id
   INNER JOIN sys.query_store_runtime_stats AS rs 
     ON p.plan_id = rs.plan_id
   WHERE qt.query_sql_text LIKE N'%dbo.things%'
     AND qt.query_sql_text NOT LIKE N'%sys.query_store%'
   GROUP BY qt.query_sql_text;

Before running sqlstresscmd again, I cleared the data in Query Store with the following command:

ALTER DATABASE GenSeries SET QUERY_STORE CLEAR;

And also changed the MainQuery parameter accordingly to run tests for the page in the middle and at the end.

For lines 1901 – 2000:

"MainQuery": "EXEC dbo.PaginateCols_Wrapper  = 20;",

For lines 3901 – 4000:

As a result, the following results were obtained in milliseconds (click to enlarge):

Line graph showing average duration, in milliseconds, of three different series generation techniques

In my experiment, the STRING_SPLIT option wins, but the new built-in GENERATE_SERIES function is hot on its heels. The CTE option, despite being compatible with earlier versions of SQL Server, lags quite a bit behind.

Of course, I would like to see straight horizontal lines, since the performance of pagination should not depend on the page number, but that is not the topic of today. I plan to return to some of my old pagination techniques in a future article.

Conclusion

I'm very excited about the GENERATE_SERIES function, and hope you can try it out soon! The performance of the STRING_SPLIT approach is slightly better, but both are fairly linear. I'd prefer to use the new syntax if possible. In my experiment, we're talking milliseconds. But as always, it's worth testing on your specific use case.

Again, this experiment was not intended to show which pagination method is better. It is a completely synthetic example, in which one of the tables was continuously numbered, which allowed a join to be made with the generated rowset. It was a demonstration of the implementation options of GENERATE_SERIES.

Additional materials

There are many other options for generating number series, including those from Paul White, Itzik Ben-Gan, and others, featured in the challenge Number series generator challenge. IN fourth part There is an interesting solution from Paul White (dbo.GetNums_SQLkiwi), but it requires a bit of concentration and has version limitations (requires a table with a clustered columnstore index). You should always test this and other options thoroughly on your data and workload, especially when performance is important. Some solutions will only be available on relatively recent versions of SQL Server and/or when you have sufficient freedom in the implementation (for example, the ability to use the CLR).


In conclusion, we invite you to an open lesson “Overview of automated tools for migration from MS SQL Server to PostgreSQL”. There we will learn how to effectively and safely transfer data and a scheme from one platform to another, optimizing the process and minimizing risks. We will learn about the best practices and tools that will help you successfully migrate and avoid data loss.

You can sign up for a lesson on the course page “Practical course on migration from MS SQL Server to PostgreSQL”.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *