Ranking Functions: ROW_NUMBER

The four ranking functions: ROW_NUMBER, RANK, DENSE_RANK, and NTILE were introduced in SQL Server 2005 and differ from regular scalar functions in that the result they produce for a row depends on the other rows in the selection. They differ from aggregate functions in that they return only one row for each row in the input, i.e. they don't concatenate a set of strings into one. In this article, we'll look at ROW_NUMBER, the simplest of all ranking functions.

For examples we will use the following data set:

CREATE TABLE T (PK INT IDENTITY, A INT, B INT, C INT)
CREATE UNIQUE CLUSTERED INDEX TPK ON T(PK)
INSERT T VALUES (0, 1, 8)
INSERT T VALUES (0, 3, 6)
INSERT T VALUES (0, 5, 4)
INSERT T VALUES (0, 7, 2)
INSERT T VALUES (0, 9, 0)
INSERT T VALUES (1, 0, 9)
INSERT T VALUES (1, 2, 7)
INSERT T VALUES (1, 4, 5)
INSERT T VALUES (1, 6, 3)
INSERT T VALUES (1, 8, 1)

Let's start with the following simple example:

SELECT *, ROW_NUMBER() OVER (ORDER BY B) AS RowNumber FROM T

This query orders the rows of the table by column B and assigns sequential numbers to those rows (the same as the identity column). The result looks like this:

PK    A     B     C     RowNumber
----- ----- ----- ----- ----------
6     1     0     9     1
1     0     1     8     2
7     1     2     7     3
2     0     3     6     4
8     1     4     5     5
3     0     5     4     6
9     1     6     3     7
4     0     7     2     8
10    1     8     1     9
5     0     9     0     10

This query will have this simple plan:

  |--Sequence Project(DEFINE:([Expr1003]=row_number))
       |--Compute Scalar(DEFINE:([Expr1005]=(1)))
            |--Segment [GROUP BY:()]
                 |--Sort(ORDER BY:([T].[B] ASC))
                      |--Clustered Index Scan(OBJECT:([T].[TPK]))

This plan scans the table, sorts it by column B, and then the Sequence Project statement assigns sequential numbers to each row.

The Segment operator typically breaks adjacent rows into groups of related rows. In our example this is not necessary since all the rows are in the same group. Note that the text version of the plan view for the Segment statement does not display GROUP BY columns; this information is only available for SHOWPLAN_ALL, SHOWPLAN_XML or in the graphical representation of the plan. I added an annotation to the text plan to explicitly show that the Segment statement does not have GROUP BY columns. Now we will look at an example in which we cannot do without the Segment operator.

The Compute Scalar operator is also not needed here and has been removed from such plans since SQL Server 2008.

ROW_NUMBER (and other ranking functions) also support dividing rows into groups and then evaluating the function separately for each group:

SELECT *, ROW_NUMBER() OVER (PARTITION BY A ORDER BY B) AS RowNumber FROM T

Because column A has two values ​​(0 and 1), this query splits the table's rows into two groups and assigns row numbers to each group separately.

PK    A     B     C     RowNumber
----- ----- ----- ----- ----------
1     0     1     8     1
2     0     3     6     2
3     0     5     4     3
4     0     7     2     4
5     0     9     0     5
6     1     0     9     1
7     1     2     7     2
8     1     4     5     3
9     1     6     3     4
10    1     8     1     5

The plan for this query is almost identical to the previous plan:

  |--Sequence Project(DEFINE:([Expr1003]=row_number))
       |--Compute Scalar(DEFINE:([Expr1005]=(1)))
            |--Segment [GROUP BY:([T].[A])]
                 |--Sort(ORDER BY:( [T].[A] ASC, [T].[B] ASC))
                      |--Clustered Index Scan(OBJECT:([T].[TPK]))

Note that this plan sorts by columns A and B, and then the Segment statement groups the rows by column A. Sequence Project assigns sequential numbers to each row as in the first example, but resets the row counter at the beginning of each group. There are some similarities between how grouping works in this plan and how grouping works in a plan with the operator Stream Aggregate.

You can combine multiple ROW_NUMBER functions (or any number of other ranking functions) with different ORDER BY clauses in a single query:

SELECT *,
    ROW_NUMBER() OVER (ORDER BY B) AS RowNumber_B,
    ROW_NUMBER() OVER (ORDER BY C) AS RowNumber_C
FROM T
PK    A     B     C     RowNumber_B  RowNumber_C
----- ----- ----- ----- ------------ ------------
5     0     9     0     10           1
10    1     8     1     9            2
4     0     7     2     8            3
9     1     6     3     7            4
3     0     5     4     6            5
8     1     4     5     5            6
2     0     3     6     4            7
7     1     2     7     3            8
1     0     1     8     2            9
6     1     0     9     1            10

The plan for this query is similar to the plan for the first query, only it is repeated once for each ranking function, with a different sort order for each ORDER BY clause:

  |--Sequence Project(DEFINE:([Expr1004]=row_number))
       |--Compute Scalar(DEFINE:([Expr1008]=(1)))
            |--Segment [GROUP BY:()]
                 |--Sort(ORDER BY:( [T].[C] ASC))
                      |--Sequence Project(DEFINE:([Expr1003]=row_number))
                           |--Compute Scalar(DEFINE:([Expr1006]=(1)))
                                |--Segment [GROUP BY:()]
                                     |--Sort(ORDER BY:( [T].[B] ASC))
                                          |--Clustered Index Scan(OBJECT:([T].[TPK]))

Note that the ORDER BY clause is coupled to its ranking function and specifies the order for it only. Such ORDER BYs do not specify the order in which the rows will be returned by the query. To ensure the required order of the query strings, you need to explicitly add an ORDER BY clause to the query:

SELECT *, ROW_NUMBER() OVER (ORDER BY B) AS RowNumber FROM T ORDER BY B

In this query, the additional ORDER BY clause will not affect the plan because the optimizer will realize that the data is already sorted by column B after the ROW_NUMBER function is evaluated.

Unfortunately, if we have several ROW_NUMBER functions with their own ORDER BYs and the entire query has an ORDER BY, the optimizer will not figure out whether it would be optimal to evaluate the ROW_NUMBER functions in a different order. For example, the following query will perform additional sorting:

SELECT *,
    ROW_NUMBER() OVER (ORDER BY B) AS RowNumber_B,
    ROW_NUMBER() OVER (ORDER BY C) AS RowNumber_C
FROM T
ORDER BY B
  |--Sort(ORDER BY:([T].[B] ASC))
       |--Sequence Project(DEFINE:([Expr1004]=row_number))
            |--Compute Scalar(DEFINE:([Expr1008]=(1)))
                 |--Segment
                      |--Sort(ORDER BY:([T].[C] ASC))
                           |--Sequence Project(DEFINE:([Expr1003]=row_number))
                                |--Compute Scalar(DEFINE:([Expr1006]=(1)))
                                     |--Segment
                                          |--Sort(ORDER BY:([T].[B] ASC))
                                               |--Clustered Index

But if we simply change the order of the two ROW_NUMBER functions in the selection, the additional sorting disappears:

SELECT *,
ROW_NUMBER() OVER (ORDER BY C) AS RowNumber_C,
ROW_NUMBER() OVER (ORDER BY B) AS RowNumber_B
FROM T
ORDER BY B
  |--Sequence Project(DEFINE:([Expr1004]=row_number))
       |--Compute Scalar(DEFINE:([Expr1008]=(1)))
            |--Segment
                 |--Sort(ORDER BY:([T].[B] ASC))
                      |--Sequence Project(DEFINE:([Expr1003]=row_number))
                           |--Compute Scalar(DEFINE:([Expr1006]=(1)))
                                |--Segment
                                     |--Sort(ORDER BY:([T].[C] ASC))
                                          |--Clustered Index Scan(OBJECT:([T].[TPK]))

As with the stream aggregate, SQL Server can exclude sorting for the Sequence Project statement if an index exists:

CREATE INDEX TB ON T(B)

SELECT PK, ROW_NUMBER() OVER (ORDER BY B) AS RowNumber_B FROM T
  |--Sequence Project(DEFINE:([Expr1003]=row_number))
       |--Compute Scalar(DEFINE:([Expr1005]=(1)))
            |--Segment
                 |--Index Scan(OBJECT:([T].[TB]), ORDERED FORWARD)

However, because a query plan with multiple ROW_NUMBERs and multiple ORDER BYs layers multiple Sequence Project statements on top of the lookup, SQL Server can only use one index in such a plan. It turns out that such a request must include sorting. Even if the plan uses an index instead of one of the sorts, if it does not include all the query columns, the optimizer will choose whether to use the index along with Bookmark Lookup or use a clustered index and perform an additional sort.

It should be noted that SQL Server does not use ROW_NUMBER functions without an ORDER BY clause. In general, using a ranking function without an ORDER BY clause is meaningless because the results are not deterministic. However, in some cases you can do without the ORDER BY clause. For example, you can use ROW_NUMBER to assign ID values. In the article Row numbers with nondeterministic order Itzik Ben-Gan discusses similar
problems and solutions.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *