RANK, DENSE_RANK, and NTILE

IN previous article function was discussed ROW_NUMBER. Now we will look at other ranking functions: RANK, DENSE_RANK And NTILE. Let's start with RANK and DENSE_RANK. These functions are similar in functionality and implementation to ROW_NUMBER. The difference is that ROW_NUMBER assigns a unique (increasing) value to each row, regardless of relationships in the ORDER BY values, while the RANK and DENSE_RANK functions assign the same value to rows that have the same ORDER BY value. The difference between the RANK and DENSE_RANK functions is how values ​​are assigned to rows. The easiest way to illustrate the difference between all these functions is with a simple example:

CREATE TABLE T (PK INT IDENTITY, A INT, B INT, C INT)
CREATE UNIQUE CLUSTERED INDEX TPK ON T(PK)
INSERT T VALUES (0, 1, 6)
INSERT T VALUES (0, 1, 4)
INSERT T VALUES (0, 3, 2)
INSERT T VALUES (0, 3, 0)
INSERT T VALUES (1, 0, 7)
INSERT T VALUES (1, 0, 5)
INSERT T VALUES (0, 2, 3)
INSERT T VALUES (0, 2, 1)
SELECT *,
    ROW_NUMBER() OVER (ORDER BY B) AS RowNumber,
    RANK() OVER (ORDER BY B) AS Rank,
    DENSE_RANK() OVER (ORDER BY B) AS DenseRank
FROM T
PK    A     B     C     RowNumber  Rank       DenseRank
----- ----- ----- ----- ---------- ---------- ----------
5     1     0     7     1          1          1
6     1     0     5     2          1          1
1     0     1     6     3          3          2
2     0     1     4     4          3          2
7     0     2     3     5          5          3
8     0     2     1     6          5          3
3     0     3     2     7          7          4
4     0     3     0     8          7          4

Note that the ROW_NUMBER function ignores duplicate values ​​in column B and assigns unique values ​​from 1 to 8 to all rows. The RANK and DENSE_RANK functions assign the same value to each duplicate row. Moreover, the RANK function counts duplicate rows, even if it assigns the same value to the repeated row, while the DENSE_RANK function does not take into account duplicate rows. For example, the RANK and DENSE_RANK functions assign a rank of 1 to the first two rows, but the RANK function assigns a rank of 3 to the third row (because it is the third row), and the DENSE_RANK function assigns a rank of 2 to the third row (because it contains the second unique value for column B ). As you can see, the maximum value returned by the DENSE_RANK function is equal to the number of unique values ​​in column B.

The query plan for all these functions is the same, and since they all use the same PARTITION BY and ORDER BY clauses, a single Sequence Project statement can evaluate all functions at once:

**       |--Segment [GROUP BY:([T].[B])]
**            |--Segment [GROUP BY:()]
                 |--Sort(ORDER BY:([T].[B] ASC))
                      |--Clustered Index Scan(OBJECT:([T].[TPK]))

The only significant difference between this plan and the plan with one ROW_NUMBER is the additional Segment operator with grouping by column B. Its purpose is to detect relationships and force the Sequence Project operator to output the same value when new values ​​are generated by the RANK and DENSE_RANK functions.

All recipes from the previous article about ROW_NUMBER (using an index to avoid sorting or the impact of mixing ranking functions with different PARTITION BY and/or ORDER BY clauses) also applies to the RANK and DENSE_RANK functions, so we will not repeat them.

Now let's look at the NTILE function. This function splits the data set into N groups of equal size. To determine how many rows belong to each group, SQL Server must first determine the total number of rows in the set. If the NTILE function includes a PARTITION BY clause, SQL Server must calculate the number of rows in each group separately. Knowing the number of rows in each group, we can write the NTILE function as:

NTILE(N) := (N * (ROW_NUMBER() - 1) / COUNT(*)) + 1

where COUNT

— the number of lines in each group. This equation uses -1 and +1 because the ROW_NUMBER and NTILE functions are 1-based, not 0-based.

SELECT *, COUNT(*) OVER (PARTITION BY A) AS Cnt FROM T
PK    A     B     C     Cnt
----- ----- ----- ----- ----------
1     0     1     6     6
2     0     1     4     6
3     0     3     2     6
4     0     3     0     6
7     0     2     3     6
8     0     2     1     6
5     1     0     7     2
6     1     0     5     2

We already know how to evaluate the ROW_NUMBER function, so we're wondering how to evaluate the COUNT function

  |--Nested Loops(Inner Join)
       |--Table Spool
       |    |--Segment [GROUP BY:([T].[A])]
       |         |--Sort(ORDER BY:([T].[A] ASC))
       |              |--Clustered Index Scan(OBJECT:([T].[TPK]))
       |--Nested Loops(Inner Join, WHERE:((1)))
            |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1005],0)))
            |    |--Stream Aggregate(DEFINE:([Expr1005]=Count(*)))
            |         |--Table Spool
            |--Table Spool

. It turns out that SQL Server uses the OVER clause, which we've already seen in ranking functions, and in almost all aggregate functions. This type of aggregate function is sometimes called a “windowed aggregate function”, which instead of returning one row for each group, returns the same aggregate result for each row in the group. Let's look at this with an example: Unlike a scalar aggregate query, which can only return aggregated function values, or a GROUP BY aggregate query, which can only return GROUP BY columns and aggregated values, this query returns all columns of a table. Moreover, since each row in the group receives the same result, the ORDER BY clause is not supported in this scenario. Now let's look at the plan of this query:Although this plan is more complex than a regular aggregate plan, it is actually simple too. The Sort and Segment operators split rows into groups in the same way that a stream aggregate does. A Table Spool over a Segment statement is a special type of subexpression buffer commonly called a Segment Spool. It copies all the rows of one group, and then when the Segment statement starts a new group, it sends the row to the topmost

Nested Loops

, which performs its iteration. At this stage, Table Spool produces rows from the current group, and Stream Aggregate counts them. Finally, the Stream Aggregate returns a counter to the lower Nested Loops, which connects this counter to rows from the current group – Table Spool outputs these rows a second time.

SELECT *, NTILE(2) OVER (PARTITION BY A ORDER BY B) AS NTile FROM T

Note that this plan requires a Segment Spool because there is no way to know in advance how many rows will be in the same group as the current row without looking through the entire group, and then once you've counted the rows you need to somehow go back and apply that counter to each of the original lines. Spool here gives you the opportunity to go back. A typical aggregate query with a GROUP BY clause does not have these problems because after evaluating the aggregate function (be it COUNT or any other function), it simply prints the result without returning to the original rows.

PK    A     B     C     NTile
----- ----- ----- ----- ------
1     0     1     6     1
2     0     1     4     1
7     0     2     3     1
8     0     2     1     2
3     0     3     2     2
4     0     3     0     2
5     1     0     7     1
6     1     0     5     2

Now let's look at a simple example with the calculation of the NTILE function. The query below groups the rows by column A and then, for each group, calculates which rows are below the median (assigned 1) and which rows are above the median (assigned 2):

**  |--Sequence Project(DEFINE:([Expr1003]=ntile))
       |--Compute Scalar(DEFINE:([Expr1007]=(1)))
            |--Segment [GROUP BY:([T].[A])]
**                 |--Nested Loops(Inner Join)
                      |--Table Spool
                      |    |--Segment [GROUP BY:([T].[A])]
                      |         |--Sort(ORDER BY:([T].[A] ASC, [T].[B] ASC))
                      |              |--Clustered Index Scan(OBJECT:([T].[TPK]))
                      |--Nested Loops(Inner Join, WHERE:((1)))
                           |--Stream Aggregate(DEFINE:([Expr1004]=Count(*)))
                           |    |--Table Spool
                           |--Table Spool

Here is the result of the request:

SELECT *, (2*(RowNumber-1)/Cnt)+1 AS MyNTile
FROM
    (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY A ORDER BY B) AS RowNumber,
        COUNT(*) OVER (PARTITION BY A ) AS Cnt,
        NTILE(2) OVER (PARTITION BY A ORDER BY B) AS NTile
    FROM T
    ) TSeq

And here is the plan for this request:

  |--Compute Scalar(DEFINE:([Expr1006]=((2)*([Expr1003]-(1)))/CONVERT_IMPLICIT(bigint,[Expr1004],0)+(1)))
       |--Nested Loops(Inner Join)
            |--Table Spool
            |    |--Segment [GROUP BY:([T].[A])]
            |         |--Sequence Project(DEFINE:([Expr1003]=row_number, [Expr1005]=ntile))
            |              |--Compute Scalar(DEFINE:([Expr1010]=(1)))
            |                   |--Segment [GROUP BY:([T].[A])]
            |                        |--Nested Loops(Inner Join)
            |                             |--Table Spool
            |                             |    |--Segment [GROUP BY:([T].[A])]
            |                             |         |--Sort(ORDER BY:([T].[A] ASC, [T].[B] ASC))
            |                             |              |--Clustered Index Scan(OBJECT:([T].[TPK]))
            |                             |--Nested Loops(Inner Join, WHERE:((1)))
            |                                  |--Stream Aggregate(DEFINE:([Expr1007]=Count(*)))
            |                                  |    |--Table Spool
            |                                  |--Table Spool
            |--Nested Loops(Inner Join, WHERE:((1)))
                 |--Compute Scalar(DEFINE:([Expr1004]=CONVERT_IMPLICIT(int,[Expr1012],0)))
                 |    |--Stream Aggregate(DEFINE:([Expr1012]=Count(*)))
                 |         |--Table Spool
                 |--Table Spool

This plan is basically the same as the previous one with COUNT

. The only difference is the addition of the additional Sequence Project and Sequence statements, which evaluate the NTILE function by calculating ROW_NUMBER and applying the formula shown above. Using this formula, we could calculate NTILE manually like this:Unfortunately, the optimizer doesn't know that it can use the same plan fragment to calculate the ranking and COUNT functionswindowed aggregate function, so ends up choosing an inefficient plan that calculates COUNTtwice: I'll leave the analysis of this plan to you as an exercise. Enjoy!

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *