Window functions with a “window” or how to use a frame

Hello. On February 26, classes started in OTUS in a new group at the course “MS SQL Server Developer”. In this regard, I want to share with you my publication about window functions. By the way, in the coming week you can still join the group ;-).


Window functions are well-established in our practice, but few people know how the RANGE and ROWS frames work.

Perhaps that is why they are somewhat less common. The purpose of this article is to provide examples of use so that you definitely do not have any “Who is who?” and “How to apply it?”. The question “Why?” The article will remain unlit.
Let’s look at what a frame is and how to achieve a similar effect using ORDER By in the OVER () clause.

For demonstration, we will use a simple table so that you can calculate examples without using a compiler. In general, I highly recommend it – look and think over what will result from the execution, and then check yourself – this way you will find white spots in the perception of the work of window functions, which may not be obvious at all when you read ready-made results.

Table

Create table sales 
(sales_id INT PRIMARY KEY, sales_dt DATETIME2 DEFAULT GETUTCDATE(),  customer_id INT, item_id INT, cnt INT, price_per_item DECIMAL(19,4));

INSERT INTO sales
(sales_id, sales_dt, customer_id, item_id, cnt, price_per_item)
VALUES
(1, '2020-01-10T10:00:00', 100, 200, 2, 30.15),
(2, '2020-01-11T11:00:00', 100, 311, 1, 5.00),
(3, '2020-01-12T14:00:00', 100, 400, 1, 50.00),
(4, '2020-01-12T20:00:00', 100, 311, 5, 5.00),
(5, '2020-01-13T10:00:00', 150, 311, 1, 5.00),
(6, '2020-01-13T11:00:00', 100, 315, 1, 17.00),
(7, '2020-01-14T10:00:00', 150, 200, 2, 30.15),
(8, '2020-01-14T15:00:00', 100, 380, 1, 8.00),
(9, '2020-01-14T18:00:00', 170, 380, 3, 8.00),
(10, '2020-01-15T09:30:00', 100, 311, 1, 5.00),
(11, '2020-01-15T12:45:00', 150, 311, 5, 5.00),
(12, '2020-01-15T21:30:00', 170, 200, 1, 30.15);

Let’s start with a simple moment – the differences in the SUM function with and without sorting

SELECT sales_id, customer_id, count, 
SUM(count) OVER () as total,
SUM(count) OVER (ORDER BY customer_id) AS cum,
SUM(count) OVER (ORDER BY customer_id, sales_id) AS cum_uniq
FROM sales
ORDER BY customer_id, sales_id;

Let’s look at the first inconspicuous rake, how do you think how many developers will think that cum and cum_uniq are the same when reading the code? Think a little? Perhaps, but because here it is obvious, and is it so obvious when reading code in the application, and even with the not so obvious nonunique character of the sort field.

Now open our wonderful window
The window, or rather the frame is 2 types of ROWS and RANGE, first get acquainted with ROWS.
Frame restriction options:

  1. Everything before the current line / range and the value of the current line itself
    BETWEEN UNBOUNDED PRECEDING
    BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  2. The current line / range and everything after it
    BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
  3. With a specific indication of how many lines before and after to include (not supported for RANGE)
    BETWEEN N Preceding AND N Following
    BETWEEN CURRENT ROW AND N Following
    BETWEEN N Preceding AND CURRENT ROW

But let’s see the frame in action.

We open the “window” on the current line and all the previous ones, for the SUM function, as you can see, this coincides with the ASC sorting

SELECT sales_id, customer_id, cnt, 
SUM(cnt) OVER (ORDER BY customer_id, sales_id) AS cum_uniq,
SUM(cnt) OVER (ORDER BY customer_id, sales_id ROWS UNBOUNDED PRECEDING) AS current_and_all_before,
SUM(cnt) OVER (ORDER BY customer_id, sales_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS current_and_all_before2
FROM sales
ORDER BY customer_id, sales_id;

I want to remind you that the sort order in the window (in the OVER () clause) is not related to the sort order in the query itself, in the example it is the same in order to simplify the calculation if you decide to check the calculation and your understanding of how the function works

SELECT sales_id, customer_id, cnt, 
SUM(cnt) OVER (ORDER BY customer_id, sales_id) AS cum_uniq,
SUM(cnt) OVER (ORDER BY customer_id, sales_id ROWS UNBOUNDED PRECEDING) AS current_and_all_before,
SUM(cnt) OVER (ORDER BY customer_id, sales_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS current_and_all_before2
FROM sales
ORDER BY cnt;

Now let’s look at the functionality of the frame when we include all subsequent lines.

SELECT sales_id, customer_id, cnt, 
SUM(cnt) OVER (ORDER BY customer_id, sales_id ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS current_and_all_frame,
SUM(cnt) OVER (ORDER BY customer_id DESC, sales_id DESC) AS current_and_all_order_desc
FROM sales
ORDER BY customer_id, sales_id;

As you can see here, you can also get the same result for an aggregate function, if you use reverse sorting – but ROWS is a little more stable, because if your sorting fields are not unique, you can get a surprise in the form of the same values.

And finally, an option that can no longer be imitated by sorting, when we specify a specific number of lines that should be included in the frame

SELECT sales_id, customer_id, cnt, 
SUM(cnt) OVER (ORDER BY customer_id, sales_id ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS before_and_current,
cnt,
SUM(cnt) OVER (ORDER BY customer_id, sales_id ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) AS current_and_1_next,
cnt,
SUM(cnt) OVER (ORDER BY customer_id, sales_id ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS before2_and_2_next
FROM sales
ORDER BY customer_id, sales_id;

In this option, you already specifically indicate which lines are in the range, and in my opinion, it is most obvious from the results.

The difference between ROWS and RANGE

The difference is that ROWS operates on a row and RANGE on a range. True, this is obvious from the name, but explains little in practice?
Let’s look at the picture (source at the bottom of the article)

Now, if we take a close look, it will become obvious that rows with the same value of the sort parameter are called a range.
As mentioned above, ROWS is limited to a string, while RANGE captures the entire range of matching values ​​that you specify in the ORDER BY window function.

SELECT sales_id, customer_id, cnt, 
SUM(cnt) OVER (ORDER BY customer_id) AS cum_uniq,
cnt,
SUM(cnt) OVER (ORDER BY customer_id ROWS UNBOUNDED PRECEDING) AS current_and_all_before,
customer_id,
cnt,
SUM(cnt) OVER (ORDER BY customer_id RANGE UNBOUNDED PRECEDING) AS current_and_all_before2
FROM sales
ORDER BY 2, sales_id;

ROWS – always operates with a specific line, even if the sorting is not unique, but RANGE just combines periods into ranges with the matching values ​​of the sorting fields. In this sense, the functionality is very similar to the behavior of the SUM () function with sorting by a non-unique field. Let’s see another example.

SELECT sales_id, customer_id, price_per_item, cnt, 
SUM(cnt) OVER (ORDER BY customer_id, price_per_item) AS cum_uniq,
cnt,
SUM(cnt) OVER (ORDER BY customer_id, price_per_item ROWS UNBOUNDED PRECEDING) AS current_and_all_before,
customer_id,
cnt,
SUM(cnt) OVER (ORDER BY customer_id, price_per_item RANGE UNBOUNDED PRECEDING) AS current_and_all_before2
FROM sales
ORDER BY 2, price_per_item;

There are already 2 fields and range is determined by a range with matching values ​​for both fields

And the option is when we include in the calculation all subsequent lines from the current one, which in the case of the SUM function coincides with the value that can be obtained using reverse sorting

SELECT sales_id, customer_id, price_per_item, cnt, 
SUM(cnt) OVER (ORDER BY customer_id DESC, price_per_item DESC) AS cum_uniq,
cnt,
SUM(cnt) OVER (ORDER BY customer_id, price_per_item ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS current_and_all_before,
customer_id,
cnt,
SUM(cnt) OVER (ORDER BY customer_id, price_per_item RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS current_and_all_before2
FROM sales
ORDER BY 2, price_per_item, sales_id desc;

As you can see, RANGE again captures the entire range of matching pairs.

Despite the fact that the functionality of ROWS and RANGE is far from new each time, questions arise about how to use it. I hope this article added understanding of how ROWS and RANGE differ and now you will not doubt in which case this or that frame is needed.

Source of the illustration about the difference between RANGE and ROWS
Window functions with SQL Server 2016, Mark Tabladillo

Catch on course

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *