Transact SQL for data analysis

Transact SQL language is a procedural extension of the SQL language and is used in the Microsoft SQL Server DBMS. In this article we will look at using some T-SQL elements for data analysis. We'll look at window functions first, and then we'll talk about data pivoting and unpivoting methods. When pivoting, the data is converted from rows to columns, and when unprocessing is done, the data is converted from columns to rows, similar to Excel.

And here are the windows

A window function is a function that operates on a window, i.e. a set of strings, and returns a value based on some calculation. And a window is a subset of rows based on the window handle that refers to the current row. We'll also take a look at OVER, which is a T-SQL statement that specifies a window to apply the window function to.

If this sounds too technical, just think about having to perform a calculation over a set and return a single value. The classic example is aggregate calculations such as SUM, COUNT, AVG, but there are other functions such as ranking and offset. But the use of these functions has disadvantages that are easily eliminated by window functions, which we will discuss later.

Grouped queries do provide a summary view of new information, but they also result in one detail being lost. Once you group rows, all calculations in the query must be performed in the context of the defined groups. Often you need to perform calculations that involve both individual and aggregate elements. Window functions have no such restrictions.

The window function is evaluated for each individual row and applied to a subset of the rows obtained from the underlying query result set. The output of the window function is a scalar value that is added as another column to the query result. In other words, unlike grouped functions, windowed functions do not result in a loss of detail.

For example, let's say you want to query the order total for a customer and get the current value and the percentage it represents of the customer's total orders. If you group them by customer, you can only get the total number of customer orders. With a window function, you can return the customer total in addition to the individual order cost, and even calculate a percentage of the current order value from the customer total.

Now let's assume that the base query contains table operators, filters and other query elements – these do not affect what the subquery considers as a starting point. If you need a subquery to apply to a base query result set as a starting point, you need to repeat all the base query logic in the subquery. In contrast to this approach, the window function is applied to a subset of rows from the base query's result set, rather than to a new view of the data. This way, whatever you add to the base query is automatically applied to all window functions used in the query. If desired, you can further limit the window.

Another advantage of using window functions is that you get the ability to define the order as part of the calculation specification in cases where this is possible. This way the order is determined for calculation and is not confused with the order of presentation.

Below is an example of a query on sales data. A view of EmpOrders in a TSQLV6 database that uses a window function to calculate the running totals for each employee by month:

USE TSQLV6;
SELECT empid, ordermonth, val,
  SUM(val) OVER(PARTITION BY empid
    ORDER BY ordermonth
    ROWS BETWEEN UNBOUNDED PRECEDING
      AND CURRENT ROW) AS runval
FROM Sales.EmpOrders;

The resulting output will look like this:

The window function definition that you specify in the request called by OVER has up to three parts: a window-partition block, a window-order block, and a window-frame block. In this case, the empty OVER() clause represents the entire result set of the base query.

Using a WINDOW block allows you to specify all or part of a window specification in a query, and then use the window name in the OVER block of that query. Its main purpose is to reduce the query string length when you have duplicate window specifications. This is available in SQL Server 2022 and later, and in Azure SQL Database as long as the database compatibility level is set to 160 or higher. You can find out the compatibility level using the following query:

SELECT DATABASEPROPERTYEX(N'TSQLV6', N'CompatibilityLevel');

When considering all the basic query commands (SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY), you place WINDOW between the HAVING and ORDER BY query elements.

Consider the following query as an example:

SELECT empid, ordermonth, val,
  SUM(val) OVER(PARTITION BY empid
    ORDER BY ordermonth
    ROWS BETWEEN UNBOUNDED PRECEDING
      AND CURRENT ROW) AS runsum,
  MIN(val) OVER(PARTITION BY empid
    ORDER BY ordermonth
    ROWS BETWEEN UNBOUNDED PRECEDING
      AND CURRENT ROW) AS runmin,
  MAX(val) OVER(PARTITION BY empid
    ORDER BY ordermonth
    ROWS BETWEEN UNBOUNDED PRECEDING
      AND CURRENT ROW) AS runmax,
  AVG(val) OVER(PARTITION BY empid
    ORDER BY ordermonth
    ROWS BETWEEN UNBOUNDED PRECEDING
      AND CURRENT ROW) AS runavg
FROM Sales.EmpOrders;

Data Pivoting

Pivoting data (for simplicity, I suggest using the Anglicism pivoting) involves transferring data from a row state to a column state, possibly with simultaneous aggregation of values. In many cases, this data processing is done at the presentation layer for purposes such as reporting.

For further examples we will use the following data structure:

USE TSQLV6;
DROP TABLE IF EXISTS dbo.Orders;
CREATE TABLE dbo.Orders
(
  orderid INT NOT NULL
    CONSTRAINT PK_Orders PRIMARY KEY,
  orderdate DATE NOT NULL,
  empid INT NOT NULL,
  custid VARCHAR(5) NOT NULL,
  qty INT NOT NULL
);
INSERT INTO dbo.Orders(orderid, orderdate, empid, custid, qty)
VALUES
  (30001, '20200802', 3, 'A', 10),
  (10001, '20201224', 2, 'A', 12),
  (10005, '20201224', 1, 'B', 20),
  (40001, '20210109', 2, 'A', 40),
  (10006, '20210118', 1, 'C', 14),
  (20001, '20210212', 2, 'B', 12),
  (40005, '20220212', 3, 'A', 10),
  (20002, '20220216', 1, 'C', 20),
  (30003, '20220418', 2, 'B', 15),
  (30004, '20200418', 3, 'C', 22),
  (30007, '20220907', 3, 'D', 30);
SELECT * FROM dbo.Orders;

Let's say you need to query this table and return the total number of orders for each employee and customer. To solve this problem, use the following grouped query:

SELECT empid, custid, SUM(qty) AS sumqty
FROM dbo.Orders
GROUP BY empid, custid;

The result will look like this:

But, you have a requirement to present the result in the following form:

Let's look at these three steps with an example. First you need to create one row of results for each unique employee ID. This means that the rows from dbo.Orders must be grouped by the empid attribute (the grouping element).

The dbo.Orders table contains one column containing all customer ID values and one column containing the number of orders. The summary analysis task requires that for each unique customer ID, a separate results column is created that contains the aggregated quantities for that customer.

Finally, because summary analysis involves grouping, you need to combine the data to obtain the resulting values at the “intersection” of the grouping (employee) and distribution (customer) members. You need to define an aggregation function (in this case SUM) and an aggregation element (in this case the qty attribute).

Recall that pivoting includes grouping, propagation, and aggregation. In this example, you group by empid, spread (quantity) by custid, and aggregate by SUM (qty). Once you define the elements involved in the summary, the rest is simply including those elements in the right places in the overall query template for the summary.

Here's a complete solution query that rolls up the order data, returning the total quantity for each employee (by rows) and customer (by columns):

SELECT empid,
  SUM(CASE WHEN custid = 'A' THEN qty END) AS A,
  SUM(CASE WHEN custid = 'B' THEN qty END) AS B,
  SUM(CASE WHEN custid = 'C' THEN qty END) AS C,
  SUM(CASE WHEN custid = 'D' THEN qty END) AS D
FROM dbo.Orders
GROUP BY empid;

The result will be identical to the requirements presented above.

But you can also use the PIVOT operator for the summary. Instead of working directly with the orders table dbo.Orders, PIVOT works with a derived table called D, which contains only the pivot elements empid, custid, and qty.

SELECT empid, A, B, C, D
FROM dbo.Orders
  PIVOT(SUM(qty) FOR custid IN(A, B, C, D)) AS P;

The result will be similar.

We have a cancellation

Unpivoting is a technique that moves data from a column state to a row state. It typically involves querying the summary state of the data and retrieving from each source row multiple result rows, each containing a different value from the source column. A common use case is to unpivot data imported from a spreadsheet into a database to simplify further processing.

USE TSQLV6;
DROP TABLE IF EXISTS dbo.EmpCustOrders;
CREATE TABLE dbo.EmpCustOrders
(
  empid INT NOT NULL
    CONSTRAINT PK_EmpCustOrders PRIMARY KEY,
  A VARCHAR(5) NULL,
  B VARCHAR(5) NULL,
  C VARCHAR(5) NULL,
  D VARCHAR(5) NULL
);
INSERT INTO dbo.EmpCustOrders(empid, A, B, C, D)
  SELECT empid, A, B, C, D
  FROM (SELECT empid, custid, qty
    FROM dbo.Orders) AS D
    PIVOT(SUM(qty) FOR custid IN(A, B, C, D)) AS P;
SELECT * FROM dbo.EmpCustOrders;

The table has a row for each employee, a column for each of the four customers A, B, C, and D, and a number of orders for each employee and customer. Note that irrelevant intersections (combinations of employees and customers in which there were no overlapping order activities) are represented by zero values. Let's say you receive an unsubmit request that requires you to return a row for each employee and customer along with the order quantity, if any. The result should look like this:

Unpivoting involves creating two result columns from any number of source columns—one to store the source column names as strings, and one to store the source column values. In the example below, you need to unpin the original columns A, B, C, and D to end up with a custid name column and a qty value column. Similar to the PIVOT operator, T-SQL also supports the UNPIVOT operator, which allows you to unpin data. The general form of a query with the UNPIVOT operator is as follows:

SELECT empid, custid, qty
FROM dbo.EmpCustOrders
  UNPIVOT(qty FOR custid IN(A, B, C, D)) AS U;

Essentially, the UNPIVOT statement follows the same steps as the PIVOT statement discussed earlier.

Conclusion

In this article, we looked at some aspects of using window functions in T-SQL. In fact, you can use many more different MS SQL Server tools to provide this functionality.