Why statistics are important to SQL Server

A few days remain until the start of a new flow at the rate “MS SQL Server Developer”. In anticipation of the start of the course, we continue to share useful material with you.

Over the years of working with SQL Server, I have discovered that there are several topics that are often ignored. That they are afraid, they think that they are complex or that they are not so important. There is also an opinion that this knowledge is not needed, since SQL Server “does everything for me.” I heard that about indexes. I heard this about statistics.

So, let’s talk about why statistics are important and why knowing about their importance will help you significantly affect the performance of your queries.

There are several cases when statistics are processed automatically:

  • SQL Server automatically creates statistics for indexes
  • SQL Server automatically creates statistics for columns when it needs more information to optimize a query
    • IMPORTANT! This only happens when the database option is enabled. auto_create_statistics. This option is enabled by default in all versions and editions of SQL Server. However, sometimes there are recommendations to turn it off. I am categorically against this.
  • SQL Server automatically updates statistics for indexes and columns when it becomes “out of date” if necessary to optimize the query.

However, it is important to know that although statistics are processed automatically, this process does not always work as well as you expect. For truly effective SQL Server optimization, you need certain templates and data sets. To understand this, I want to talk a bit about data access and the optimization process.

Data access

Typically, when you send a query to SQL Server to retrieve data, you write Transact-SQL code as a simple SELECT or, possibly, as a stored procedure (yes, there are other options). However, the main thing is that you say what kind of data set you want to receive, and not describe how this data should be extracted. How does SQL Server get to the data?

Data processing

Statistics help you get and process SQL Server’s data. Statistics provide information on how much data will need to be processed when executing the request. If you need to process a small amount of data, then processing can be simpler (possibly in a different way) than if the request processed millions of rows.

In particular, SQL Server uses a cost based query optimizer. There are other optimization options, but today, most often, cost-based optimizers are used. Why? Cost-based optimizers use information about the requested data to formulate more efficient, optimal and focused plans, taking into account information about this data. This process usually works well. Although with plans that are saved for subsequent executions (cached plans), there may be problems. However, other optimization methods have even more serious flaws.

IMPORTANT! I’m not talking about a plan cache here … I’m talking about the initial optimization process and how SQL Server determines how much data it will need to get. Subsequent execution of the cached plan can lead to additional problems with it (known as parameter sniffing). Much has been written about this in other articles.

To explain cost-based optimization, let me talk about other types of optimization. This will help you understand the benefits of cost optimization.

Syntax Optimization

SQL Server can process a query using only its text and not waste time finding the best order for processing tables. The optimizer can simply join your tables in the order you specify them in FROM. Although in order to start executing the request no costs are required, but the execution of the request itself may be far from optimal. In general, joining large tables with small ones is less optimal than joining small tables with large ones. Let’s look at these two examples:

USE [WideWorldImporters];
GO

SET STATISTICS IO ON;
GO

SELECT [so].*, [li].*
FROM [sales].[Orders] AS [so]
JOIN [sales].[OrderLines] AS [li] ON [so].[OrderID] = [li].[OrderID]
WHERE [so].[CustomerID] = 832 AND [so].[SalespersonPersonID] = 2
OPTION (FORCE ORDER);
GO

SELECT [so].*, [li].*
FROM [sales].[OrderLines] AS [li]
JOIN [sales].[Orders] AS [so] ON [so].[OrderID] = [li].[OrderID]
WHERE [so].[CustomerID] = 832 AND [so].[SalespersonPersonID] = 2
OPTION (FORCE ORDER);
GO

Compare cost plans. The second request is much more expensive.

Cost of identical queries with FORCE ORDER with different connection order.

Yes, this is an oversimplification of ways to optimize the connection, but the fact is that you are unlikely to specify the optimal table order in FROM yourself. The good news is that if you have performance problems, then you can apply optimizations such as joining tables in the specified order (see the example above). There are many other hints that you can use:

  • QUERY Hints (forced use of concurrency level [MAXDOP], query optimization to quickly get the first rows, and not the entire data set using FAST n, etc.)
  • Table hints (forced use of index [INDEX]index search [FORCESEEK] etc.)
  • Join hints (forced use of the connection type LOOP / MERGE / HASH)

I must say that the use of hints should always be the last way to optimize. You should try other optimization methods before using hints. Using them, you do not allow SQL Server to optimize your queries when adding or changing indexes, when adding or changing statistics, or when updating SQL Server. Of course, unless you have documented the hints used and tested them to make sure that they remain useful for your query after completing database maintenance tasks, updating data, and updating SQL Server. Honestly, I do not see this done as often as I would like. Most hints are added as if they will always work perfectly, and remain in the request until serious problems arise.

It is always important to look at other ways to improve performance before applying hints. Yes, it’s easy to put the hint now, but the time you spend looking for other solutions can pay off in the long run.

For problems with a specific request with its specific values, it is worth looking at:

  • Is the statistics accurate / up to date? Will updating statistics fix the problem?
  • Statistics based on partial sampling? Will FULLSCAN fix the problem?
  • Can you rewrite the request and improve the plan?
    • Are there any bad search terms? Columns should always be on one side of the expression:
      • So good: MonthlySalary > expression / 12
      • So bad: MonthlySalary * 12 > expression
    • Are there any transitivity issues you can help SQL Server with? This is an interesting question, and it becomes more and more problematic with the increasing number of tables. Adding a seemingly redundant condition can help the optimizer see what it did not take into account in the original request (see the additional condition AND t2.colX = 12 below in WHERE):

FROM table2 AS t1
JOIN table2 AS t2 ON t1.colX = t2.colX
WHERE t1.colX = 12 AND t2.colX = 12

  • Sometimes a simple move from join to subquery or from subquery to join fixes the problem (no, one is not always better than the other, but sometimes rewriting can help the optimizer).
  • Sometimes using derived tables (subqueries in FROM (…) AS J1) can help the optimizer join tables more optimally.
  • Do you have OR conditions? Can you rewrite them through UNION or UNION ALL to get the same result (this is most important) with a better execution plan? Be careful, semantically these are different requests. You need to understand the difference between all of them well.
    • OR removes duplicate rows (based on row ID)
    • UNION removes duplicates based on columns specified in SELECT
    • UNION ALL combines sets (which can be much faster than removing duplicates), but this may (or may not be) a problem:
      • sometimes there are no duplicates (you must know your details)
      • it is sometimes permissible to return duplicates (you must know your users / audience / application)

This is a very small list, but it can help you improve performance without the use of hints. This means that with subsequent changes in data, indexes, statistics, in the version of SQL Server, the optimizer will be able to take these changes into account!

But still it’s great that there are hints, and if we really need them, we can use them. There are situations when the optimizer may not be able to come up with an effective plan. For these cases, you can use hints. So yes, I like that they are. I just don’t want you to use them until you determine the true cause of the problem.

So, yes, you can achieve syntax-level optimization … if you need it.

Optimization using rules (heuristic)

I mentioned that cost-based optimization requires statistics. But what if you do not have statistics?

You can also look at it on the other hand – why can’t SQL Server simply use the “ruleset” to quickly optimize queries without having to view / analyze information about your data? Wouldn’t that be faster? SQL Server can do this, but often this is not the best solution. To demonstrate this, you must prevent SQL Server from using statistics when processing the request. I can show an example when it really works well, and much more examples when it works badly.

Heuristics are rules. Simple, static, fixed rules. The fact that they are simple is their advantage. No need to look at the data. A simple and quick prediction-based query estimation is made. For example, “less” and “more” have an internal rule of “30%”. Simply put, when you run a query with the predicate “more” or “less” and there is no data information (statistics), SQL Server will use a rule that says that 30% of the data will correspond to the condition. The optimizer will use this in his estimates and come up with a plan that matches this rule.
For this to work, you must first disable auto_create_statistics and check existing indexes and statistics:

USE [WideWorldImporters];
GO

ALTER DATABASE [WideWorldImporters]
SET AUTO_CREATE_STATISTICS OFF;
GO

EXEC sp_helpindex '[sales].[Customers]';
EXEC sp_helpstats '[sales].[Customers]', 'all';
GO

Take a look sp_helpindex and sp_helpstats. In the database WideWorldImporters in the table Customers on the column DeliveryPostalCode by default there are no indexes and statistics. If you added something yourself (or SQL Server created it automatically), you should remove them before executing the following examples.

For the first request we will put ZipCodeequal to 90248 and use the predicate “less”. Let’s see how SQL Server estimates the number of rows without using statistics and the possibility of its automatic creation.

SELECT [c1].[CustomerID],
      [c1].[CustomerName],
      [c1].[PostalCityID],
      [c1].[DeliveryPostalCode]
FROM [sales].[Customers] AS [c1]
WHERE [c1].[DeliveryPostalCode] < '90248';

Columns without statistics will use heuristics.

If you do not find the "ideal" value, then most of the time these rules will be wrong!

For the first query, the estimate works well (30% of 663 = 198.9), since the actual number of rows for the query is 197. One important point to pay attention to is the warning next to the table Customers and near the leftmost SELECT statement. It tells us that something is wrong here. Although the estimate of the number of lines is “correct”.

For the second query, we take the value ZipCode equal to 90003. The request is exactly the same except for the ZipCode value. How will SQL Server now evaluate the number of rows?

SELECT [c1].[CustomerID],
       [c1].[CustomerName],
       [c1].[PostalCityID],
       [c1].[DeliveryPostalCode]
FROM [sales].[Customers] AS [c1]
WHERE [c1].[DeliveryPostalCode] < '90003';

Columns without statistics use heuristics (simple rules). Often they are very wrong!
For the second query, the score is also 198.9, but actually only 1. Why? Because without statistics, the heuristic for “less” (and “more”) is 30%. Thirty percent of 663 is 198.9. The specific value changes when the data is modified, but the percentage remains constant at 30%.

If this query is more complex (with connections and / or additional predicates), then the presence of incorrect information is already a problem for the subsequent optimization steps. Yes, from time to time you may be lucky with heuristics, but this is unlikely. Moreover, the heuristic for BETWEEN and “equal” differs from the values ​​for “less” and “greater” (equal to 30%). In fact, some of them even change depending on the cardinality assessment model you use (for example, for “equal”). But should this bother me? Not really! In fact, I never want to use them.

So, SQL Server can use rule-based optimization ... but only when it does not have the best information.

I do not want heuristics! I want statistics!
Statistics is one of the few places in SQL Server that cannot be few. No, I'm not talking about creating statistics for each column of the table, but there are some cases where you can create statistics first. But this is a topic for a separate article.
So why is statistics so important for cost optimization?

Cost based optimization

What does cost-based optimization actually do? In short, SQL Server quickly gets a rough estimate of how much data will be processed. Then, using this information, he estimates the cost of various algorithms that can be used to access the data. After that, based on the “cost” of these algorithms, SQL Server chooses the one that, according to its calculations, is the least expensive. Then he compiles and executes it.

It sounds great, but there are many factors when it may not work as well as we would like. Most importantly, the basic information used to perform the assessment (statistics) may be incorrect:

  • Statistics may be outdated
  • Statistics may not be accurate due to limited sampling
  • Statistics may not be accurate due to the size of the table and the limitations of what is stored in it.

Some will ask me - can SQL Server have more detailed statistics (more detailed histograms, etc.)? Yes maybe. But then the process of reading / accessing this, more and more statistics will become more expensive (and take more time, more cache, etc.). Which, in turn, will make the optimization process more expensive. This is a difficult problem. Everywhere there are pros and cons, compromises. In fact, everything is not as simple as “detailed histograms”.

Finally, in the optimization process, all possible combinations of plans cannot be analyzed. This would make the optimization process itself so expensive that it would not be permissible!

Total

The best way to think about the optimization process is to find “Good plan fast”. Otherwise, the optimization process would become so complicated and drag on so much that it would not reach its goal!

So why statistics is so important:

  • It is used throughout the process of cost optimization (and you want cost-based optimization)
  • It must be present, otherwise you will be forced to use heuristics
    • as a rule, I highly recommend enabling the option auto_create_statisticsif you turned it off
  • It should be relevant in order to give more accurate estimates (it is important to update it and some statistics may need your “help” more than others to remain relevant)
  • She may need additional help (so as it is updated when updated, or with additional statistics)

STATISTICS ARE THE KEY TO BETTER OPTIMIZATION AND, hence, BETTER PRODUCTIVITY (but it's still far from ideal!)

I hope this article motivates you to learn more about statistics. The statistics in SQL Server are actually simpler than you think, and obviously they are very, very important!
A rather old but still useful article - Statistics Used by the Query Optimizer in Microsoft SQL Server 2008

Learn more about the course at open house records.

Similar Posts

Leave a Reply