Based on an article by Craig Freedman: Nested Loops Join
SQL Server supports three physical join operators: nested loop join, merge join, and hash join. In this article, I will describe the nested loops join – Nested Loops Join (or NL join, for short).
In a simplified form, a nested loops join compares each row in one table (called the outer table) with each row in another table (called the inner table), looking for those rows that satisfy the join predicate. Note that the terms “internal” and “external” are ambiguous; their understanding depends on the context. “Inner table” and “outer table” refer to the rows consumed by the connection. “Inner join” and “outer join” refer to logical operations.
This algorithm in pseudocode can be represented as follows:
for each row R1 in the outer table for each row R2 in the inner table if R1 joins with R2 return (R1, R2)
These nested loops represent the same algorithm that gave the nested loop compound its name.
All rows must be compared, and thus the cost of this algorithm is proportional to the size of the outer table times the size of the inner table. Since the cost increases very much as the size of the joined tables increases, in practice it is reduced by reducing the selection from the inner table, the rows of which are processed for each row of the outer table.
For the following examples, let’s use the same schema as in my previous article:
create table Customers (Cust_Id int, Cust_Name varchar(10)) insert Customers values (1, 'Craig') insert Customers values (2, 'John Doe') insert Customers values (3, 'Jane Doe') create table Sales (Cust_Id int, Item varchar(10)) insert Sales values (2, 'Camera') insert Sales values (3, 'Computer') insert Sales values (3, 'Monitor') insert Sales values (4, 'Printer')
Consider this query:
select * from Sales S inner join Customers C on S.Cust_Id = C.Cust_Id option(loop join)
I used the “loop join” optimizer hint to explicitly set nested loop joins. As a result, the following query execution plan will be obtained, which was obtained when executed with the “set statistics profile on” setting:
In this execution plan, the outer table is Customers, while the inner table is Sales. Thus, it all starts with looking at the Customers table, from which customers are selected (one at a time) and, for each of them, the Sales table is looked at. Since we have 3 customers in the table, the Sales table is scanned 3 times. Each view of the Sales table returns 4 rows. We compare each sale to the current customer and evaluate if a couple of rows have the same Cust_Id. If it is, return that pair of rows. We have 3 customers and 4 sales, so this comparison only runs 3*4 = 12 times. Only three of these comparisons met the given requirements.
Now let’s see what happens if we create an index on the Sales table:
create clustered index CI on Sales(Cust_Id)
Now we get this plan:
This time, instead of looking at the Sales table, an index search is performed. The index lookup is still done three times, once for each client, but each index lookup returns only those rows that refer to C.Cust_Id and are qualified by the join predicate. So the lookup only returns 3 rows compared to the 12 rows returned by a table lookup.
Note that the index lookup depends on the C.CustId, which is obtained from the join’s outer table by looking up the Customers table. Each time an index lookup is performed (remember, it is performed three times, one for each client), C.CustId will have a different value. Note that C.CustId is essentially a “correlated parameter”. If the nested loops connection uses correlated parameters, they will be shown in the execution plan as “OUTER REFERENCES”. Nested loop joins often occur where there is an index lookup that depends on a correlated parameter, and this dependency is in the nature of an “index join” index join. This is a common scenario.
What types of join predicates does a nested loops join support?
Nested loop join supports all join predicates, including equivalence (equality) and inequality predicates.
What logical join operators does a nested loops join support?
Nested loop join supports the following logical join operators:
Left outer join
CROSS APPLY and OUTER APPLY
Left semi-join and left anti-semi-join
Nested loop join does not support the following logical join operators:
Why nested loops join only supports left joins?
The nested loops join algorithm is easily extensible to support left outer joins and semi joins. As an example, the pseudocode for a left outer join is shown below. You can write similar pseudocode for left semi-join and left anti-semi-join.
for each row R1 in the outer table
for each row R2 in the inner table
if R1 joins with R2
return (R1, R2)
if R1 did not join
return (R1, NULL)
This algorithm keeps track of the join of each row of the outer table, and if, after looking through all the rows of the inner table, it cannot find a row in the inner table to join, it creates additional rows with NULL values.
Now let’s see what options exist to support a right outer join. In this case, return (R1, R2) pairs for rows that satisfy the join condition, and (NULL, R2) pairs for internal table rows that do not satisfy the join condition. The problem is that the lookup of the inner table is done multiple times, once for each row of the outer join. As a result of these several scans, the same rows of the internal table can be obtained. At the same time, it is not known how to determine that a particular row of the internal table does not satisfy or will not satisfy the join condition? Also, if we use an index join, some inner table rows may be skipped altogether, even though they should be returned for the outer join.
Fortunately, since a right outer join can easily be rewritten as a left outer join, and a right semi join is rewritten as a left semi join, it remains possible to use a nested loops join for a right outer join and a semi join. However, while such conversions are valid, they can affect performance. For example, the “Customer left outer join Sales” join applied in the above schema with a clustered index on Sales can use a lookup on the Sales index, similar to how it is done in the inner join example. On the other hand, the join “Customer right outer join Sales” can be converted to “Sales left outer join Customer”, but now we need an index on Customer.
Why aren’t full outer joins supported?
Nested loop join does not support full outer join. However, you can always convert “T1 full outer join T2” to “T1 left outer join T2 UNION T2 left anti-semi-join T1”. The basis of such a transformation is the transformation of a full outer join into a left outer join, which includes all pairs of rows from T1 and T2 to be joined, and all rows of T1 that are not to be joined, after which, using an anti-semi-join, the rows of T2 are added, which also cannot be combined. This is how it might look like:
select * from Customers C full outer join Sales S on C.Cust_Id = S.Cust_Id
The concatenation here is the union of the red strings (2-4) – the left outer join, and the green strings (5-9) – the semi-join, which allows you to implement a full outer join. The Compute Scalar is simply the calculation of the null constant for the Customer columns, which is necessary because the semi-join itself cannot generate output values for the inner table (in this case, the Customer table is meant). It turns out that a semi-join only tests for existence; those. that there is no actual match for any string. TOP is used for additional (optional) optimization purposes, it allows you to stop looking at the table when more than one row is created. Again, the semi-join only performs an existence test, so a single match is all we need here. By “not required” I meant that the semi-join would stop after the first match even without TOP.
Note that in the last example, the optimizer chooses to scan the inner clustered index even though a lookup could have been used. This is due to the cost of the solution. The table is so small (fits on one page) that there is no difference between browsing and searching.
NL connection – good or bad?
This question is improper. There is no “best” join algorithm, and no join algorithm can be considered good or bad. Each connection algorithm works well in certain conditions and may not work well in other conditions. Because nested loop join efficiency depends on the size of the outer data multiplied by the size of the inner data, this join is best used for relatively small inputs. The inner dataset does not need to be small, but if it is large, creating an index on the join key with good selectivity can help.
In some cases, a nested loop join may be the only join algorithm that SQL Server will be able to use. SQL Server must use nested loop joins in case of cross joins, as well as some complex CROSS APPLY and OUTER APPLY joins. In addition, with one exception (for a full outer join), nested loop join is the only available join algorithm that SQL Server can use without equi-join predicates.
What’s next …
In the next article, I will continue with the physical join operators by describing how a merge join works.