Hash Join
Unlike the nested loop join, which works best on small inputs, and merge join, which excels on sorted inputs, a hash
join is designed to handle large unsorted inputs. The hash join algorithm consists of two different phases.
During the first, or build, phase, a hash join scans one of the inputs, calculates a hash value for each join key, and places the row into the corresponding hash table bucket. Next, in the second, or probe, phase, it scans the other input and checks, or probes, whether the hash value of each join key from the second input exists in the hash table. When it does, SQL Server evaluates the join
predicate for the row from the second input against all rows from the first input that belong to the same hash bucket.
This comparison must be done because the algorithm that calculates the hash values does not guarantee the
uniqueness of hash values for individual keys; a hash collision occurs when multiple different keys generate
the same hash. Even though hash collisions add overhead from the extra comparison operations, those situations are relatively rare.
Listing 25-9 shows the algorithm of an inner hash join.
Listing 25-9. Inner hash join algorithm
/* Build Phase */
for each row R1 in input I1
begin
    calculate hash value on R1 join key
    insert R1 into appropriate bucket in hash table
end

/* Probe Phase */
for each row R2 in input I2
begin
    calculate hash value on R2 join key
    for each row R1 in hash table bucket
        if R1 joins with R2
            return join (R1, R2)
end
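The pseudocode in Listing 25-9 can be sketched as a small, self-contained Python function. This is purely illustrative, not SQL Server's actual implementation; the function name, the tuple-based rows, and the key-index parameters are all assumptions made for the example. Note how the probe phase re-checks the actual key values, because a hash collision may place rows with different keys in the same bucket.

```python
# Illustrative in-memory inner hash join (not SQL Server's implementation).
# Rows are tuples; build_key_idx / probe_key_idx select the join-key column.
from collections import defaultdict

def inner_hash_join(build_input, probe_input, build_key_idx, probe_key_idx):
    # Build phase: hash every row of the first input into a bucket.
    hash_table = defaultdict(list)
    for r1 in build_input:
        hash_table[hash(r1[build_key_idx])].append(r1)

    # Probe phase: look up the bucket for each row of the second input,
    # then re-evaluate the join predicate on the real key values,
    # since different keys can collide into the same bucket.
    result = []
    for r2 in probe_input:
        for r1 in hash_table.get(hash(r2[probe_key_idx]), []):
            if r1[build_key_idx] == r2[probe_key_idx]:
                result.append(r1 + r2)
    return result

# Hypothetical sample data: orders keyed by id, customers referencing them.
orders = [(1, 'Order A'), (2, 'Order B'), (3, 'Order C')]
refs = [(10, 1), (11, 1), (12, 3)]
print(inner_hash_join(orders, refs, 0, 1))
```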
As you can guess, a hash join requires memory to store the hash table. The performance of a hash join greatly
depends on correct memory grant estimation. When the memory estimation is incorrect, the hash join stores some
hash table buckets in tempdb , which can greatly reduce the performance of the operator.
When this happens, SQL Server tracks where the buckets are located: either in memory or on disk. For each row
from the second input, it checks where the corresponding hash bucket is located. If it is in memory, SQL Server processes the row
immediately; otherwise, it stores the row in another internal temporary table in tempdb .
After the first pass is done, SQL Server discards the in-memory buckets, replaces them with the buckets from disk,
and repeats the probe phase for all of the remaining rows from the second input that were stored in the internal
temporary table. If there still isn't enough memory to accommodate all of the hash buckets, some of them are
spilled to disk again.
The number of times this happens is called the recursion level . SQL Server tracks it and, eventually,
switches to a special bailout algorithm, which is less efficient, although it is guaranteed to complete at some point.
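The spill-and-repartition behavior described above can be sketched in Python as a simplified grace-style hash join. This is a conceptual model under stated assumptions, not SQL Server's internals: the function names, the fixed fan-out of 4, the `max_in_memory` threshold, and the level-dependent hash seed are all invented for illustration. Real SQL Server spills buckets to tempdb rather than to Python lists, and its bailout algorithm is more sophisticated than the simple in-memory fallback shown here.

```python
# Illustrative sketch of a spilling ("grace") hash join with recursion levels.
def partition(rows, key_idx, fanout, level):
    # Salt the hash with the recursion level so that repartitioning
    # at a deeper level actually redistributes the rows.
    parts = [[] for _ in range(fanout)]
    for row in rows:
        parts[hash((level, row[key_idx])) % fanout].append(row)
    return parts

def grace_hash_join(build, probe, bk, pk, max_in_memory, level=0, max_level=3):
    # If the build side fits in "memory" (or we reach the bailout depth),
    # join this partition in memory, keyed by the actual join-key value.
    if len(build) <= max_in_memory or level >= max_level:
        table = {}
        for r1 in build:
            table.setdefault(r1[bk], []).append(r1)
        return [r1 + r2 for r2 in probe for r1 in table.get(r2[pk], [])]
    # Otherwise partition both inputs the same way and recurse; each
    # recursion corresponds to one more "recursion level" in the text.
    result = []
    for b_part, p_part in zip(partition(build, bk, 4, level),
                              partition(probe, pk, 4, level)):
        result.extend(grace_hash_join(b_part, p_part, bk, pk,
                                      max_in_memory, level + 1, max_level))
    return result
```

Because both inputs are partitioned with the same hash function, matching rows always land in the same pair of partitions, so each partition can be joined independently.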
Tip
You can monitor hash table spills to tempdb with Hash Warnings in SQL Trace and Extended Events.
Similar to a merge join, a hash join requires at least one equality predicate in the join condition.