Java Reference
In-Depth Information
This phenomenon of
clustering
is one of the main drawbacks of linear probing. Long chains tend to get longer
since the probability of hashing to a long chain is usually greater than that of hashing to a short chain. It is also easy
for two short chains to be joined, creating a longer chain that, in turn, will tend to get longer. For example, any key that
ends up in
num[7]
will create a long chain from locations
5
to
10
.
We define two types of clustering.
•
Primary clustering
occurs when keys that hash to different locations trace the same sequence
in looking for an empty location. Linear probing exhibits primary clustering since a key that
hashes to
5
, say, will trace
5
,
6
,
7
,
8
,
9
, and so on, and a key that hashes to
6
will trace
6
,
7
,
8
,
9
,
and so on.
•
Secondary clustering
occurs when keys that hash to the
same
location trace the same sequence
in looking for an empty location. Linear probing exhibits secondary clustering since keys that
hash to
5
, say, will trace the same sequence
5
,
6
,
7
,
8
,
9
, and so on.
Methods of resolving collisions that hope to improve on linear probing will target the elimination of primary
and/or secondary clustering.
You may wonder if using
loc = loc + k
where
k
is a constant greater than 1 (for example,
3
) will give any better
results than
loc = loc + 1
. As it turns out, this will not alter the clustering phenomenon since groups of
k
-apart keys
will still be formed.
In addition, it can even be worse than when
k
is 1 since it is possible that not all locations will be generated.
Suppose the table size is
12
,
k
is
3
, and a key hashes to
5
. The sequence of locations traced will be
5
,
8
,
11
,
2
(
11 + 3 - 12
),
5
, and the sequence repeats itself. In other words, only relatively few locations will be probed in
looking for an empty location. By comparison, when
k
is
1
, all locations are generated.
However, this is not really a problem. If the table size is
m
and
k
is “relatively prime” to
m
(their only common
factor is 1), then all locations are generated. Two numbers will be relatively prime if one is a prime and the other is
not a multiple of it, such as 5 and 12. But being prime is not a necessary condition. The numbers 21 and 50 (neither of
which is prime) are relatively prime since they have no common factors other than 1.
If
k
is
5
and
m
is
12
, a key hashing to
5
will trace the sequence
5
,
10
,
3
,
8
,
1
,
6
,
11
,
4
,
9
,
2
,
7
,
12
—all locations are
generated. A key hashing to any other location will also generate all locations.
In any case, being able to generate all locations is academic since if we had to trace many locations to find an
empty one, the search would be too slow, and we would probably need to use another method.
Notwithstanding what we've just said, it turns out that
loc = loc + k
, where
k
varies
with the key, gives us one of
the best ways to implement hashing. We will see how in Section 10.3.4.
So, how fast is the linear method? We are interested in the average
search length
, that is, the number of locations
that must be examined to find or insert a given key. In the previous example, the search length of
33
is
1
, the search
length of
61
is
2
, and the search length of
23
is
3
.
The search length is a function of the
load factor
,
f
, of the table, where
number of entries in table
f=
number of table locations
=
fraction of table filled
For a successful search, the average number of comparisons is
1
1
, and for an unsuccessful search,
2 1f
+
−
1
1
1+
the average number of comparisons is
. Note that the search length depends only on the fraction of the
( )
2
2
1-f
table filled,
not
on the table size.
Table
10-1
shows how the search length increases as the table fills up.
Search WWH ::
Custom Search