Databases Reference
In-Depth Information
Figure10.21 A data set that is uniformly distributed in the data space.
a random variable, o , we want to determine how far away o is from being uniformly
distributed in the data space. We calculate the Hopkins Statistic as follows:
1. Sample n points, p 1 , . . . , p n , uniformly from D . That is, each point in D has the same
probability of being included in this sample. For each point, p i , we find the nearest
neighbor of p i .
1 i n
/
in D , and let x i be the distance between p i and its nearest
neighbor in D . That is,
x i D min
v 2 D f dist
.
p i , v
/g.
(10.25)
2. Sample n points, q 1 , . . . , q n , uniformly from D . For each q i .
, we find the
nearest neighbor of q i in D f q i g, and let y i be the distance between q i and its nearest
neighbor in D f q i g. That is,
1 i n
/
y i D min
v 2 D , v 6D q i f dist
.
q i , v
/g.
(10.26)
3. Calculate the Hopkins Statistic, H , as
P i D1 y i
P i D1 x i C P i D1 y i
H D
.
(10.27)
“What does the Hopkins Statistic tell us about how likely data set D follows a uni-
form distribution in the data space?” If D were uniformly distributed, then P i D1 y i and
P i D1 x i would be close to each other, and thus H would be about 0.5. However, if D were
highly skewed, then P i D1 y i would be substantially smaller than P i D1 x i in expectation,
and thus H would be close to 0.
 
Search WWH ::




Custom Search