Databases Reference
In-Depth Information
Figure10.21
A data set that is uniformly distributed in the data space.
a random variable,
o
, we want to determine how far away
o
is from being uniformly
distributed in the data space. We calculate the Hopkins Statistic as follows:
1.
Sample
n
points,
p
1
, . . . ,
p
n
, uniformly from
D
. That is, each point in
D
has the same
probability of being included in this sample. For each point,
p
i
, we find the nearest
neighbor of
p
i
.
1
i
n
/
in
D
, and let
x
i
be the distance between
p
i
and its nearest
neighbor in
D
. That is,
x
i
D
min
v
2
D
f
dist
.
p
i
,
v
/g.
(10.25)
2.
Sample
n
points,
q
1
, . . . ,
q
n
, uniformly from
D
. For each
q
i
.
, we find the
nearest neighbor of
q
i
in
D
f
q
i
g, and let
y
i
be the distance between
q
i
and its nearest
neighbor in
D
f
q
i
g. That is,
1
i
n
/
y
i
D
min
v
2
D
,
v
6D
q
i
f
dist
.
q
i
,
v
/g.
(10.26)
3.
Calculate the Hopkins Statistic,
H
, as
P
i
D1
y
i
P
i
D1
x
i
C
P
i
D1
y
i
H
D
.
(10.27)
“What does the Hopkins Statistic tell us about how likely data set
D
follows a uni-
form distribution in the data space?”
If
D
were uniformly distributed, then
P
i
D1
y
i
and
P
i
D1
x
i
would be close to each other, and thus
H
would be about 0.5. However, if
D
were
highly skewed, then
P
i
D1
y
i
would be substantially smaller than
P
i
D1
x
i
in expectation,
and thus
H
would be close to 0.