Cluster Analysis: Basic Concepts and Methods - Data Mining: Concepts and Techniques - page 485

Databases Reference

In-Depth Information

Figure10.21 A data set that is uniformly distributed in the data space.

a random variable, o , we want to determine how far away o is from being uniformly

distributed in the data space. We calculate the Hopkins Statistic as follows:

1. Sample n points, p 1 , . . . , p n , uniformly from D . That is, each point in D has the same

probability of being included in this sample. For each point, p i , we find the nearest

neighbor of p i .

1 i n

/

in D , and let x i be the distance between p i and its nearest

neighbor in D . That is,

x i D min

v 2 D f dist

.

p i , v

/g.

(10.25)

2. Sample n points, q 1 , . . . , q n , uniformly from D . For each q i .

, we find the

nearest neighbor of q i in D f q i g, and let y i be the distance between q i and its nearest

neighbor in D f q i g. That is,

1 i n

/

y i D min

v 2 D , v 6D q i f dist

.

q i , v

/g.

(10.26)

3. Calculate the Hopkins Statistic, H , as

P i D1 y i

P i D1 x i C P i D1 y i

H D

.

(10.27)

“What does the Hopkins Statistic tell us about how likely data set D follows a uni-

form distribution in the data space?” If D were uniformly distributed, then P i D1 y i and

P i D1 x i would be close to each other, and thus H would be about 0.5. However, if D were

highly skewed, then P i D1 y i would be substantially smaller than P i D1 x i in expectation,

and thus H would be close to 0.

Next Page

Data Mining: Concepts and Techniques

Search WWH ::

Custom Search

Home