Most clustering methods use distance scores to calculate similarity. It is important to realise that distance scores, which express relative distances between data objects, may be calculated in different ways. First of all, whether distances can be calculated at all depends on the data type. If they can, the distances first need to be normalised (i.e., expressed in terms of a particular standard distance) and may then be calculated using the multidimensional equation of Pythagoras, usually called the Euclidean distance.[17] The distances can also be weighted if particular attributes are considered more important.
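
As an illustration of the normalised and weighted distance described above, the following is a minimal Python sketch. The function name `weighted_euclidean`, the `scales` parameter (one "standard distance" per attribute) and the example records are assumptions made for illustration only; they do not come from the text.

```python
import math


def weighted_euclidean(x, y, weights=None, scales=None):
    """Distance between two equal-length numeric records.

    weights: optional per-attribute importance (defaults to 1 each).
    scales:  optional per-attribute standard distance used to normalise
             each difference (defaults to 1 each). Hypothetical parameter.
    """
    n = len(x)
    if len(y) != n:
        raise ValueError("x and y must have the same number of attributes")
    weights = weights if weights is not None else [1.0] * n
    scales = scales if scales is not None else [1.0] * n

    total = 0.0
    for xi, yi, wi, si in zip(x, y, weights, scales):
        # Normalise the difference, square it, then apply the weight.
        total += wi * ((xi - yi) / si) ** 2
    return math.sqrt(total)


# Two hypothetical records: (age in years, income in euros).
a = (30, 40_000)
b = (33, 44_000)

plain = weighted_euclidean(a, b)                       # income dominates
normalised = weighted_euclidean(a, b, scales=(10, 10_000))
age_heavy = weighted_euclidean(a, b, weights=(4, 1), scales=(10, 10_000))
```

Without normalisation the income attribute dominates the score simply because of its unit; after dividing each difference by a standard distance, both attributes contribute on a comparable scale, and the weights then shift the emphasis deliberately.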

It should be mentioned that the number of dimensions n included in the clustering method might need to be limited for several reasons. First, the complexity of the clustering method should not be too high, in order to retain reasonable calculation times.[18] Second, high-dimensional spaces make it difficult to interpret the results, since it may be hard to apply intuition. Finally, the distance scores between any two data points in high-dimensional spaces will not really differ from the scores in lower-dimensional spaces if the extra dimensions are not relevant.[19]
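
The last point can be checked numerically. The sketch below, with made-up values, appends three attributes on which two records nearly agree and compares the Euclidean distance before and after; the records and the helper name `euclidean` are hypothetical.

```python
import math


def euclidean(x, y):
    """Unweighted Euclidean distance between two equal-length records."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Two hypothetical records in two relevant dimensions.
x = [1.0, 5.0]
y = [4.0, 1.0]
base = euclidean(x, y)

# Append three irrelevant attributes on which both records nearly agree.
x_ext = x + [0.50, 0.20, 0.90]
y_ext = y + [0.51, 0.19, 0.90]
extended = euclidean(x_ext, y_ext)

print(f"2 dimensions: {base:.5f}")
print(f"5 dimensions: {extended:.5f}")
```

The two scores are almost identical, so the extra dimensions add computation and make interpretation harder without changing which records are close together.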

The calculation of distance scores usually requires several assumptions. For instance, when the data concerns persons, it is assumed that persons of the same type are close together in the data space. Another assumption may be that persons of the same type show the same behaviour.

As a by-product of clustering, isolated points, so-called outliers, can often be identified in the dataset. Although techniques exist that find outliers directly, most techniques first find a strong clustering and then report those points that do not conform to any of the clusters found.
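
The cluster-first approach to outliers can be sketched as follows. This is only one possible reading of "does not conform to any cluster": it assumes that a clustering has already been found and is summarised by cluster centroids, and that a point counts as an outlier when it lies further than a chosen threshold from every centroid. The names `find_outliers` and `max_distance`, as well as the data, are hypothetical.

```python
import math


def euclidean(p, q):
    """Unweighted Euclidean distance between two equal-length points."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def find_outliers(points, centroids, max_distance):
    """Return the points that lie further than max_distance from every centroid."""
    outliers = []
    for p in points:
        nearest = min(euclidean(p, c) for c in centroids)
        if nearest > max_distance:
            outliers.append(p)
    return outliers


# Assumed result of an earlier clustering step: two cluster centres.
centroids = [(0.0, 0.0), (10.0, 10.0)]

# Hypothetical data: four points near a centre, one isolated point.
points = [(0.5, -0.2), (9.6, 10.3), (5.0, 5.0), (-0.4, 0.1), (10.2, 9.9)]

outliers = find_outliers(points, centroids, max_distance=2.0)
print(outliers)
```

The threshold is a design choice: a small value reports many borderline points, while a large value reports only points that are far from every cluster.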

2.4.3 Pattern Mining

The third and last class is that of the pattern mining techniques. Pattern mining is also unsupervised, as no labels are required. Whereas clustering and classification techniques try to build global models of the data, pattern mining aims at the identification of locally valid, surprising patterns. Although, technically speaking, a large collection of many small patterns could be considered a global model of the data, the quality of the patterns is not measured in terms of how well together

[17] The multidimensional ($n$ dimensions) equation of Pythagoras states that the distance $d$ between $x$ and $y$ is

$$d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

or, with weights $w_i$ and normalization on $x_i$,

$$d = \sqrt{\sum_{i=1}^{n} w_i \left( \frac{x_i - y_i}{x_i} \right)^2}.$$

[18] In general, the complexity should be no higher than n log n, where n is the number of records; see Adriaans, P. and Zantinge, D. (1996).

[19] Irrelevant dimensions may be added, but for these dimensions $x_i \approx y_i$, which means that the resulting distance calculated by the multidimensional equation of Pythagoras is hardly influenced by these extra dimensions.
