Most clustering methods use distance scores for the calculation of similarity. It
is important to realise that distance scores, which express relative distances
between data objects, may be calculated in different ways. First of all, it depends
on the data type whether distances can be calculated at all. If this is possible, the
distances first need to be normalised (i.e., expressed in terms of a particular
standard distance) and may then be calculated using the multidimensional
equation of Pythagoras, usually called the Euclidean distance. 17 A weighting of the
distances is also possible, if particular attributes are considered more important.
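To make the calculation concrete, the following is a minimal sketch in Python of a normalised, weighted Euclidean distance. The function name weighted_euclidean, the parameter names and the example records are illustrative assumptions; in particular, the per-attribute "scales" stand in for whatever standard distance is chosen for normalisation, which the text does not prescribe.

```python
import math

def weighted_euclidean(x, y, weights=None, scales=None):
    """Distance between two equal-length numeric records.

    weights: optional per-attribute importance (defaults to 1.0 each).
    scales:  optional per-attribute standard distance used to normalise
             each difference (defaults to 1.0 each). Both are assumptions
             of this sketch, not values given in the text.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of dimensions")
    n = len(x)
    weights = [1.0] * n if weights is None else weights
    scales = [1.0] * n if scales is None else scales
    total = 0.0
    for xi, yi, wi, si in zip(x, y, weights, scales):
        # Normalise the difference, square it, and apply the weight.
        total += wi * ((xi - yi) / si) ** 2
    return math.sqrt(total)

# Two hypothetical persons described by (age in years, income in euros).
a = (30, 40000)
b = (40, 42000)

# Without normalisation the income attribute dominates the distance.
unscaled = weighted_euclidean(a, b)

# Dividing by an assumed standard distance per attribute makes them comparable.
scaled = weighted_euclidean(a, b, scales=(50, 100000))
```

In this example the unscaled distance is driven almost entirely by income, whereas after normalisation the age difference contributes most. This illustrates why normalisation should precede the distance calculation when attributes are measured in different units.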
It should be mentioned that the number of dimensions n included in the
clustering method might need to be limited for several reasons. For instance, the
complexity of the clustering method should not be too high, in order to retain
reasonable calculation times. 18 But high-dimensional spaces also make it difficult
to interpret the results, since it may be hard to apply intuition. And, finally, the
distance scores between any two data points in high-dimensional spaces will not
really be different from the scores in lower-dimensional spaces if the extra
dimensions are not relevant. 19
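The last point can be illustrated with a small numerical sketch in Python. The records and the three extra attributes are invented for illustration; the assumption is that on irrelevant dimensions the two records have nearly the same value.

```python
import math

def euclidean(x, y):
    """Plain Euclidean distance between two equal-length records."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Two hypothetical records in two relevant dimensions.
x = [1.0, 5.0]
y = [4.0, 1.0]
base = euclidean(x, y)

# Append three assumed irrelevant attributes on which the records nearly coincide.
x_ext = x + [0.50, 0.20, 0.90]
y_ext = y + [0.51, 0.19, 0.90]
extended = euclidean(x_ext, y_ext)

# The extra dimensions add almost nothing to the squared sum.
difference = extended - base
```

Here the distance in five dimensions differs from the distance in two dimensions only by a negligible amount, while the cost of computing and interpreting it has grown.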
The calculation of distance scores usually requires several assumptions. For
instance, when the data concerns persons, it is assumed that persons of the same
type are close together in the data space. Another assumption may be that persons
of the same type show the same behaviour.
As a by-product of clustering, isolated points, so-called outliers, can often be
identified in the dataset. Although techniques exist that find outliers directly,
most techniques first find a strong clustering and then report those points that
do not conform to any of the clusters found.
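The clustering-based approach to outliers might be sketched as follows in Python. This is a simplified illustration, not a specific published technique: it assumes the clustering step has already produced cluster centres, and it treats a point as non-conforming when it lies farther than an assumed threshold from every centre. The names find_outliers and max_distance and the sample data are hypothetical.

```python
import math

def euclidean(x, y):
    """Plain Euclidean distance between two equal-length records."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def find_outliers(points, centroids, max_distance):
    """Return the points lying farther than max_distance from every centroid."""
    outliers = []
    for p in points:
        nearest = min(euclidean(p, c) for c in centroids)
        if nearest > max_distance:
            outliers.append(p)
    return outliers

# Hypothetical dataset with two tight groups and one isolated point.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
          (5.0, 5.0), (5.1, 4.8),
          (9.0, 0.5)]

# Cluster centres assumed to come from an earlier clustering step.
centroids = [(1.0, 1.0), (5.05, 4.9)]

outliers = find_outliers(points, centroids, max_distance=1.0)
```

The threshold determines what counts as conforming to a cluster; in practice it would have to be chosen relative to the spread of the clusters rather than fixed in advance.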
2.4.3 Pattern Mining
The third and last class is that of the pattern mining techniques. Pattern mining is
also unsupervised as no labels are required. Whereas clustering and classification
techniques try to build global models of the data, pattern mining aims at the
identification of locally valid, surprising patterns. Although technically speaking,
a large collection of many small patterns could be considered a global model of
the data, the quality of the patterns is not measured in terms of how well together
17 The multidimensional (n dimensions) equation of Pythagoras states that the distance (d)
between x and y is
$d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
or, with weights $w_i$ and normalization on $x_i$,
$d = \sqrt{\sum_{i=1}^{n} w_i \left( \frac{x_i - y_i}{x_i} \right)^2}$.
18 In general, the complexity should be no higher than n log n, where n is the number of
records; see Adriaans, P. and Zantinge, D. (1996).
19 Irrelevant dimensions may be added, but for these dimensions $x_i \approx y_i$, which means that the
resulting distance calculated by the multidimensional equation of Pythagoras is hardly
influenced by these extra dimensions.