Database Reference
In-Depth Information
o
*
*
o
o
o
*
*
*
*
o
o
*
*
*
o
o
o
o
*
*
o
o
*
o
*
*
o
o
o
*
o
o
*
*
*
o
o
o
(A)
(B)
(C)
Fig. 2.4 Several methods of classification. A: Linear classification; B: A threshold, a
particular type of linear classification; C: Non-linear classification.
An important point is the way in which the class boundaries are set in the first
place. This may be done on the basis of an existing model or with the help of an
example-based method. Existing models are dependent on the context of the data.
Often, classes are chosen in such a way that they are similar in size or that they
contain equal numbers of persons or an equal amount of data. An often-used
example of equally sized classes is the five-year classes used in the classification
of ages. For classification based on equal numbers of persons or equal amounts of
data per class, the usual method is to determine the average and standard deviation
of the distribution and then determine the class boundaries in these terms. The
standard deviation is an indication of the extent to which the persons or the data
differ from the average. 13
Example-based methods determine class boundaries on the basis of a sample of
the data. This sample should be representative of the data, which means that the
composition of the sample should be comparable to the composition of the data.
Usually, when the sample is large enough and taken at random, this is the case.
Class boundaries may be determined on the basis of a sample using the clustering
techniques described in the previous subsection, or on the basis of an ad-hoc
hypothesis.
2.4.2 Clustering
The second large class of techniques is that of clustering . In clustering the goal is
to divide a given dataset into homogeneous subsets. As the application of
clustering does not require a set of pre-classified examples, it is often called an
unsupervised technique. Whereas classification requires a “teacher” supervising
=
2
(
x
x
)
i
s
=
i
1
where x
13 The standard deviation of an attribute x is expressed as:
n
1
is the average of all x's and n is the total number of x's.              Search WWH ::

Custom Search