What Is Data Mining and How Does It Work? - Discrimination and Privacy in the Information Society


Fig. 2.4 Several methods of classification. A: Linear classification; B: A threshold, a particular type of linear classification; C: Non-linear classification.
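The three panels can be read as three kinds of decision rule. The following is a minimal sketch, not taken from the book: the function names, the class labels "o" and "*", and every numeric parameter (cut-off, weights, circle centre and radius) are assumptions chosen purely for illustration.

```python
def threshold_classifier(x, threshold=50.0):
    """Panel B: a single attribute is compared against one cut-off value.

    The cut-off of 50.0 is a hypothetical value.
    """
    return "*" if x >= threshold else "o"


def linear_classifier(x, y, w_x=1.0, w_y=1.0, bias=-100.0):
    """Panel A: the boundary is the straight line w_x*x + w_y*y + bias = 0.

    The weights and bias are hypothetical values.
    """
    return "*" if w_x * x + w_y * y + bias >= 0 else "o"


def nonlinear_classifier(x, y, cx=50.0, cy=50.0, radius=20.0):
    """Panel C: a curved boundary; a circle is assumed here as one example."""
    inside = (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
    return "*" if inside else "o"
```

Setting `w_y` to zero and `bias` to the negated cut-off reduces the linear rule to the threshold rule, which is one way to see why a threshold is a particular type of linear classification.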

An important point is how the class boundaries are set in the first place. This may be done on the basis of an existing model or with the help of an example-based method. Existing models depend on the context of the data. Often, classes are chosen so that they are similar in size or contain equal numbers of persons or equal amounts of data. A common example of equally sized classes is the five-year classes used to classify ages. For classification based on equal numbers of persons or equal amounts of data per class, the usual method is to determine the average and standard deviation of the distribution and then express the class boundaries in these terms. The standard deviation indicates the extent to which the persons or the data differ from the average.[13]
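Both ways of setting boundaries can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and placing boundaries at the mean and at one standard deviation on either side is an assumption, since the text does not say which multiples are used. Whether the resulting classes actually hold equal numbers of persons depends on the distribution of the data.

```python
import statistics


def five_year_class(age):
    """Return the (lowest, highest) age of the five-year class containing age."""
    lower = (age // 5) * 5
    return (lower, lower + 4)


def boundaries_from_mean_std(values, multiples=(-1.0, 0.0, 1.0)):
    """Express class boundaries as mean + k * standard deviation.

    The multiples k are an assumed choice. statistics.stdev uses the
    n - 1 denominator, i.e. the sample standard deviation.
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [mean + k * std for k in multiples]


def assign_class(value, boundaries):
    """Return the class index: the number of boundaries at or below value."""
    return sum(value >= b for b in boundaries)
```

For example, `five_year_class(37)` gives `(35, 39)`, and for the values 20, 30, 40, 50, 60 the middle boundary returned by `boundaries_from_mean_std` is the mean, 40.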

Example-based methods determine class boundaries on the basis of a sample of the data. This sample should be representative of the data, which means that the composition of the sample should be comparable to the composition of the data. This is usually the case when the sample is large enough and taken at random. Class boundaries may be determined on the basis of a sample using the clustering techniques described in Section 2.4.2, or on the basis of an ad-hoc hypothesis.
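An example-based boundary might be derived as follows. This is a rough sketch under stated assumptions, not the book's method: the function name `sample_boundary` is hypothetical, the data have a single numeric attribute, exactly two classes are wanted, and a simple one-dimensional two-means procedure stands in for the clustering techniques the text refers to.

```python
import random


def sample_boundary(data, sample_size, seed=0, iterations=20):
    """Estimate one class boundary from a random sample of the data.

    Two centres are refined by assigning every sampled value to its
    nearest centre and recomputing each centre as the mean of its group.
    The boundary is the midpoint between the two final centres.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(data, sample_size)  # random draw without replacement

    lo, hi = min(sample), max(sample)  # initial centres
    for _ in range(iterations):
        near_lo = [v for v in sample if abs(v - lo) <= abs(v - hi)]
        near_hi = [v for v in sample if abs(v - lo) > abs(v - hi)]
        if not near_hi:  # all sampled values coincide with one centre
            break
        lo = sum(near_lo) / len(near_lo)
        hi = sum(near_hi) / len(near_hi)
    return (lo + hi) / 2
```

The boundary is only as good as the sample: if the sample is too small or not drawn at random, one of the groups may be missing or under-represented, and the midpoint will be misplaced.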

2.4.2 Clustering

The second large class of techniques is that of clustering. In clustering, the goal is to divide a given dataset into homogeneous subsets. As the application of clustering does not require a set of pre-classified examples, it is often called an unsupervised technique. Whereas classification requires a “teacher” supervising



[13] The standard deviation of an attribute x is expressed as:

$$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} $$

where $\bar{x}$ is the average of all x's and n is the total number of x's.
