the process by giving examples of the different classes, hoping that the classifier
will learn to generalize from them; in clustering, such supervision is not required.
Example 3 (Clustering) Consider the set of all web-pages returned by a keyword
search “bush” on the Web. The resulting set of documents will contain documents
about the former president Bush Sr., about the former president George W. Bush,
about a grunge band named Bush, about the brand of beer with the same name, and
perhaps also documents about vegetation. A clustering algorithm would, without
interaction from the user or a pre-defined taxonomy, group similar documents
together. Partitional clustering methods would do so by dividing the data into
disjoint groups, whereas hierarchical clustering algorithms produce a complete
taxonomy.
Closely related to clustering is outlier detection. In outlier detection, one tries to
identify those objects that are unlike most other objects. Such outliers could
indicate, for instance, errors in the data (e.g., outside temperatures of over 60
degrees Celsius), or potentially interesting exceptional cases. Conceptually,
outliers could be considered points not belonging to a large cluster, or forming a
cluster by themselves.
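One simple way to flag such errors is a z-score cutoff: points lying many standard deviations from the sample mean are reported as outliers. The sketch below is illustrative, not from the text; the temperature data, function name, and cutoff are assumptions (a modest cutoff is used because a single extreme value in a small sample inflates the standard deviation, capping attainable z-scores).

```python
# A minimal sketch of outlier detection via z-scores (illustrative data).
from statistics import mean, stdev

def zscore_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# 64.0 plays the role of the impossible outside temperature from the text.
temperatures = [18.2, 19.5, 20.1, 21.0, 19.8, 20.4, 64.0]
print(zscore_outliers(temperatures))  # flags the erroneous reading
```

Note that a single extreme value also distorts the mean and standard deviation it is judged against; robust variants replace them with the median and median absolute deviation.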
An important factor in clustering is the order in which data points are compared
with each other. Some important methods for determining this are hierarchical
clustering, k-means clustering, and neural network clustering. 14 Hierarchical
clustering starts by combining cases and clusters that are similar to each other, one
pair at a time. In each step, the closest pair of cases/clusters is merged. This is
repeated until the distance between the closest remaining clusters exceeds a
predetermined threshold.
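The merge-closest-pair loop just described can be sketched for one-dimensional data. The single-linkage distance, the sample points, and the threshold below are illustrative assumptions, not taken from the text:

```python
# A minimal sketch of agglomerative (hierarchical) clustering on 1-D points:
# repeatedly merge the closest pair of clusters until the smallest
# inter-cluster distance exceeds a threshold.
def single_linkage(a, b):
    """Distance between two clusters: closest pair of member points."""
    return min(abs(x - y) for x in a for y in b)

def hierarchical(points, threshold):
    clusters = [[p] for p in points]       # every point starts as its own cluster
    while len(clusters) > 1:
        # find the closest pair of clusters
        (i, j), d = min(
            (((i, j), single_linkage(clusters[i], clusters[j]))
             for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda t: t[1])
        if d > threshold:                  # stop: remaining clusters are too far apart
            break
        clusters[i].extend(clusters.pop(j))  # merge the pair
    return clusters

print(hierarchical([1.0, 1.2, 5.0, 5.1, 9.0], threshold=1.0))
# → [[1.0, 1.2], [5.0, 5.1], [9.0]]
```

Recording the sequence of merges, rather than only the final partition, yields the complete taxonomy (dendrogram) mentioned in Example 3.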
In k-means clustering, it is assumed that the data falls into a known number (k) of
clusters. First, a random profile is defined for each cluster. These profiles are
called cluster centres. Next, each data point is assigned to the cluster centre to
which it is most similar; the centres are then recomputed from their assigned
points, and the assignment is repeated until it stabilizes. Neural network
clustering starts from so-called nodes that
work similarly to the neurons in the human brain. Each node computes the
weighted sum of its inputs (e.g., the distance of other nodes) and after a certain
threshold is subtracted, the result is passed to a non-linear function, e.g., a sigmoid
function. 15 The result of this function determines the importance of the node as a
clustering centre. Neural networks are constructed by connecting the output of a
node to the input of one or more other nodes. 16 It is important to select appropriate
weights and thresholds. The network can also 'learn', i.e., the weights and
thresholds may be adjusted after the network's output on several examples is
compared with the desired output. In this way, strong connections are kept and
weak connections are discarded.
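The assign-then-recompute cycle of k-means can be sketched on one-dimensional data. The sample points, the random initialization, and the fixed iteration budget below are illustrative assumptions; practical implementations stop when the assignments no longer change:

```python
# A minimal sketch of k-means clustering on 1-D data.
import random

def kmeans(points, k, iterations=10, seed=0):
    rng = random.Random(seed)
    centres = rng.sample(points, k)          # random initial cluster profiles
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # assignment step: each point goes to its most similar (nearest) centre
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # update step: recompute each centre as the mean of its cluster
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters

centres, clusters = kmeans([1.0, 1.2, 5.0, 5.1, 9.0], k=2)
```

Because the initial centres are random, different seeds can yield different partitions; k-means is therefore usually restarted several times and the best result kept.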
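The node computation described above (a weighted sum of inputs, minus a threshold, passed through a sigmoid) can be sketched directly; the particular weights, inputs, and threshold are illustrative assumptions:

```python
# A minimal sketch of a single node's computation: weighted input sum,
# threshold subtraction, then a sigmoid non-linearity.
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def node_output(inputs, weights, threshold):
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(weighted_sum - threshold)

# Illustrative inputs, weights, and threshold.
print(node_output([0.5, 0.8], weights=[1.2, -0.4], threshold=0.1))
```

The output always lies between 0 and 1, which is what allows it to be read as the node's importance as a clustering centre; connecting such outputs to the inputs of other nodes builds up the network.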
14 SPSS Inc. (1999).
15 Hence, each node computes a function y = f(∑_i w_i x_i − θ), where f(x) = 1/(1 + e^(−x)) is the sigmoid function and w_i are the weights.
16 Holsheimer, M., and Siebes, A. (1991).