Introduction - Data Preprocessing in Data Mining

agglomerative ones start by considering each example as a cluster and iteratively merge clusters until a stopping criterion is satisfied. Partitioning-based clustering, with the k-Means algorithm as its most representative method, starts with a fixed number k of clusters and iteratively adds examples to or removes them from those clusters until no further improvement is achieved in minimizing an intra-cluster and/or inter-cluster distance measure. As usual when distance measures are involved, numeric data is preferable, together with the absence of missing values, noise and outliers. Other well-known examples of clustering algorithms are COBWEB and Self-Organizing Maps.
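
The partitioning loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it assumes numeric tuples with no missing values, uses squared Euclidean distance, and takes the first k examples as initial centroids purely to keep the run deterministic. The names `k_means` and `squared_distance` are hypothetical helpers introduced for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100):
    """Partition `points` into k clusters; returns (centroids, labels)."""
    # Assumption: the first k examples serve as initial centroids.
    centroids = list(points[:k])
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each example to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no example changed cluster, so stop
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, labels


# Two visually separated groups of two-dimensional examples.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = k_means(points, k=2)
```

Because every step relies on a distance computation, a single missing value or extreme outlier in `points` would distort the centroids, which is why the preprocessing requirements mentioned above matter for this family of methods.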

Association Rules: a set of techniques that aim to find association relationships in the data. The typical application of these algorithms is the analysis of retail transaction data [1]. For example, the analysis would aim to find the likelihood that when a customer buys product X, she would also buy product Y. Association rule algorithms can also be formulated to look for sequential patterns. Because the data usually needed for association analysis is transaction data, the data volumes are very large. Also, transactions are expressed by categorical values, so the data must be discretized. Data transformation and reduction are often needed to perform high-quality analysis in this DM problem. The Apriori algorithm is the most emblematic technique to address this problem.
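
The "likelihood that a customer who buys X also buys Y" is commonly measured with the support and confidence of a rule X -> Y. The sketch below computes both on a toy set of transactions. The transaction data and the helper names `support` and `confidence` are assumptions made for illustration; this is not the Apriori algorithm itself, only the two measures that such algorithms typically evaluate.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated likelihood of `consequent` among transactions with `antecedent`."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    both = support(transactions, set(antecedent) | set(consequent))
    return both / antecedent_support


# Hypothetical retail transactions, each a set of categorical items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

bread_support = support(transactions, {"bread"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
```

Here bread appears in three of the four transactions, and two of those three also contain milk, so the rule bread -> milk has a confidence of two thirds.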

1.3 Supervised Learning

In the DM community, prediction methods are commonly referred to as supervised learning. Supervised methods attempt to discover the relationships between input attributes (sometimes called variables or features) and a target attribute (sometimes referred to as the class). The relationship sought is represented in a structure called a model. Generally, a model describes and explains experiences that are hidden in the data and that can be used to predict the value of the target attribute when the values of the input attributes are known. Supervised learning is present in many application domains, such as finance, medicine and engineering.
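
As a rough illustration of the idea of a model, the sketch below builds one from labeled examples and uses it to predict the target attribute of an unseen example. A 1-nearest-neighbor classifier is chosen here only because it is short; it is an assumption of this example, not a method singled out by the text, and `fit_1nn`, `predict` and the toy data are hypothetical.

```python
def fit_1nn(training_set):
    """Build a 1-nearest-neighbor model: it simply stores the labeled examples."""
    return list(training_set)


def predict(model, inputs):
    """Predict the target of `inputs` as the target of the closest stored example."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, target = min(model, key=lambda example: sq_dist(example[0], inputs))
    return target


# Each training example pairs a vector of input attributes with a target value.
training_set = [
    ((5.0, 1.0), "low"),
    ((6.0, 1.5), "low"),
    ((1.0, 7.0), "high"),
    ((1.5, 8.0), "high"),
]

model = fit_1nn(training_set)
prediction = predict(model, (1.2, 7.5))  # input attributes known, target unknown
```

The stored examples play the role of the model, and `predict` applies it to an example whose input attribute values are known but whose target value is not.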

In a typical supervised learning scenario, a training set is given and the objective is to form a description that can be used to predict unseen examples. This training set can be described in a variety of ways. The most common is to describe it by a set of instances, which is basically a collection of tuples that may contain duplicates. Each tuple is described by a vector of attribute values. Each attribute has an associated domain of values, which are known prior to the learning task. Attributes are typically one of two types: nominal or categorical (whose values are members of an unordered set), or numeric (whose values are integers or real numbers, and an order is assumed). Nominal attributes have a finite cardinality, whereas numeric attribute domains are delimited by lower and upper bounds. The instance space (the set of possible examples) is defined as the Cartesian product of all the input attribute domains. The
