make the prediction, their influence when using voting or weighting mechanisms, and the use of efficient algorithms to find the nearest examples, such as KD-Trees or hashing schemes. The K-Nearest Neighbor (KNN) method is the most widely applied and best-known method of this kind in DM. Nevertheless, it suffers from several drawbacks, such as high storage requirements, low efficiency in prediction response and low noise tolerance. Thus, it is a good candidate for improvement through data reduction procedures.
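As a concrete illustration, the following minimal sketch combines distance-weighted voting with a KD-Tree neighbor search using scikit-learn; the synthetic dataset and parameter values are illustrative assumptions, not taken from the text.

# Distance-weighted KNN backed by a KD-Tree (scikit-learn sketch).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(
    n_neighbors=5,        # number of nearest examples consulted
    weights="distance",   # closer neighbors receive larger votes
    algorithm="kd_tree",  # efficient structure for the neighbor search
)
knn.fit(X_train, y_train)  # "training" merely stores the examples
print(knn.score(X_test, y_test))

Note how the whole training set must be stored and queried at prediction time, which is precisely the storage and efficiency drawback that data reduction procedures try to alleviate.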
•
Support Vector Machines: SVMs are machine learning algorithms based on statistical learning theory [30]. They are similar to ANNs in the sense that they are used for estimation and perform very well when the data is linearly separable. SVMs usually do not require the generation of interaction terms among variables, as regression methods do, a fact that can save some data preprocessing steps. Like ANNs, they require numeric, non-missing data and are generally robust against noise and outliers.
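Because SVMs require numeric, non-missing inputs, a typical workflow imputes and scales the data before fitting. The following sketch, assuming scikit-learn and a tiny made-up data matrix, reflects those preprocessing needs.

# SVM pipeline handling the numeric, non-missing input requirement
# (scikit-learn sketch; data and settings are illustrative).
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 6.0], [5.0, np.nan]])
y = np.array([0, 0, 1, 1])

model = make_pipeline(
    SimpleImputer(strategy="mean"),  # SVMs cannot handle missing values
    StandardScaler(),                # margin computation benefits from scaling
    SVC(kernel="rbf", C=1.0),        # a kernel covers non-linear cases
)
model.fit(X, y)
print(model.predict([[2.0, 4.0]]))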
Regarding symbolic methods, we mention the following:
•
Rule Learning: also called separate-and-conquer or covering rule algorithms [12]. All these methods share the same main operation: they search for a rule that explains some part of the data, separate those examples, and recursively conquer the remaining examples. There are many ways of doing this, and also many ways to interpret the resulting rules and to use them in the inference mechanism. From the point of view of data preprocessing, generally speaking, these methods require nominal or discretized data (although this task is frequently implicit in the algorithm) and incorporate an innate selector of interesting attributes from the data. However, MVs, noisy examples and outliers may harm the performance of the final model. Good examples of these models are the algorithms AQ, CN2, RIPPER, PART and FURIA.
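To make the separate-and-conquer operation concrete, here is a toy sketch in pure Python with binary features and greedy single-condition rules; it is an illustrative simplification, not a faithful implementation of AQ, CN2, RIPPER, PART or FURIA.

# Toy separate-and-conquer: find a rule covering mostly-positive
# examples, separate (remove) the covered examples, conquer the rest.
def learn_rules(X, y):
    remaining = list(range(len(X)))
    rules = []
    while any(y[i] == 1 for i in remaining):
        best = None
        for feat in range(len(X[0])):
            for val in (0, 1):
                covered = [i for i in remaining if X[i][feat] == val]
                pos = sum(1 for i in covered if y[i] == 1)
                if covered and (best is None or pos / len(covered) > best[0]):
                    best = (pos / len(covered), feat, val, covered)
        purity, feat, val, covered = best
        if purity == 0:  # no rule explains any positive example: stop
            break
        rules.append((feat, val))  # separate: keep the found rule
        remaining = [i for i in remaining if i not in covered]  # conquer the rest
    return rules

X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
print(learn_rules(X, y))  # [(0, 1)]: IF feature 0 == 1 THEN positive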
•
Decision Trees: predictive models formed by iterations of a divide-and-conquer scheme of hierarchical decisions [28]. They work by attempting to split the data using one of the independent variables so as to separate the data into homogeneous subgroups. The final form of the tree can be translated into a set of If-Then-Else rules, one per path from the root to each of the leaf nodes. Hence, they are closely related to rule learning methods and suffer from the same disadvantages. The best-known decision tree algorithms are CART, C4.5 and PUBLIC.
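The translation from a tree into If-Then rules can be inspected directly; the sketch below fits scikit-learn's CART-style learner on the Iris dataset (chosen only for illustration) and prints the equivalent rule structure.

# Fit a small tree and print it as its equivalent If-Then structure.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each root-to-leaf path below reads as one If-Then-Else rule.
print(export_text(tree, feature_names=list(iris.feature_names)))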
Regarding the descriptive task, we prefer to categorize the usual problems rather than the methods, since the methods involved are intrinsically related to those used in predictive learning.
•
Clustering: this task appears when there is no class information to be predicted, but the examples must be divided into natural groups or clusters [2]. These clusters reflect subgroups of examples that share some properties or have some similarities. Clustering methods work by calculating a multivariate distance measure between observations and grouping together the observations that are most closely related. Roughly speaking, they belong to three broad categories: agglomerative clustering, divisive clustering and partitioning clustering. The former two are hierarchical types of clustering, opposite to one another: the divisive approach applies recursive divisions to the entire data set, whereas the agglomerative one starts from single-example clusters and successively merges the closest ones.