Data Reduction - Data Preprocessing in Data Mining

Graphics Reference

In-Depth Information

If data includes consumer loans and mortgages, it seems logical to partition both

types of loans because the parameters and quantities involved in each one are

completely different. Thus, it is a good idea to build separate models on each

partition.

•

To assist regarding the balance of data and occurrence of rare events . Predictive

DM algorithms like ANNs or decision trees are very sensitive to imbalanced data

sets. An imbalanced data set is one in which one category of the target variable

is less represented compared to the other ones and, usually, this category has is

more important from the point of view of the learning task. Balancing the data

involves sampling the imbalanced categories more than average (over-sampling)

or sampling the common less often (under-sampling) [ 3 ].

•

To divide a data set into three data sets to carry out the subsequent analysis of DM

algorithms . As we have described in Chap. 2 , the original data set can be divided

into the training set and testing set. A third kind of division can be performed

within the training set, to aid the DM algorithm to avoid model over-fitting, which

is a very common strategy in ANNs and decision trees. This partition is usually

known as validation set, although, in various sources, it may be denoted as the

testing set interchangeably [ 22 ]. Whatever the nomenclature used, some learners

require an internal testing process and, in order to evaluate and compare a set

of algorithms, there must be an external testing set independent of training and

containing unseen cases.

Various forms of data sampling are known in data reduction. Suppose that a large

data set, T , contains N examples. The most common ways that we could sample T

for data reduction are [ 11 , 24 ]:

•

Simple random sample without replacement (SRSWOR) of size s :Thisis

created by drawing s of the N tuples from T ( s

<

N ), where the probability

of drawing any tuple in T is 1

/

N , that is, all examples have equal chance to be

sampled.

•

Simple random sample with replacement (SRSWR) of size s : This is similar

to SRSWOR, except that each time a tuple is drawn from T , it is recorded and

replaced. In other words, after an example is drawn, it is placed back in T and it

may be drawn again.

•

Balanced sample : The sample is designed according to a target variable and

is forced to have a certain composition according to a predefined criterion. For

example, 90% of customers who are older tah or who are 21 years old, and 10%

of customers who are younger than 21 years old. One of the most successful

application of this type of sampling has been shown in imbalanced learning, as we

have mentioned before.

•

Cluster sample : If the tuples in T are grouped into G mutually disjointed groups

or clusters, then an SRS of s clusters can be obtained, where s

G . For example,

in spatial data sets, we may choose to define clusters geographically based on how

closely different areas are located.

<

•

Stratified sample :If T is divided into mutually disjointed parts called strata ,

a stratified sample of T is generated by obtaining an SRS at each stratum. This

Data Preprocessing in Data Mining

Search WWH ::

Custom Search

Home