Graphics Reference
In-Depth Information
If data includes consumer loans and mortgages, it seems logical to partition both
types of loans because the parameters and quantities involved in each one are
completely different. Thus, it is a good idea to build separate models on each
partition.
To assist regarding the balance of data and occurrence of rare events . Predictive
DM algorithms like ANNs or decision trees are very sensitive to imbalanced data
sets. An imbalanced data set is one in which one category of the target variable
is less represented compared to the other ones and, usually, this category has is
more important from the point of view of the learning task. Balancing the data
involves sampling the imbalanced categories more than average (over-sampling)
or sampling the common less often (under-sampling) [ 3 ].
To divide a data set into three data sets to carry out the subsequent analysis of DM
algorithms . As we have described in Chap. 2 , the original data set can be divided
into the training set and testing set. A third kind of division can be performed
within the training set, to aid the DM algorithm to avoid model over-fitting, which
is a very common strategy in ANNs and decision trees. This partition is usually
known as validation set, although, in various sources, it may be denoted as the
testing set interchangeably [ 22 ]. Whatever the nomenclature used, some learners
require an internal testing process and, in order to evaluate and compare a set
of algorithms, there must be an external testing set independent of training and
containing unseen cases.
Various forms of data sampling are known in data reduction. Suppose that a large
data set, T , contains N examples. The most common ways that we could sample T
for data reduction are [ 11 , 24 ]:
Simple random sample without replacement (SRSWOR) of size s :Thisis
created by drawing s of the N tuples from T ( s
<
N ), where the probability
of drawing any tuple in T is 1
/
N , that is, all examples have equal chance to be
sampled.
Simple random sample with replacement (SRSWR) of size s : This is similar
to SRSWOR, except that each time a tuple is drawn from T , it is recorded and
replaced. In other words, after an example is drawn, it is placed back in T and it
may be drawn again.
Balanced sample : The sample is designed according to a target variable and
is forced to have a certain composition according to a predefined criterion. For
example, 90% of customers who are older tah or who are 21 years old, and 10%
of customers who are younger than 21 years old. One of the most successful
application of this type of sampling has been shown in imbalanced learning, as we
have mentioned before.
Cluster sample : If the tuples in T are grouped into G mutually disjointed groups
or clusters, then an SRS of s clusters can be obtained, where s
G . For example,
in spatial data sets, we may choose to define clusters geographically based on how
closely different areas are located.
<
Stratified sample :If T is divided into mutually disjointed parts called strata ,
a stratified sample of T is generated by obtaining an SRS at each stratum. This
 
Search WWH ::




Custom Search