Stratified sampling assists in ensuring a representative sample. It is frequently used in classification tasks where class imbalance is present. It is very closely related to balanced sampling, but here the predefined composition of the final result matches the natural distribution of the target variable.
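To make the idea concrete, the following is a minimal sketch of stratified sampling in Python; the function stratified_sample, its fraction parameter and the toy data are illustrative assumptions, not part of any particular library:

```python
import numpy as np

def stratified_sample(X, y, fraction, rng=None):
    """Draw a stratified sample: each class keeps (roughly) the same
    proportion in the sample as in the original data set."""
    rng = np.random.default_rng(rng)
    idx = []
    for cls in np.unique(y):
        members = np.flatnonzero(y == cls)
        n = max(1, round(fraction * members.size))  # keep at least one per class
        idx.extend(rng.choice(members, size=n, replace=False))
    idx = np.array(idx)
    return X[idx], y[idx]

# Toy usage: a 9:1 imbalanced data set sampled at 20%
X = np.random.rand(1000, 4)
y = np.array([0] * 900 + [1] * 100)
Xs, ys = stratified_sample(X, y, fraction=0.2, rng=0)
print(np.bincount(ys))  # roughly [180, 20], preserving the 9:1 ratio
```

For a balanced sample, one would instead fix the per-class count n in advance rather than deriving it from the natural class frequencies, which is exactly the difference noted above.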
An important advantage of sampling for data reduction is that the cost of obtaining
a sample is proportional to the size of the sample s, instead of being proportional
to N. Hence, the sampling complexity is sub-linear in the size of the data, and no
complete pass over T is needed to decide whether or not to include a certain example
in the sampled subset. Nevertheless, examples are included by uninformed decisions,
allowing redundant, irrelevant, noisy or harmful examples to enter the subset. A
smarter way of making these decisions is known as IS, a topic that we will extend
in Chap. 8.
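The sub-linear cost can be illustrated with a small sketch, under the assumption that T supports random access (e.g., an in-memory or memory-mapped array); the variable names are illustrative:

```python
import numpy as np

# Simple random sampling without replacement: we draw s indices and
# read only those s rows, so the work grows with s and no complete
# pass over the N examples is required.
N, s = 10_000_000, 1_000
rng = np.random.default_rng(42)
sample_idx = rng.choice(N, size=s, replace=False)  # s distinct indices
# rows = T[sample_idx]  # hypothetical read: touches only s examples of T
```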
Advanced schemes of data sampling deserve to be described in this section. As
before, they are more complex, but allow adjustments of the data that are better
fitted to the necessities of the applications.
6.3.1 Data Condensation
The selection of a small representative subset from a very large data set is known as
data condensation. In some sources of DM, such as [22], this form of data reduction
is differentiated from others. In this book, data condensation is integrated as one of
the families of IS methods (see Chap. 8).
Data condensation emerges from the fact that naive sampling methods, such as
random sampling or stratified sampling, are not suitable for real-world problems
with noisy data since the performance of the algorithms may change unpredictably
and significantly. The data sampling approach practically ignores all of the
information present in the examples that are not chosen for the reduced subset.
Most of the data condensation approaches have been studied for classification
tasks and, in particular, for the KNN algorithm. These methods attempt to obtain a
minimal consistent set, i.e., a minimal set that correctly classifies all the original
examples. The very first method of this kind was the condensed nearest neighbor
rule (CNN) [12]. For a survey on data condensation methods for classification, we
again invite the reader to check Chap. 8 of this book.
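As an illustration of the idea behind CNN, the following is a minimal sketch of Hart's rule using a Euclidean 1-NN classifier; the function name and the toy data are illustrative, and the resulting set is consistent but, as with CNN in general, not guaranteed to be minimal:

```python
import numpy as np

def cnn_condense(X, y):
    """Condensed nearest neighbor rule: grow a store until every
    example in (X, y) is correctly classified by 1-NN over the store."""
    store = [0]                      # seed the store with the first example
    changed = True
    while changed:                   # repeat passes until no additions occur
        changed = False
        for i in range(len(X)):
            if i in store:
                continue
            # 1-NN prediction of example i using only the current store
            d = np.linalg.norm(X[store] - X[i], axis=1)
            nearest = store[int(np.argmin(d))]
            if y[nearest] != y[i]:   # misclassified: absorb into the store
                store.append(i)
                changed = True
    return np.sort(store)

# Toy usage: two well-separated Gaussian classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
S = cnn_condense(X, y)
print(len(S), "of", len(X), "examples retained")  # typically a small subset
```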
Regarding the data condensation methods that are not tied to classification
tasks, termed generic data condensation, condensation is performed by vector
quantization, such as the well-known self-organizing map [19], and by different
forms of data clustering. Another group of generic data condensation methods
builds on density-based techniques, which consider the density function of the data
for the purpose of condensation instead of minimizing the quantization error. These
approaches do not involve any learning process and, hence, are deterministic (i.e.,
for a given input data set, the output condensed set is fixed). Clear examples of this
kind of approach are presented in [10, 21].
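As a hedged illustration of generic condensation by vector quantization, the sketch below uses the centroids of a k-means clustering (here via scikit-learn) as the condensed set; the choice of k and the synthetic data are assumptions made for the example:

```python
import numpy as np
from sklearn.cluster import KMeans

# Label-free condensation by vector quantization: the k cluster
# centroids act as the condensed set, chosen so as to minimize the
# quantization (reconstruction) error of the original data.
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 3))             # unlabeled data set

k = 50                                     # size of the condensed set
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
condensed = km.cluster_centers_            # k prototypes replace 5000 points

# Mean quantization error of the condensed representation
err = np.mean(np.min(km.transform(X), axis=1))
print(condensed.shape, round(err, 3))
```

Note that, unlike the deterministic density-based techniques mentioned above, the quality of this quantization-based sketch depends on the clustering initialization.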
 