Stratified sampling assists in ensuring a representative sample. It is frequently used in classification tasks where class imbalance is present. It is very closely related to balanced sampling, but here the predefined composition of the final result matches the natural distribution of the target variable.
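To make the idea concrete, the following is a minimal sketch of stratified sampling in Python; the function stratified_sample, its fraction parameter and the toy data are illustrative assumptions, not part of any particular library:

```python
import numpy as np

def stratified_sample(X, y, fraction, rng=None):
    """Draw a stratified sample: each class keeps (roughly) the same
    proportion in the sample as in the original data set."""
    rng = np.random.default_rng(rng)
    idx = []
    for cls in np.unique(y):
        members = np.flatnonzero(y == cls)
        n = max(1, round(fraction * members.size))  # keep at least one per class
        idx.extend(rng.choice(members, size=n, replace=False))
    idx = np.array(idx)
    return X[idx], y[idx]

# Toy usage: a 9:1 imbalanced data set sampled at 20%
X = np.random.rand(1000, 4)
y = np.array([0] * 900 + [1] * 100)
Xs, ys = stratified_sample(X, y, fraction=0.2, rng=0)
print(np.bincount(ys))  # roughly [180, 20], preserving the 9:1 ratio
```

For a balanced sample, one would instead fix the per-class count n in advance rather than deriving it from the natural class frequencies, which is exactly the difference noted above.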
An important advantage of sampling for data reduction is that the cost of obtaining
a sample is proportional to the size of the sample s, instead of being proportional
to N. Hence, the sampling complexity is sub-linear in the size of the data, and no
complete pass over T is needed to decide whether or not to include a certain example
in the sampled subset. Nevertheless, examples are included by uninformed decisions,
allowing redundant, irrelevant, noisy or harmful examples to enter the subset. A
smarter way of making these decisions is known as IS, a topic that we will extend
in Chap. 8.
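The sub-linear cost can be illustrated with a small sketch, under the assumption that T supports random access (e.g., an in-memory or memory-mapped array); the variable names are illustrative:

```python
import numpy as np

# Simple random sampling without replacement: we draw s indices and
# read only those s rows, so the work grows with s and no complete
# pass over the N examples is required.
N, s = 10_000_000, 1_000
rng = np.random.default_rng(42)
sample_idx = rng.choice(N, size=s, replace=False)  # s distinct indices
# rows = T[sample_idx]  # hypothetical read: touches only s examples of T
```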
Advanced schemes of data sampling deserve to be described in this section. As
before, they are more complex, but allow adjustments of the data that are better
fitted to the necessities of the applications.
6.3.1 Data Condensation
The selection of a small representative subset from a very large data set is known as
data condensation. In some sources of DM, such as [22], this form of data reduction
is differentiated from others. In this book, data condensation is integrated as one of
the families of IS methods (see Chap. 8).
Data condensation emerges from the fact that naive sampling methods, such as
random sampling or stratified sampling, are not suitable for real-world problems
with noisy data since the performance of the algorithms may change unpredictably
and significantly. The data sampling approach practically ignores all of the
information present in the examples that are not chosen for the reduced subset.
Most of the data condensation approaches have been studied for classification
tasks and, in particular, for the KNN algorithm. These methods attempt to obtain a
minimal consistent set, i.e., a minimal set that correctly classifies all the original
examples. The very first method of this kind was the condensed nearest neighbor
rule (CNN) [12]. For a survey on data condensation methods for classification, we
again invite the reader to check Chap. 8 of this book.
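As an illustration of the idea behind CNN, the following is a minimal sketch of Hart's rule using a Euclidean 1-NN classifier; the function name and the toy data are illustrative, and the resulting set is consistent but, as with CNN in general, not guaranteed to be minimal:

```python
import numpy as np

def cnn_condense(X, y):
    """Condensed nearest neighbor rule: grow a store until every
    example in (X, y) is correctly classified by 1-NN over the store."""
    store = [0]                      # seed the store with the first example
    changed = True
    while changed:                   # repeat passes until no additions occur
        changed = False
        for i in range(len(X)):
            if i in store:
                continue
            # 1-NN prediction of example i using only the current store
            d = np.linalg.norm(X[store] - X[i], axis=1)
            nearest = store[int(np.argmin(d))]
            if y[nearest] != y[i]:   # misclassified: absorb into the store
                store.append(i)
                changed = True
    return np.sort(store)

# Toy usage: two well-separated Gaussian classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
S = cnn_condense(X, y)
print(len(S), "of", len(X), "examples retained")  # typically a small subset
```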
Regarding the data condensation methods that are not tied to classification
tasks, termed generic data condensation, condensation is performed by vector
quantization, such as the well-known self-organizing map [19], and by different
forms of data clustering. Another group of generic data condensation methods
builds on density-based techniques, which consider the density function of the data
for the purpose of condensation instead of minimizing the quantization error. These
approaches do not involve any learning process and, hence, are deterministic (i.e.,
for a given input data set, the output condensed set is fixed). Clear examples of this
kind of approach are presented in [10, 21].
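As a hedged illustration of generic condensation by vector quantization, the sketch below uses the centroids of a k-means clustering (here via scikit-learn) as the condensed set; the choice of k and the synthetic data are assumptions made for the example:

```python
import numpy as np
from sklearn.cluster import KMeans

# Label-free condensation by vector quantization: the k cluster
# centroids act as the condensed set, chosen so as to minimize the
# quantization (reconstruction) error of the original data.
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 3))             # unlabeled data set

k = 50                                     # size of the condensed set
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
condensed = km.cluster_centers_            # k prototypes replace 5000 points

# Mean quantization error of the condensed representation
err = np.mean(np.min(km.transform(X), axis=1))
print(condensed.shape, round(err, 3))
```

Note that, unlike the deterministic density-based techniques mentioned above, the quality of this quantization-based sketch depends on the clustering initialization.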
 