Data Preparation Basic Models - Data Preprocessing in Data Mining - page 53

Fig. 3.1 Example of the histogram spreading made by a Box-Cox transformation: (a) before the transformation and (b) after the transformation

3.5.7 Spreading the Histogram

Spreading the histogram is a special case of Box-Cox transformations. Because a Box-Cox transformation makes the data resemble a normal distribution, the histogram is spread as a side effect, as shown in Fig. 3.1.

When the user is not interested in converting the distribution to a normal one, but only in spreading it, two special cases of Box-Cox transformations can be used [30]. The logarithm (with an offset if necessary) can be used to spread the right side of the histogram: $y = \log(x)$. On the other hand, if we are interested in spreading the left side of the histogram, we can simply use the power transformation $y = x^{g}$.

However, as [30] shows, the power transformation may not be as appropriate as the log transformation, and it presents an important drawback: higher values of $g$ may help to spread the histogram, but they will also cause problems with the available numerical precision.
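The two transformations can be tried on synthetic data. The following is a minimal sketch, not the procedure from [30]. It assumes a log-normal sample (long right tail) and a Beta(5, 1) sample (long left tail), uses sample skewness as a rough proxy for how lopsided each histogram is, and picks g = 3 arbitrarily. The helper names `skewness`, `log_transform`, `power_transform` and `safe_power` are illustrative and do not come from the text.

```python
import math
import random
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / std) ** 3 for v in values])


def log_transform(values, offset=0.0):
    """y = log(x + offset); the offset must make every x + offset positive."""
    return [math.log(v + offset) for v in values]


def power_transform(values, g):
    """y = x ** g."""
    return [v ** g for v in values]


def safe_power(x, g):
    """x ** g, returning infinity when the result exceeds the float range."""
    try:
        return x ** g
    except OverflowError:
        return math.inf


rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed test data: one sample with a long right tail, one with a long left tail.
right_tailed = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]
left_tailed = [rng.betavariate(5.0, 1.0) for _ in range(5000)]

print("log:   skewness before/after:",
      skewness(right_tailed), skewness(log_transform(right_tailed)))
print("power: skewness before/after:",
      skewness(left_tailed), skewness(power_transform(left_tailed, 3)))

# A large exponent applied to a large value leaves the representable range.
print("1e10 ** 40 ->", safe_power(1e10, 40))
```

In both cases the skewness should move toward zero after the transformation. The last line illustrates one way in which a large g can run into the limits of floating-point arithmetic.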

3.5.8 Nominal to Binary Transformation

The presence of nominal attributes in the data set can be problematic, especially if the DM algorithm used cannot handle them correctly. This is the case for SVMs and ANNs. The first option is to transform the nominal variable into a numeric one, in which each nominal value is encoded by an integer, typically starting from 0 or 1 onwards. Although simple, this approach has two major drawbacks that discourage its use:

• With this transformation we assume an ordering of the attribute values, as the integer values are ranked. However, the original nominal values did not present any ranking among them.

• The integer values can be used in operations as numbers, whereas the nominal values cannot. This is even worse than the first point, as with this nominal-to-integer transformation we are establishing unequal differences between pairs of nominal values, which is not correct.
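The two drawbacks can be made concrete with a small sketch. The attribute values (`red`, `green`, `blue`) and the helper names are hypothetical. The one-indicator-per-value binary encoding used for contrast is an assumption about what the section heading refers to, since the text so far only describes the integer encoding.

```python
from itertools import combinations

# Hypothetical nominal attribute with no natural ordering.
colors = ["red", "green", "blue"]


def integer_encode(values, start=0):
    """Map each distinct value to an integer, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = start + len(mapping)
    return mapping


def binary_encode(values):
    """Map each distinct value to a 0/1 vector with a single 1 (assumed scheme)."""
    distinct = list(dict.fromkeys(values))
    return {
        value: [1 if i == j else 0 for j in range(len(distinct))]
        for i, value in enumerate(distinct)
    }


def hamming(u, v):
    """Number of positions in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


int_codes = integer_encode(colors)
bin_codes = binary_encode(colors)

# Differences between every pair of values under each encoding.
int_gaps = {(a, b): abs(int_codes[a] - int_codes[b])
            for a, b in combinations(colors, 2)}
bin_gaps = {(a, b): hamming(bin_codes[a], bin_codes[b])
            for a, b in combinations(colors, 2)}

print("integer codes:", int_codes)
print("integer gaps: ", int_gaps)
print("binary gaps:  ", bin_gaps)
```

Under the integer codes the values acquire an order and the pairwise gaps differ from one pair to another, although nothing in the original attribute justifies either. Under the assumed binary scheme every pair of distinct values is equally far apart.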
