Data Preparation Basic Models - Data Preprocessing in Data Mining


where $r_i$ is the rank of observation $i$ and $\Phi$ is the cumulative normal function.

This transformation is useful for obtaining a new variable that is very likely to behave like a normally distributed one. However, it cannot be applied separately to the training and test partitions [30]. Therefore, it is only recommended when the training and test data are the same.
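As a rough illustration of a rank-based normal-scores transformation, the following Python sketch converts each value to its rank and passes a rescaled rank through the inverse cumulative normal function. The function name `rank_to_normal_scores`, the average-rank handling of ties, and the offsets 3/8 and 1/4 used to rescale the ranks are assumptions of this sketch, since the defining equation is not shown in this excerpt.

```python
from statistics import NormalDist


def rank_to_normal_scores(values):
    """Map each value to a normal score derived from its rank (hypothetical helper)."""
    m = len(values)
    order = sorted(range(m), key=lambda i: values[i])
    ranks = [0.0] * m
    pos = 0
    while pos < m:
        # Find the end of the current group of tied values.
        end = pos
        while end + 1 < m and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # 1-based average rank for the tie group
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    # Assumed rescaling: the 3/8 and 1/4 offsets keep the argument strictly
    # inside (0, 1), where the inverse cumulative normal function is defined.
    return [std_normal.inv_cdf((r - 0.375) / (m + 0.25)) for r in ranks]


scores = rank_to_normal_scores([10.0, 200.0, 35.0, 1.0, 50.0])
```

Because the ranks depend on the whole sample, the same raw value generally receives a different score when it is ranked within a different partition, which is why the transformation cannot be applied separately to training and test data.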

3.5.6 Box-Cox Transformations

A major drawback when selecting a transformation for an attribute is that we do not know in advance which one will best improve model performance. The Box-Cox transformation aims to transform a continuous variable into an approximately normal distribution. As indicated in [30], this can be achieved by mapping the values using the following set of transformations:

$$
y = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log(x), & \lambda = 0 \end{cases} \tag{3.29}
$$
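The two cases of Eq. (3.29) can be written as a small Python function. This is a minimal sketch; the name `box_cox` and the decision to raise an error for non-positive inputs are choices made here for illustration, not something prescribed by the text.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of one strictly positive value x for a given lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)           # the lambda = 0 case
    return (x ** lam - 1.0) / lam    # the lambda != 0 case


# lambda = 1 is a shifted identity, lambda = 0.5 a rescaled square root,
# and lambda = -1 a rescaled inverse.
examples = [box_cox(4.0, lam) for lam in (1.0, 0.5, 0.0, -1.0)]
```

As $\lambda$ approaches 0, the first case tends to $\log(x)$, so the two branches join continuously.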

All linear, inverse, quadratic and similar transformations are special cases of the Box-Cox transformations. Note that all values of the variable $x$ in Eq. (3.29) must be positive. If the attribute contains negative values, we must add a parameter $c$ to offset them:

$$
y = \begin{cases} \dfrac{(x + c)^{\lambda} - 1}{g\lambda}, & \lambda \neq 0 \\ \dfrac{\log(x + c)}{g}, & \lambda = 0 \end{cases} \tag{3.30}
$$
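A possible Python sketch of Eq. (3.30) follows. The function name `box_cox_shifted` is hypothetical, and so is the default of computing $g$ as the geometric mean of the shifted values $x + c$ when no $g$ is supplied; the shifted values are used because a geometric mean is not defined for negative data.

```python
import math
from statistics import geometric_mean


def box_cox_shifted(values, lam, c=0.0, g=None):
    """Apply the offset and scaled Box-Cox transform to a list of values."""
    shifted = [x + c for x in values]
    if min(shifted) <= 0:
        raise ValueError("x + c must be strictly positive for every value")
    if g is None:
        # Assumption of this sketch: scale by the geometric mean of x + c.
        g = geometric_mean(shifted)
    if lam == 0:
        return [math.log(s) / g for s in shifted]
    return [(s ** lam - 1.0) / (g * lam) for s in shifted]


# The offset c = 3 makes every value of this small sample positive.
transformed = box_cox_shifted([-2.0, 0.0, 6.0], lam=1.0, c=3.0)
```

Passing `g=1.0` and `c=0.0` reduces the function to the plain transformation of Eq. (3.29).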

The parameter $g$ is used to scale the resulting values, and it is often taken to be the geometric mean of the data. The value of $\lambda$ is found iteratively by testing different values in the range from -3.0 to 3.0 in small steps until the resulting attribute is as close as possible to the normal distribution.

In [30] a likelihood function to be maximized, depending on the value of $\lambda$, is defined based on the work of Johnson and Wichern [19]. This function is computed as:

$$
L(\lambda) = -\frac{m}{2} \ln\left[ \frac{1}{m} \sum_{j=1}^{m} (y_j - \bar{y})^2 \right] + (\lambda - 1) \sum_{j=1}^{m} \ln x_j, \tag{3.31}
$$

where $y_j$ is the transformation of the value $x_j$ using Eq. (3.29), $\bar{y}$ is the mean of the transformed values, and $m$ is the number of observations.
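The search for $\lambda$ can be sketched in Python by evaluating the likelihood of Eq. (3.31) on a grid from -3.0 to 3.0 and keeping the value that maximizes it. The function names, the step size of 0.1, and the rounding of grid points (so that 0 is hit exactly) are assumptions of this sketch rather than details given in the text. The data must be strictly positive and contain at least two distinct values.

```python
import math


def box_cox(x, lam):
    """Eq. (3.29) for one strictly positive value."""
    if abs(lam) < 1e-12:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def log_likelihood(xs, lam):
    """Eq. (3.31): likelihood of lambda for the positive sample xs."""
    m = len(xs)
    ys = [box_cox(x, lam) for x in xs]
    y_bar = sum(ys) / m
    variance = sum((y - y_bar) ** 2 for y in ys) / m
    return -m / 2 * math.log(variance) + (lam - 1) * sum(math.log(x) for x in xs)


def best_lambda(xs, low=-3.0, high=3.0, step=0.1):
    """Return the grid value of lambda with the largest likelihood."""
    n_steps = round((high - low) / step)
    # Rounding removes floating-point drift, so the grid contains exactly 0.0.
    candidates = [round(low + k * step, 10) for k in range(n_steps + 1)]
    return max(candidates, key=lambda lam: log_likelihood(xs, lam))


# A sample whose logarithms are symmetric around zero.
sample = [math.exp(z) for z in (-2.0, -1.0, 0.0, 1.0, 2.0)]
chosen = best_lambda(sample)
```

A finer step gives a more precise $\lambda$ at the cost of more likelihood evaluations; a numerical optimizer could replace the grid, but the grid mirrors the small-step search described above.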
