Data Preparation Basic Models - Data Preprocessing in Data Mining


where $\max_A$ and $\min_A$ are the original maximum and minimum attribute values, respectively.

In the literature, “normalization” usually refers to a particular case of the min-max normalization in which the final interval is $[0, 1]$, that is, $\mathit{new\_min}_A = 0$ and $\mathit{new\_max}_A = 1$. The interval $[-1, 1]$ is also typical when normalizing the data.

This type of normalization is very common in those data sets being prepared to be used with learning methods based on distances. Using a normalization to rescale all the data to the same range of values will avoid those attributes with a large $\max_A - \min_A$ difference dominating over the other ones in the distance calculation, misleading the learning process by giving more importance to the former attributes. This normalization is also known for speeding up the learning process in ANNs, helping the weights to converge faster.
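The effect on distance-based methods can be illustrated with a minimal Python sketch. The helper names (`min_max`, `euclidean`, `nearest`) and the income/age figures are made up for illustration and are not taken from the text; the rescaling simply maps the observed minimum and maximum of each attribute to the ends of the target interval.

```python
import math

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("constant attribute: max_A - min_A is zero")
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

def euclidean(a, b):
    """Euclidean distance between two equally long tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(points, i):
    """Index of the point closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    return min(others, key=lambda j: euclidean(points[i], points[j]))

# Made-up attributes: income has a far larger max - min difference than age.
income = [20000.0, 50000.0, 80000.0]
age = [25.0, 60.0, 30.0]

raw = list(zip(income, age))
scaled = list(zip(min_max(income), min_max(age)))

print(nearest(raw, 0))     # neighbour decided almost entirely by income
print(nearest(scaled, 0))  # both attributes now contribute comparably
```

With these numbers the nearest neighbour of the first example changes once both attributes share the range $[0, 1]$, because age is no longer drowned out by income. Calling `min_max(values, -1.0, 1.0)` gives the $[-1, 1]$ variant.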

An alternative, but equivalent, formulation for the min-max normalization is obtained by using a base value $\mathit{new\_min}_A$ and the desired new range $R$ in which the values will be mapped after the transformation. Some well-known software packages such as SAS or Weka [14] use this type of formulation for the min-max transformation:

$$v' = \mathit{new\_min}_A + R \left( \frac{v - \min_A}{\max_A - \min_A} \right). \tag{3.9}$$
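As a rough check of the equivalence, the sketch below implements the base-plus-range form next to an endpoint form and compares the results. It assumes that $R$ is the width of the target interval, $R = \mathit{new\_max}_A - \mathit{new\_min}_A$, which is what makes the two forms agree. The function names are hypothetical and do not correspond to the SAS or Weka APIs.

```python
import math

def min_max_base_range(values, base=0.0, new_range=1.0):
    """Base-plus-range form: v' = base + R * (v - min_A) / (max_A - min_A)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("constant attribute: max_A - min_A is zero")
    return [base + new_range * (v - lo) / (hi - lo) for v in values]

def min_max_endpoints(values, new_min=0.0, new_max=1.0):
    """Endpoint form: the target interval is given by its two ends."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("constant attribute: max_A - min_A is zero")
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

values = [2.0, 4.0, 6.0]  # made-up attribute values

# Target interval [-1, 1]: base = -1 and, by assumption, R = 1 - (-1) = 2.
by_range = min_max_base_range(values, base=-1.0, new_range=2.0)
by_ends = min_max_endpoints(values, new_min=-1.0, new_max=1.0)

# The two parameterizations should agree up to floating-point rounding.
same = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(by_range, by_ends))
print(by_range, same)
```

The base-plus-range form is convenient when the width of the target interval is the natural parameter, while the endpoint form states the interval directly.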

3.4.2 Z-score Normalization

In some cases, the min-max normalization is not useful or cannot be applied. When the minimum or maximum values of attribute A are not known, the min-max normalization is infeasible. Even when the minimum and maximum values are available, the presence of outliers can bias the min-max normalization by squeezing the remaining values into a narrow part of the range, which limits the digital precision available to represent them.
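The outlier problem can be seen in a small Python sketch. The data and the helper name `min_max01` are made up for illustration; the last value plays the role of an outlier.

```python
def min_max01(values):
    """Min-max normalization to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("constant attribute: max_A - min_A is zero")
    return [(v - lo) / (hi - lo) for v in values]

values = [1.0, 2.0, 3.0, 4.0, 1000.0]  # made-up data; 1000.0 is the outlier

with_outlier = min_max01(values)
without_outlier = min_max01(values[:4])

# With the outlier, the four ordinary values are packed close to 0.
print(with_outlier)
# Without it, the same four values spread over the whole interval.
print(without_outlier)
```

In this example the single large value stretches $\max_A - \min_A$ so much that the four ordinary values all land below 0.004, leaving almost the whole interval unused.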

If $\bar{A}$ is the mean of the values of attribute $A$ and $\sigma_A$ is the standard deviation, an original value $v$ of $A$ is normalized to $v'$ using

$$v' = \frac{v - \bar{A}}{\sigma_A}. \tag{3.10}$$

By applying this transformation the attribute values now present a mean equal to 0

and a standard deviation of 1.
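A minimal Python sketch of this transformation follows. It assumes that the mean and standard deviation are estimated from the data themselves and that the standard deviation uses a $1/n$ divisor; a $1/(n-1)$ divisor is an equally common choice and would change the results slightly. The function name `z_score` and the sample values are illustrative only.

```python
import math

def z_score(values):
    """Subtract the mean and divide by the standard deviation (1/n divisor, an assumption)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0.0:
        raise ValueError("constant attribute: standard deviation is zero")
    return [(v - mean) / std for v in values]

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up data: mean 5, std 2
z = z_score(values)

# Check the stated property: the transformed values have mean 0 and std 1.
z_mean = sum(z) / len(z)
z_std = math.sqrt(sum((x - z_mean) ** 2 for x in z) / len(z))
print(z, z_mean, z_std)
```

Unlike min-max normalization, the result is not confined to a fixed interval, and the minimum and maximum of the attribute are not needed.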

If the mean and standard deviation associated with the probability distribution are not available, it is usual to use the sample mean and standard deviation instead:

$$\bar{A} = \frac{1}{n} \sum_{i=1}^{n} v_i, \tag{3.11}$$

and
