Dealing with Missing Values - Data Preprocessing in Data Mining


the most common value among all neighbors is taken, and for numerical values the average value is used. A proximity measure between instances must therefore be defined. The Euclidean distance (a special case of the $L_p$ norm distance) is the most commonly used in the literature.
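As a minimal sketch of the two ingredients just described, the snippet below computes an $L_p$ distance between complete numeric vectors (with `p=2` giving the Euclidean distance) and aggregates neighbor values by mode or mean. The function names `lp_distance` and `aggregate_neighbors` are hypothetical and chosen only for illustration; ties between equally frequent nominal values are broken by first occurrence, which is an assumption and not something the text specifies.

```python
from collections import Counter

import numpy as np


def lp_distance(a, b, p=2.0):
    """L_p (Minkowski) distance between two complete numeric vectors.

    p=2 gives the Euclidean distance, p=1 the Manhattan distance.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


def aggregate_neighbors(values, nominal):
    """Combine the neighbors' values for one attribute.

    Nominal attributes use the most common value (ties: first seen);
    numerical attributes use the arithmetic mean.
    """
    if nominal:
        return Counter(values).most_common(1)[0][0]
    return float(np.mean(values))
```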

To estimate an MV $y_{ih}$ in the $i$th example vector $y_i$ by KNNI [6], we first select the K examples whose attribute values are most similar to $y_i$. Next, the MV is estimated as the average of the corresponding entries in the K selected example vectors. When there are other MVs in $y_i$ and/or $y_j$, their treatment requires some heuristics. The missing entry $y_{ih}$ is estimated as the average:

$$y_{ih} = \frac{\sum_{j \in I_{Kih}} y_{jh}}{|I_{Kih}|}, \tag{4.26}$$

where $I_{Kih}$ is now the index set of the KNN examples of the $i$th example; if $y_{jh}$ is missing, the $j$th example is excluded from $I_{Kih}$. Note that KNNI has no theoretical criterion for selecting the best K-value, so the K-value has to be determined empirically.
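The KNNI estimate can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the reference implementation of [6]: missing values are encoded as `NaN`, the function name `knni_impute` is hypothetical, and the heuristic chosen for other MVs is to compute the Euclidean distance only over attributes observed in both examples (without normalizing for how many attributes are shared). Neighbors whose attribute $h$ is missing are dropped after the K nearest have been selected, so fewer than K values may be averaged.

```python
import numpy as np


def knni_impute(Y, i, h, K):
    """Estimate the missing entry Y[i, h] as the mean of attribute h
    over the K nearest examples (NaN marks a missing value)."""
    Y = np.asarray(Y, dtype=float)
    observed_i = ~np.isnan(Y[i])  # Y[i, h] is NaN, so h is never compared

    scored = []
    for j in range(Y.shape[0]):
        if j == i:
            continue
        # Assumed heuristic: compare only attributes observed in both rows.
        shared = observed_i & ~np.isnan(Y[j])
        if not shared.any():
            continue
        dist = float(np.sqrt(np.sum((Y[i, shared] - Y[j, shared]) ** 2)))
        scored.append((dist, j))

    scored.sort()  # nearest first
    # Select the K nearest, then exclude those with a missing y_jh.
    neighbors = [j for _, j in scored[:K] if not np.isnan(Y[j, h])]
    if not neighbors:
        raise ValueError("no neighbor has attribute h observed")
    return float(np.mean(Y[neighbors, h]))


if __name__ == "__main__":
    nan = float("nan")
    data = [
        [1.0, 2.0, nan],
        [1.0, 2.1, 10.0],
        [1.2, 2.0, 20.0],
        [9.0, 9.0, 90.0],
    ]
    print(knni_impute(data, i=0, h=2, K=2))
```

Since K must be chosen empirically, a common approach is to hide some known entries, impute them for several candidate K-values, and keep the K with the lowest reconstruction error.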

4.5.2 Weighted Imputation with K-Nearest Neighbour (WKNNI)

The Weighted KNN method [93] selects the instances whose values are similar (in terms of distance) to those of the incomplete instance, so it can impute as KNNI does. However, the estimated value now takes into account the different distances to the neighbors, using a weighted mean or the most repeated value according to a similarity measure. The similarity measure $s_i(y_j)$ between two examples $y_i$ and $y_j$ is defined by the Euclidean distance calculated over the observed attributes in $y_i$. Next we define the measure as follows:

$$\frac{1}{s_i(y_j)} = \sum_{h \in O_i \cap O_j} (y_{ih} - y_{jh})^2, \tag{4.27}$$

where $O_i = \{h \mid \text{the } h\text{th component of } y_i \text{ is observed}\}$.

The missing entry $y_{ih}$ is estimated as the average weighted by the similarity measure:

$$y_{ih} = \frac{\sum_{j \in I_{Kih}} s_i(y_j)\, y_{jh}}{\sum_{j \in I_{Kih}} s_i(y_j)}, \tag{4.28}$$

where $I_{Kih}$ is the index set of the KNN examples of the $i$th example; if $y_{jh}$ is missing, the $j$th example is excluded from $I_{Kih}$. As with KNNI, there is no theoretical criterion for selecting the best K-value, so the K-value has to be determined empirically.
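A minimal sketch of the weighted estimate in Eqs. (4.27) and (4.28) follows. As before, `NaN` marks a missing value and the function name `wknni_impute` is hypothetical. One case the formulas leave open is a neighbor at squared distance zero, where the similarity would be infinite; the sketch assumes that such exact matches simply determine the estimate (their values are averaged), which is a choice made here for illustration only.

```python
import numpy as np


def wknni_impute(Y, i, h, K):
    """Estimate Y[i, h] as a similarity-weighted mean over the K nearest
    examples, with 1 / s_i(y_j) = squared distance over O_i ∩ O_j."""
    Y = np.asarray(Y, dtype=float)
    observed_i = ~np.isnan(Y[i])  # O_i

    scored = []
    for j in range(Y.shape[0]):
        if j == i:
            continue
        shared = observed_i & ~np.isnan(Y[j])  # O_i ∩ O_j
        if not shared.any():
            continue
        sq_dist = float(np.sum((Y[i, shared] - Y[j, shared]) ** 2))
        scored.append((sq_dist, j))

    scored.sort()  # nearest first
    # Select the K nearest, then exclude those with a missing y_jh.
    kept = [(d, j) for d, j in scored[:K] if not np.isnan(Y[j, h])]
    if not kept:
        raise ValueError("no neighbor has attribute h observed")

    # Assumption: exact matches (infinite similarity) decide the estimate.
    exact = [Y[j, h] for d, j in kept if d == 0.0]
    if exact:
        return float(np.mean(exact))

    weights = np.array([1.0 / d for d, _ in kept])  # s_i(y_j)
    values = np.array([Y[j, h] for _, j in kept])   # y_jh
    return float(np.sum(weights * values) / np.sum(weights))


if __name__ == "__main__":
    nan = float("nan")
    data = [
        [1.0, 2.0, nan],
        [1.0, 2.1, 10.0],
        [1.2, 2.0, 20.0],
        [9.0, 9.0, 90.0],
    ]
    print(wknni_impute(data, i=0, h=2, K=2))
```

On this toy data the closer neighbor (squared distance 0.01) receives four times the weight of the farther one (squared distance 0.04), so the estimate is pulled toward 10 instead of sitting at the unweighted mean of 15.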
