If the distribution of an instance differs when it is a unique record from when it is duplicated, then a Bayesian inference problem can be formulated, provided that these density functions are known. The Bayes decision rule is a common approach [9], and several variants that minimize the error or the cost are well known. These approaches typically rely on an Expectation-Maximization algorithm to estimate the required conditional probabilities [34].
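The following sketch illustrates the Bayes decision rule for a candidate record pair, assuming the conditional agreement probabilities (one per attribute, for the duplicate and non-duplicate hypotheses) have already been estimated, for instance with EM. The probability values, the prior and the threshold are illustrative assumptions, not values taken from the cited works.

```python
# Minimal sketch of a Bayes decision rule for duplicate detection.
# The conditional probabilities would normally be estimated (e.g. with EM);
# the values below are assumed for illustration.
import numpy as np

m_probs = np.array([0.95, 0.90, 0.85])  # P(attribute agrees | pair is a duplicate)
u_probs = np.array([0.10, 0.05, 0.20])  # P(attribute agrees | pair is not a duplicate)
prior_dup = 0.01                        # assumed prior probability of a duplicate pair

def is_duplicate(agreement, threshold=0.5):
    """Decide whether a binary agreement vector corresponds to a duplicate pair."""
    agreement = np.asarray(agreement)
    like_dup = np.prod(np.where(agreement == 1, m_probs, 1 - m_probs))
    like_non = np.prod(np.where(agreement == 1, u_probs, 1 - u_probs))
    posterior = like_dup * prior_dup / (like_dup * prior_dup + like_non * (1 - prior_dup))
    return posterior >= threshold, posterior

# Example: the pair agrees on the first two attributes but not on the third.
print(is_duplicate([1, 1, 0]))
```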
Supervised (and semi-supervised) approaches. Well-known ML algorithms have been used to detect duplicated record entries. For example, CART is used for this task in [3], whereas in [18] an SVM is used to merge the matching results obtained for the different attributes of the instances. Clustering techniques are also applied, using graph partitioning techniques [25, 32], to establish which instances are similar and thus suitable for removal.
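As a toy illustration of the supervised strategy, each candidate record pair can be represented by per-attribute similarity scores, and a classifier trained on labeled pairs can decide whether the pair is a duplicate (here an SVM, in the spirit of [18]; the features, scores and labels are invented for the example).

```python
# Toy sketch: an SVM classifies candidate record pairs as duplicate or not,
# using one similarity score per attribute as features (all values are assumed).
from sklearn.svm import SVC

# Each row: [name_similarity, address_similarity, phone_similarity]
X_train = [
    [0.95, 0.90, 1.00],  # labeled duplicate pair
    [0.88, 0.75, 1.00],  # labeled duplicate pair
    [0.20, 0.10, 0.00],  # labeled distinct records
    [0.35, 0.40, 0.00],  # labeled distinct records
]
y_train = [1, 1, 0, 0]   # 1 = duplicate, 0 = not a duplicate

clf = SVC(kernel="rbf").fit(X_train, y_train)

# Score a new candidate pair.
candidate = [[0.91, 0.82, 1.00]]
print(clf.predict(candidate))  # predicted class label for the candidate pair
```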
Distance-based techniques. Simple approaches, such as using the distance metrics described above to establish which instances are similar, have long been considered in the field [26]. Weighted modifications are also recurrent in the literature [5], and other approaches, such as ranking the instances most similar to a given one according to weighted distances in order to single out the duplicated tuples, have been used as well [13].
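A minimal sketch of the distance-based strategy is shown below: pairs of instances whose weighted distance falls under a threshold are flagged as potential duplicates. The attribute weights, the threshold and the data are assumptions made for illustration.

```python
# Flag instance pairs as potential duplicates when their weighted
# Euclidean distance is below a threshold (weights and threshold are assumed).
import numpy as np

def weighted_distance(a, b, weights):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(np.sum(weights * (a - b) ** 2))

data = np.array([
    [1.70, 65.0, 30.0],
    [1.70, 65.2, 30.0],  # near duplicate of the first row
    [1.55, 48.0, 22.0],
])
weights = np.array([10.0, 0.1, 0.5])  # assumed per-attribute weights
threshold = 0.5                       # assumed decision threshold

for i in range(len(data)):
    for j in range(i + 1, len(data)):
        d = weighted_distance(data[i], data[j], weights)
        if d < threshold:
            print(f"rows {i} and {j} look like duplicates (distance {d:.3f})")
```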
When the data is unlabeled, clustering algorithms are the most commonly used option. Clustering bootstrapping [33] or hierarchical graph models encode the attributes as binary "match/does not match" attributes to generate two probability distributions for the observed values (instead of explicitly modeling the distributions, as is done in the probabilistic approaches) [29].
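A rough sketch of this unsupervised idea follows: each candidate pair is encoded as a binary agreement vector and the pairs are clustered into two groups, one of which is interpreted as the duplicates. KMeans is used here only as a simple stand-in for the bootstrapping and hierarchical graph models cited above, and the agreement vectors are invented for the example.

```python
# Cluster binary "match / does not match" vectors of candidate pairs into two
# groups and treat the cluster with higher overall agreement as the duplicates.
import numpy as np
from sklearn.cluster import KMeans

# Each row is one candidate pair; 1 means the attribute values match.
agreement_vectors = np.array([
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(agreement_vectors)

# Heuristic: the cluster with the higher mean agreement is taken as the duplicates.
dup_cluster = int(agreement_vectors[labels == 1].mean() > agreement_vectors[labels == 0].mean())
print("likely duplicate pairs:", np.where(labels == dup_cluster)[0])
```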
3.3 Data Cleaning
Even after the data has been correctly integrated into a data set, it is not necessarily free of errors. The integration may result in an inordinate proportion of the data being dirty [20]. Broadly, dirty data includes missing data, wrong data and non-standard representations of the same data. If a high proportion of the data is dirty, applying a DM process will surely result in an unreliable model. Dirty data has a varying degree of impact depending on the DM algorithm, but such an impact is difficult to quantify.
Before applying any DM technique to the data, the data must be cleaned to remove or repair dirty entries. The sources of dirty data include data entry errors, data update errors, data transmission errors and even bugs in the data processing system. As a result, dirty data usually appears in two forms: missing data and wrong (noisy) data. The authors of [20] also include inconsistent instances under this categorization, but we assume that such erroneous instances have already been addressed as indicated in Sect. 3.2.2.
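As a small illustration of this categorization, the sketch below counts missing values and flags values that fall outside plausible ranges, a typical first pass before repairing or removing dirty entries. The column names and valid ranges are assumptions made for the example.

```python
# Quantify missing data and flag wrong (noisy) values per attribute.
# Column names and valid ranges are assumed for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 31, np.nan, 142, 58],           # 142 is out of the plausible range
    "salary": [30000, np.nan, 41000, 39000, -500], # -500 is an impossible value
})

# Missing data: how many values are absent in each attribute.
print(df.isna().sum())

# Wrong (noisy) data: values outside the assumed valid ranges.
valid_ranges = {"age": (0, 120), "salary": (0, 1_000_000)}
for col, (lo, hi) in valid_ranges.items():
    noisy = df.loc[(df[col] < lo) | (df[col] > hi), col]
    print(f"suspicious values in '{col}':", noisy.tolist())
```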
The presence of a high proportion of dirty data in the training data set and/or the testing data set will likely produce a less reliable model. The impact of dirty data also depends on the particular DM algorithm applied. Decision trees are known to be susceptible to noise (especially if the trees are of higher order than two) [2]. ANNs and distance-based algorithms (such as the KNN algorithm) are also known to be susceptible to noise. The use of distance measures is heavily dependent on the values of the data,
 