If the distribution of an instance differs when it is a unique record from when it is duplicated, then a Bayesian inference problem can be formulated, provided that these density functions are known. The Bayes decision rule is a common approach [9], and several variants that minimize the error or the cost are well known. These approaches typically rely on an Expectation-Maximization algorithm to estimate the required conditional probabilities [34].
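The following sketch illustrates the Bayes decision rule for a candidate record pair, assuming the conditional agreement probabilities (one per attribute, for the duplicate and non-duplicate hypotheses) have already been estimated, for instance with EM. The probability values, the prior and the threshold are illustrative assumptions, not values taken from the cited works.

```python
# Minimal sketch of a Bayes decision rule for duplicate detection.
# The conditional probabilities would normally be estimated (e.g. with EM);
# the values below are assumed for illustration.
import numpy as np

m_probs = np.array([0.95, 0.90, 0.85])  # P(attribute agrees | pair is a duplicate)
u_probs = np.array([0.10, 0.05, 0.20])  # P(attribute agrees | pair is not a duplicate)
prior_dup = 0.01                        # assumed prior probability of a duplicate pair

def is_duplicate(agreement, threshold=0.5):
    """Decide whether a binary agreement vector corresponds to a duplicate pair."""
    agreement = np.asarray(agreement)
    like_dup = np.prod(np.where(agreement == 1, m_probs, 1 - m_probs))
    like_non = np.prod(np.where(agreement == 1, u_probs, 1 - u_probs))
    posterior = like_dup * prior_dup / (like_dup * prior_dup + like_non * (1 - prior_dup))
    return posterior >= threshold, posterior

# Example: the pair agrees on the first two attributes but not on the third.
print(is_duplicate([1, 1, 0]))
```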
Supervised (and semi-supervised) approaches. Well-known ML algorithms have been used to detect duplicated record entries. For example, CART is used for this task in [3], whereas in [18] an SVM is used to merge the matching results obtained for the different attributes of the instances. Clustering techniques are also applied, using graph partitioning techniques [25, 32], to establish which instances are similar and thus suitable for removal.
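As a toy illustration of the supervised strategy, each candidate record pair can be represented by per-attribute similarity scores, and a classifier trained on labeled pairs can decide whether the pair is a duplicate (here an SVM, in the spirit of [18]; the features, scores and labels are invented for the example).

```python
# Toy sketch: an SVM classifies candidate record pairs as duplicate or not,
# using one similarity score per attribute as features (all values are assumed).
from sklearn.svm import SVC

# Each row: [name_similarity, address_similarity, phone_similarity]
X_train = [
    [0.95, 0.90, 1.00],  # labeled duplicate pair
    [0.88, 0.75, 1.00],  # labeled duplicate pair
    [0.20, 0.10, 0.00],  # labeled distinct records
    [0.35, 0.40, 0.00],  # labeled distinct records
]
y_train = [1, 1, 0, 0]   # 1 = duplicate, 0 = not a duplicate

clf = SVC(kernel="rbf").fit(X_train, y_train)

# Score a new candidate pair.
candidate = [[0.91, 0.82, 1.00]]
print(clf.predict(candidate))  # predicted class label for the candidate pair
```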
Distance-based techniques. Simple approaches, such as using the distance metrics described above to establish which instances are similar, have long been considered in the field [26]. Weighted modifications are also recurrent in the literature [5], and other approaches, such as ranking the instances most similar to a given one according to weighted distances in order to single out the duplicated tuples, have been used as well [13].
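A minimal sketch of the distance-based strategy is shown below: pairs of instances whose weighted distance falls under a threshold are flagged as potential duplicates. The attribute weights, the threshold and the data are assumptions made for illustration.

```python
# Flag instance pairs as potential duplicates when their weighted
# Euclidean distance is below a threshold (weights and threshold are assumed).
import numpy as np

def weighted_distance(a, b, weights):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(np.sum(weights * (a - b) ** 2))

data = np.array([
    [1.70, 65.0, 30.0],
    [1.70, 65.2, 30.0],  # near duplicate of the first row
    [1.55, 48.0, 22.0],
])
weights = np.array([10.0, 0.1, 0.5])  # assumed per-attribute weights
threshold = 0.5                       # assumed decision threshold

for i in range(len(data)):
    for j in range(i + 1, len(data)):
        d = weighted_distance(data[i], data[j], weights)
        if d < threshold:
            print(f"rows {i} and {j} look like duplicates (distance {d:.3f})")
```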
When the data is unlabeled, clustering algorithms are the most commonly used option. Clustering bootstrapping [33] or hierarchical graph models encode the attributes as binary "match/does not match" attributes to generate two probability distributions for the observed values (instead of explicitly modeling the distributions, as is done in the probabilistic approaches) [29].
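A rough sketch of this unsupervised idea follows: each candidate pair is encoded as a binary agreement vector and the pairs are clustered into two groups, one of which is interpreted as the duplicates. KMeans is used here only as a simple stand-in for the bootstrapping and hierarchical graph models cited above, and the agreement vectors are invented for the example.

```python
# Cluster binary "match / does not match" vectors of candidate pairs into two
# groups and treat the cluster with higher overall agreement as the duplicates.
import numpy as np
from sklearn.cluster import KMeans

# Each row is one candidate pair; 1 means the attribute values match.
agreement_vectors = np.array([
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(agreement_vectors)

# Heuristic: the cluster with the higher mean agreement is taken as the duplicates.
dup_cluster = int(agreement_vectors[labels == 1].mean() > agreement_vectors[labels == 0].mean())
print("likely duplicate pairs:", np.where(labels == dup_cluster)[0])
```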
3.3 Data Cleaning
Even after the data has been correctly integrated into a data set, it is not necessarily free of errors. The integration may result in an inordinate proportion of the data being dirty [20]. Broadly, dirty data includes missing data, wrong data and non-standard representations of the same data. If a high proportion of the data is dirty, applying a DM process will surely result in an unreliable model. Dirty data has a varying degree of impact depending on the DM algorithm, but such an impact is difficult to quantify.
Before applying any DM technique to the data, the data must be cleaned to remove or repair dirty entries. The sources of dirty data include data entry errors, data update errors, data transmission errors and even bugs in the data processing system. As a result, dirty data usually appears in two forms: missing data and wrong (noisy) data. The authors of [20] also include inconsistent instances under this categorization, but we assume that such erroneous instances have already been addressed as indicated in Sect. 3.2.2.
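As a small illustration of this categorization, the sketch below counts missing values and flags values that fall outside plausible ranges, a typical first pass before repairing or removing dirty entries. The column names and valid ranges are assumptions made for the example.

```python
# Quantify missing data and flag wrong (noisy) values per attribute.
# Column names and valid ranges are assumed for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 31, np.nan, 142, 58],           # 142 is out of the plausible range
    "salary": [30000, np.nan, 41000, 39000, -500], # -500 is an impossible value
})

# Missing data: how many values are absent in each attribute.
print(df.isna().sum())

# Wrong (noisy) data: values outside the assumed valid ranges.
valid_ranges = {"age": (0, 120), "salary": (0, 1_000_000)}
for col, (lo, hi) in valid_ranges.items():
    noisy = df.loc[(df[col] < lo) | (df[col] > hi), col]
    print(f"suspicious values in '{col}':", noisy.tolist())
```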
The presence of a high proportion of dirty data in the training data set and/or the testing data set will likely produce a less reliable model. The impact of dirty data also depends on the particular DM algorithm applied. Decision trees are known to be susceptible to noise (especially if the trees are of higher order than two) [2]. ANNs and distance-based algorithms (such as the KNN algorithm) are also known to be susceptible to noise. The use of distance measures is heavily dependent on the values of the data,
 