The first example is the Levenshtein distance [Levenshtein 1966] between two
character strings. It computes the minimal total cost of the operations needed to
transform a source string into a target string, where an operation is an insertion,
deletion, or substitution of a single character. Each operation may have a different
cost. For instance, a substitution can have a cost of 2, while insertions and
deletions cost 1 each. Users can tune these costs according to their needs.
Between the two strings dept and department, six character insertions are required
to transform dept into department. If an insertion costs 1, then the Levenshtein
distance between the two strings is 6.
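The computation described above can be sketched with the classic dynamic-programming formulation. This is a minimal illustration, not the implementation of any particular matching tool; the function name `levenshtein` and the parameter names `ins_cost`, `del_cost`, and `sub_cost` are assumptions introduced here to expose the tunable operation costs.

```python
def levenshtein(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Weighted edit distance between two strings (dynamic programming).

    The cost parameters are illustrative names, not taken from the text.
    """
    # previous[j] = cost of turning the source prefix read so far into target[:j].
    # With an empty source prefix, only insertions are possible.
    previous = [j * ins_cost for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        # Turning source[:i] into the empty string takes i deletions.
        current = [i * del_cost]
        for j, t_char in enumerate(target, start=1):
            substitute = previous[j - 1] + (0 if s_char == t_char else sub_cost)
            delete = previous[j] + del_cost
            insert = current[j - 1] + ins_cost
            current.append(min(substitute, delete, insert))
        previous = current
    return previous[-1]


# The example from the text: six insertions at cost 1 each.
print(levenshtein("dept", "department"))  # 6
```

Raising `sub_cost` to 2, as in the cost setting mentioned above, leaves this particular result unchanged because the transformation uses insertions only.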
The second similarity measure that we study is Jaro-Winkler [Winkler 1999].
This measure is also terminological and compares two character strings. It extends
the Jaro measure, which takes into account the number and order of the characters
common to both strings, by promoting higher similarity values between strings that
share a common prefix. Consequently, it includes two parameters. The first is the
length of the prefix substring, while the second is a constant scaling factor that
determines how much the score is adjusted upwards for having a common prefix.
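To make the role of the two parameters concrete, here is a minimal sketch of the measure. The function names `jaro` and `jaro_winkler` and the parameter names `prefix_length` and `scaling` are assumptions for illustration; the defaults of 4 and 0.1 are the values conventionally used with this measure and are not stated in the text.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1], based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two equal characters match only if their positions are at most `window` apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters appearing in a different order count as half-transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_length=4, scaling=0.1):
    """Jaro similarity boosted for a shared prefix of at most `prefix_length` chars."""
    base = jaro(s1, s2)
    common = 0
    for a, b in zip(s1[:prefix_length], s2[:prefix_length]):
        if a != b:
            break
        common += 1
    return base + common * scaling * (1 - base)


print(round(jaro_winkler("dept", "department"), 2))  # 0.86
```

For dept and department, the Jaro score is 0.8 and the shared prefix dep has length 3, so the adjusted score is 0.8 + 3 × 0.1 × (1 − 0.8) = 0.86.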
For further reading, see [Cohen et al. 2003; Euzenat et al. 2004]. Several
packages also document similarity measures and their parameters, for instance
SecondString¹ or SimMetrics².
4.2 Thresholds
Most similarity measures are normalized to return a value in the range [0, 1]. Among
all candidate pairs of schema elements, the ones to propose as mappings can be
selected with a threshold. That is, all candidate pairs whose similarity value
(from one measure or from a combination of several measures) is above a given
threshold become mappings. Many tools [Avesani et al. 2005; Madhavan et al. 2001;
Duchateau et al. 2008b; Drumm et al. 2007] have a threshold for selecting
mappings. In most cases, a default value for the threshold is provided with the
tool, e.g., 0.6 for the string-matching threshold in S-Match [Giunchiglia et al. 2007].
COMA++ [Aumueller et al. 2005] includes a threshold that is often combined with a
top-K strategy (i.e., the best K correspondences are returned) and a MaxDelta
strategy (i.e., the best correspondence is returned along with the closest ones,
whose scores differ from it by at most a Delta tolerance value). Conversely, APFEL
[Ehrig et al. 2005] is a machine learning-based tool that features automatic
threshold tuning.
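The three selection strategies described above can be sketched as simple filters over scored candidate pairs. This is an illustration under stated assumptions and not the code of COMA++ or any other tool: the function names, the `(pair, score)` tuple layout, and the sample scores are all hypothetical.

```python
def select_by_threshold(candidates, threshold=0.6):
    """Keep the pairs whose similarity value is above the threshold."""
    return [(pair, score) for pair, score in candidates if score > threshold]


def select_top_k(candidates, k):
    """Keep the k pairs with the highest similarity values."""
    return sorted(candidates, key=lambda item: item[1], reverse=True)[:k]


def select_max_delta(candidates, delta):
    """Keep the best pair and every pair within `delta` of the best score."""
    if not candidates:
        return []
    best = max(score for _, score in candidates)
    return [(pair, score) for pair, score in candidates if best - score <= delta]


# Made-up similarity values, for illustration only.
candidates = [
    (("dept", "department"), 0.86),
    (("dept", "division"), 0.82),
    (("dept", "depot"), 0.55),
    (("dept", "name"), 0.20),
]

above = select_by_threshold(candidates, threshold=0.6)
print([pair for pair, _ in above])
print([pair for pair, _ in select_top_k(candidates, k=1)])
print([pair for pair, _ in select_max_delta(above, delta=0.05)])
```

Because each strategy takes and returns the same list shape, they compose naturally, for example by applying the threshold first and MaxDelta on the survivors, as in the last call.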
As the value distribution differs greatly from one similarity measure to another,
a schema matcher can have one specific threshold for each similarity measure.
This is the case with MatchPlanner [Duchateau et al. 2008a]. The extended
version of this matcher enables the automatic learning of these thresholds, thanks to
¹ SecondString (May 2010): http://sourceforge.net/projects/secondstring/
² SimMetrics (May 2010): http://www.dcs.shef.ac.uk/~sam/stringmetrics.html