Tuning for Schema Matching - Schema Matching and Mapping


The first example is the Levenshtein distance [Levenshtein 1966] between two
character strings. It computes the minimal total cost of the operations needed to
transform a source string into a target string, where an operation is an insertion,
deletion, or substitution of a single character. Each operation may have a different
cost. For instance, a substitution can have a cost of 2, while insertions and
deletions each cost 1. Users can tune these costs according to their needs.

Consider the two strings dept and department: six character insertions are
required to transform dept into department. If an insertion costs 1, then the
Levenshtein distance between the two strings is 6.
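The computation described above can be sketched with the classic dynamic-programming recurrence. This is a minimal illustration rather than the implementation of any particular matcher; the function name `weighted_levenshtein` and the parameter names `ins_cost`, `del_cost`, and `sub_cost` are hypothetical and chosen only for this example.

```python
def weighted_levenshtein(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Minimal total cost of edits turning `source` into `target`.

    The three cost parameters are the tunable values discussed in the text;
    their names are assumptions made for this sketch.
    """
    # prev[j] holds the cost of turning the empty prefix of `source`
    # into the first j characters of `target` (j insertions).
    prev = [j * ins_cost for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        # Turning the first i characters of `source` into "" takes i deletions.
        curr = [i * del_cost]
        for j, t_char in enumerate(target, start=1):
            substitute = prev[j - 1] + (0 if s_char == t_char else sub_cost)
            delete = prev[j] + del_cost
            insert = curr[j - 1] + ins_cost
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


# The example from the text: six insertions at cost 1 each.
print(weighted_levenshtein("dept", "department"))
```

Raising `sub_cost` to 2, as in the cost setting mentioned earlier, leaves this particular result unchanged because only insertions are needed, but it changes the distance for pairs that would otherwise rely on substitutions.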

The second similarity measure that we study is Jaro-Winkler [Winkler 1999].
This measure is also terminological and compares two character strings. It extends
the Jaro measure, which takes into account the characters shared by both strings
and their order, by promoting higher similarity values between strings that share
a common prefix. Consequently, it includes two parameters. The first is the length
of the prefix substring, while the second is a constant scaling factor that
determines how much the score is adjusted upwards for a common prefix.
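To make the role of the two parameters concrete, the following sketch implements the usual textbook formulation of the measure. The function and parameter names (`jaro`, `jaro_winkler`, `prefix_length`, `scaling`) are hypothetical, and the defaults of 4 and 0.1 are the commonly used values, not values prescribed by this chapter.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1], based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and not too far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_length=4, scaling=0.1):
    """Jaro similarity boosted for a shared prefix.

    prefix_length: maximum number of leading characters considered.
    scaling: how strongly a common prefix raises the score; keep
             prefix_length * scaling <= 1 so the result stays within [0, 1].
    """
    base = jaro(s1, s2)
    common = 0
    for a, b in zip(s1[:prefix_length], s2[:prefix_length]):
        if a != b:
            break
        common += 1
    return base + common * scaling * (1 - base)


print(jaro("dept", "department"), jaro_winkler("dept", "department"))
```

With these defaults, dept and department share the prefix dep, so the Jaro-Winkler score is higher than the plain Jaro score for the same pair.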

For further reading, we advise you to check the following resources [Cohen
et al. 2003; Euzenat et al. 2004]. Several packages also describe similarity measures
and their parameters, for instance SecondString¹ or SimMetrics².

4.2 Thresholds

Most similarity measures are normalized to return a value in the range [0, 1]. Among
all candidate pairs of schema elements, those to propose as mappings can be selected
with a threshold: every candidate pair whose similarity value (from a single measure
or from a combination of several measures) is above the given threshold becomes a
mapping. Many tools [Avesani et al. 2005; Madhavan et al. 2001; Duchateau et al.
2008b; Drumm et al. 2007] use a threshold for selecting mappings. In most cases, the
tool provides a default value for the threshold, e.g., 0.6 for the string-matching
threshold in S-Match [Giunchiglia et al. 2007].

COMA [Aumueller et al. 2005] includes a threshold that is often combined with a
top-K strategy (i.e., the best K correspondences are returned) and a MaxDelta
strategy (i.e., the best correspondence is returned together with the closest ones,
whose scores differ from it by at most a Delta tolerance value). Conversely, APFEL
[Ehrig et al. 2005] is a machine learning-based tool that features automatic
threshold tuning.
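The three selection strategies mentioned so far (threshold, top-K, and MaxDelta) can be sketched as simple filters over scored candidates. This is an illustrative reading of the descriptions above and not the code of COMA, S-Match, or any other tool; the function names, the absolute interpretation of Delta, and the candidate scores are all assumptions made for the example.

```python
def select_by_threshold(candidates, threshold=0.6):
    """Keep candidates whose similarity is strictly above the threshold."""
    return [(name, score) for name, score in candidates if score > threshold]


def select_top_k(candidates, k):
    """Keep the k best-scoring candidates, best first."""
    return sorted(candidates, key=lambda item: item[1], reverse=True)[:k]


def select_max_delta(candidates, delta):
    """Keep the best candidate and every candidate within `delta` of its score.

    Delta is treated here as an absolute difference (an assumption).
    """
    if not candidates:
        return []
    best = max(score for _, score in candidates)
    return [(name, score) for name, score in candidates if best - score <= delta]


# Hypothetical similarity scores between a source element "dept"
# and three target elements; the numbers are made up for illustration.
candidates = [("department", 0.9), ("dept_name", 0.85), ("deposit", 0.5)]

print(select_by_threshold(candidates, threshold=0.6))
print(select_top_k(candidates, k=1))
print(select_max_delta(candidates, delta=0.1))
```

The strategies compose naturally: applying the threshold first and then top-K or MaxDelta to the survivors mirrors the combination described above.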

Because the distribution of values differs widely from one similarity measure to
another, a schema matcher can have one specific threshold for each similarity
measure. This is the case with MatchPlanner [Duchateau et al. 2008a]. The extended
version of this matcher enables the automatic learning of these thresholds, thanks to


1 SecondString (May 2010): http://sourceforge.net/projects/secondstring/

2 SimMetrics (May 2010): http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
