3.2 Finding Contextual Attribute Correspondences
A few challenges arise when designing an algorithm for finding contextual attribute
correspondences. First, one may risk overfitting the correspondences to the training
data. For example, one could find a contextual attribute correspondence stating
(R.CardInfo.expiryMonth, S.HotelCardInformation.expiryMonth, R.CardInfo.securityCode > 333),
which is clearly inappropriate, since the security code is associated with the card
number and not with its expiry. A naïve classifier may fall into this trap simply because
of a bias in the training dataset, which happens to contain more cards with higher values
of the securityCode attribute.
A second challenge involves situations in which the contextual attribute correspondences
are not specializations of (noncontextual) attribute correspondences and therefore
cannot be identified as refinements of the outcome of existing matchers.
As an example, consider our case study application. R.HotelInfo.neighborhood
provides neighborhood information for medium-size cities. For bigger cities, however,
a more accurate positioning of the hotel is preferred, and subway station names are
used as the neighborhood information. Therefore, a possible contextual attribute
correspondence may be
(R.HotelInfo.neighborhood, T.Subway.station, R.HotelInfo.city = 'Moscow').
However, this is not a refinement of an attribute correspondence
(R.HotelInfo.neighborhood, T.Subway.station).
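The triple notation used in the examples above (source attribute, target attribute, condition) can be sketched as a small data structure. This is only an illustrative sketch: the class name `ContextualCorrespondence`, the method `applies_to`, and the sample rows are assumptions made for this example and do not come from the case study.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

Row = Dict[str, Any]


@dataclass(frozen=True)
class ContextualCorrespondence:
    """Hypothetical container for a (source, target, condition) triple."""
    source: str
    target: str
    # None stands for a noncontextual correspondence (no condition).
    condition: Optional[Callable[[Row], bool]] = None

    def applies_to(self, row: Row) -> bool:
        # A noncontextual correspondence applies to every row.
        return self.condition is None or self.condition(row)


# Made-up sample rows of R.HotelInfo, for illustration only.
hotels = [
    {"city": "Moscow", "neighborhood": "Arbatskaya"},
    {"city": "Haifa", "neighborhood": "Carmel Center"},
]

moscow_only = ContextualCorrespondence(
    source="R.HotelInfo.neighborhood",
    target="T.Subway.station",
    condition=lambda row: row["city"] == "Moscow",
)

# Only rows that satisfy the condition take part in the correspondence.
matching_rows = [row for row in hotels if moscow_only.applies_to(row)]
```

Keeping the condition optional lets the same structure hold both contextual and noncontextual correspondences, which makes it easy to see when a contextual one has no noncontextual counterpart to refine.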
An approach for discovering contextual matches was introduced in Bohannon
et al. [2006]. Let $M_{i,j}$ be the score of matching attributes $S.A_i$ with $S.A_j$. Given
a condition $c$, a matcher can use the subset of the instance problem that satisfies $c$
to provide a new score $M^c_{i,j}$. The difference $M^c_{i,j} - M_{i,j}$ is the improvement of
the contextual attribute correspondence. Given the set of conditions $C$, we can create
a contextual attribute correspondence using the condition $c \in C$ that maximizes
the improvement measure. Using an improvement threshold can address the overfitting
challenge. However, thresholds are always tricky. A threshold that is set too
low introduces false positives, while a threshold that is too high may introduce false
negatives. Using machine learning techniques to tune thresholds has proven to be
effective in schema matching [Lee et al. 2007]. However, as was shown in Gal
[2006], it is impossible to set thresholds that will avoid this false negative/false
positive trade-off.
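The improvement measure and its threshold can be illustrated with a short sketch. Everything here is an assumption made for illustration: the matcher is a simple Jaccard overlap of value sets (a stand-in, not the matcher used by Bohannon et al.), and the function names, sample rows, and threshold values are hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional, Set, Tuple

Row = Dict[str, Any]
Condition = Callable[[Row], bool]


def jaccard_score(rows: List[Row], attribute: str, target_values: Set[str]) -> float:
    """Stand-in matcher: Jaccard overlap between source and target value sets."""
    source_values = {row[attribute] for row in rows}
    union = source_values | target_values
    return len(source_values & target_values) / len(union) if union else 0.0


def best_context(
    rows: List[Row],
    attribute: str,
    target_values: Set[str],
    conditions: Dict[str, Condition],
    threshold: float,
) -> Optional[Tuple[str, float]]:
    """Return the condition with the largest improvement, if it meets the threshold."""
    base = jaccard_score(rows, attribute, target_values)  # unconditioned score
    best_name, best_gain = None, 0.0
    for name, condition in conditions.items():
        subset = [row for row in rows if condition(row)]
        if not subset:
            continue  # a condition that selects nothing cannot be scored
        # Improvement: score under the condition minus the unconditioned score.
        gain = jaccard_score(subset, attribute, target_values) - base
        if gain > best_gain:
            best_name, best_gain = name, gain
    if best_name is not None and best_gain >= threshold:
        return best_name, best_gain
    return None


# Made-up sample data, for illustration only.
hotels = [
    {"city": "Moscow", "neighborhood": "Arbatskaya"},
    {"city": "Moscow", "neighborhood": "Tverskaya"},
    {"city": "Haifa", "neighborhood": "Carmel Center"},
    {"city": "Haifa", "neighborhood": "Hadar"},
]
stations = {"Arbatskaya", "Tverskaya", "Kievskaya"}
conditions = {
    "city = 'Moscow'": lambda row: row["city"] == "Moscow",
    "city = 'Haifa'": lambda row: row["city"] == "Haifa",
}

accepted = best_context(hotels, "neighborhood", stations, conditions, threshold=0.2)
rejected = best_context(hotels, "neighborhood", stations, conditions, threshold=0.5)
```

With these sample rows the unconditioned score is 2/5 and the score under the Moscow condition is 2/3, so the Moscow condition is accepted at a threshold of 0.2 and rejected at 0.5. The same candidate flipping between accepted and rejected is the trade-off described above.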
It has been proposed in Bohannon et al. [2006] that k-contexts with k > 1
will yield more trustworthy contextual attribute correspondences. The algorithm
first determines an initial list of 1-context conditions. Then, it creates and evaluates
disjunctive conditions that are generated from the original 1-context conditions.
The generation of conditions is carried out using view selection. Views are chosen