3.5 Holistic Schema Matching
While most previous match work focuses on pairwise matching, there has also
been some work on the generalized problem of matching n schemas. Typically, the
goal is to integrate or merge the n schemas such that all matching elements of the
n schemas are represented only once in the integrated (mediated) schema. N-way
matching can be implemented by a series of 2-way match steps, and some systems
such as Porsche follow this approach and incrementally merge schemas (Saleem
et al. 2008). The alternative is a holistic matching that clusters all matching schema
elements at once.
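The incremental route can be illustrated with a minimal sketch. It is not the Porsche algorithm; it only assumes that schemas are flat lists of attribute names, that name similarity is measured with Python's difflib, and that a threshold of 0.8 separates matches from non-matches. The names name_similarity and merge_incrementally are hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two lowercased attribute names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def merge_incrementally(schemas, threshold=0.8):
    """Fold n schemas into one mediated schema, one 2-way match step at a time.

    Returns a list of clusters; each cluster is a list of
    (schema_index, attribute_name) pairs treated as one mediated attribute.
    """
    mediated = []
    for idx, schema in enumerate(schemas):
        for attr in schema:
            best, best_score = None, threshold
            for cluster in mediated:
                # Skip clusters that already hold an attribute of this schema.
                if any(src == idx for src, _ in cluster):
                    continue
                # Compare against the cluster's first member as representative.
                score = name_similarity(attr, cluster[0][1])
                if score >= best_score:
                    best, best_score = cluster, score
            if best is None:
                mediated.append([(idx, attr)])  # new mediated attribute
            else:
                best.append((idx, attr))  # merge into an existing one
    return mediated


schemas = [
    ["title", "author"],
    ["Title", "authors", "price"],
    ["book title", "price"],
]
for cluster in merge_incrementally(schemas):
    print(cluster)
```

In this example, book title stays in a cluster of its own because its string similarity to title falls below the assumed threshold. The result of such an incremental merge depends on the threshold and on the order in which schemas arrive, which is one motivation for clustering all attributes at once.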
The holistic approach has primarily been considered for the use case of matching
and integrating web forms for querying deep web sources (He et al. 2004; He
and Chang 2006; Su et al. 2006). While there are typically many web forms to
integrate in a domain, the respective schemas are mostly small and simple, e.g., a list of
attributes. Hence, the main task is to group together all similar attributes. Matching
is primarily performed on the attribute names (labels) but may also use additional
information such as comments or sample values. A main observation utilized in
holistic schema matching is the correlation of attribute names, particularly that
similar names between different schemas are likely matches but similar names within
the same schema are usually mismatches. For example, the attributes first name and
last name do not match if they co-occur in the same source.
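The co-occurrence observation can be made concrete with a small sketch. It assumes each web form is given as a list of labels, uses difflib string similarity as a stand-in for whatever name matcher a real system applies, and picks 0.7 as an illustrative threshold. The function name candidate_matches is hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def candidate_matches(forms, threshold=0.7):
    """Return label pairs that look alike and never share a form."""
    normalized = [{label.strip().lower() for label in form} for form in forms]
    labels = sorted(set().union(*normalized))
    matches = []
    for a, b in combinations(labels, 2):
        if any(a in form and b in form for form in normalized):
            continue  # co-occurrence in one source: treated as a mismatch
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            matches.append((a, b))
    return matches


forms = [
    ["first name", "last name", "email"],
    ["First Name", "surname", "e-mail"],
]
print(candidate_matches(forms))
```

Here first name and last name are similar enough as strings to clear the assumed threshold, but they are rejected because they appear together in the first form. The pair email and e-mail never co-occurs and is kept as a candidate match.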
The dual correlation mining (DCM) approach of He and Chang (2006) utilizes
these positive and negative attribute correlations for matching. It also utilizes
negative correlations to derive complex relationships, e.g., that attribute name matches
the combination of both first name and last name. The HSM approach of Su et al.
(2006) extends the DCM scheme for improved accuracy and efficiency. HSM also
exploits the observation that the vocabulary of web forms in a domain tends to be
relatively small and that terms are usually unambiguous in a domain (e.g., title in a
book domain). A main idea is to first identify such shared attributes (and their
synonyms) in the input schemas and to exclude them when matching the remaining
attributes, which improves both efficiency and accuracy.
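The idea of setting shared attributes aside before matching can be sketched as follows. This is not the HSM algorithm: it ignores synonyms and simply assumes that a label counts as shared when it appears in at least a given fraction of the forms. The function name split_shared and the cutoff of 0.75 are hypothetical choices for illustration.

```python
from collections import Counter


def split_shared(forms, min_fraction=0.75):
    """Partition labels into (shared, remaining) by how many forms use them."""
    normalized = [{label.strip().lower() for label in form} for form in forms]
    counts = Counter(label for form in normalized for label in form)
    cutoff = min_fraction * len(normalized)
    # Labels used by enough forms are treated as shared domain vocabulary.
    shared = {label for label, n in counts.items() if n >= cutoff}
    # Only the leftover labels of each form go on to the matching step.
    remaining = [sorted(form - shared) for form in normalized]
    return shared, remaining


forms = [
    ["Title", "Author", "ISBN"],
    ["title", "writer", "isbn"],
    ["Title", "author", "keyword"],
    ["title", "publisher"],
]
shared, remaining = split_shared(forms)
print(shared)
print(remaining)
```

With these inputs only title is used by enough forms to count as shared, so the later matching step has fewer labels to compare.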
Das Sarma et al. (2008) propose to determine a so-called probabilistic mediated
schema from n input schemas, which is in effect a ranked list of several mediated
schemas. The approach acknowledges the inherent uncertainty of match decisions but
avoids any manual intervention by considering not only one but several reasonable
mappings. The resulting set of mediated schemas was shown to provide queries with
potentially more complete results than with a single mediated schema. The proposed
approach only considers the more frequently occurring attributes for determining
the different mediated schemas, i.e., sets of disjoint attribute clusters. Clustering is
based on the pairwise similarity between any of the remaining attributes exceeding
a certain threshold as well as the co-occurrence of attributes in the same source. The
similarity between attributes can also be considered as uncertain by some error
margin, which leads to different possibilities to cluster such attributes within different
mediated schemas. The probabilistic mapping approach is further described in the
companion book chapter (Das Sarma et al. 2011).
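How an error margin around the threshold leads to several alternative clusterings can be shown with a simplified sketch. It is not the procedure of Das Sarma et al.: it assumes precomputed pairwise similarities, omits the frequency filter, the co-occurrence check, and the assignment of probabilities, and simply enumerates every way of accepting or rejecting the uncertain pairs. The function names and the values 0.8 and 0.05 are hypothetical.

```python
from itertools import combinations


def components(attrs, edges):
    """Connected components of the graph (attrs, edges) as a set of clusters."""
    parent = {a: a for a in attrs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    clusters = {}
    for a in attrs:
        clusters.setdefault(find(a), set()).add(a)
    return frozenset(frozenset(c) for c in clusters.values())


def candidate_mediated_schemas(attrs, similarity, threshold=0.8, margin=0.05):
    """Enumerate the clusterings obtained by toggling each uncertain pair."""
    certain = [p for p, s in similarity.items() if s >= threshold + margin]
    uncertain = [p for p, s in similarity.items()
                 if threshold - margin <= s < threshold + margin]
    schemas = set()
    for k in range(len(uncertain) + 1):
        for chosen in combinations(uncertain, k):
            schemas.add(components(attrs, certain + list(chosen)))
    return schemas


attrs = ["phone", "phone no", "home phone", "address"]
similarity = {
    ("phone", "phone no"): 0.9,
    ("phone", "home phone"): 0.8,
    ("phone no", "home phone"): 0.6,
}
for schema in candidate_mediated_schemas(attrs, similarity):
    print(sorted(sorted(cluster) for cluster in schema))
```

In this example the pair phone and home phone falls inside the margin, so two candidate mediated schemas result: one that keeps home phone separate and one that merges it with the other phone attributes. The enumeration grows exponentially with the number of uncertain pairs, so the sketch is only practical for small inputs.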