Towards Large-Scale Schema and Ontology Matching - Schema Matching and Mapping


3.5 Holistic Schema Matching

While most previous match work focuses on pairwise matching, there has also
been some work on the generalized problem of matching n schemas. Typically, the
goal is to integrate or merge the n schemas such that all matching elements of the
n schemas are represented only once in the integrated (mediated) schema. N-way
matching can be implemented by a series of 2-way match steps, and some systems
such as Porsche follow this approach and incrementally merge schemas (Saleem
et al. 2008). The alternative is holistic matching, which clusters all matching
schema elements at once.
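The incremental alternative can be illustrated with a minimal sketch. It is not the Porsche algorithm: the string similarity from Python's difflib, the threshold of 0.8, and the function names name_similarity, merge_schema, and incremental_merge are all assumptions made for illustration. Each call to merge_schema is one 2-way match step between the current mediated schema and the next input schema.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (stand-in for a real matcher)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def merge_schema(mediated, schema, threshold=0.8):
    """One 2-way match step: merge `schema` into the mediated schema in place."""
    # Snapshot the clusters so attributes of `schema` are only compared with
    # attributes merged in earlier steps, not with each other.
    snapshot = [(cluster, frozenset(cluster)) for cluster in mediated]
    for attr in schema:
        best, best_score = None, threshold
        for cluster, members in snapshot:
            score = max(name_similarity(attr, m) for m in members)
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            mediated.append({attr})   # no match: the attribute starts a new cluster
        else:
            best.add(attr)            # match: the attribute joins the best cluster


def incremental_merge(schemas, threshold=0.8):
    """n-way matching as a series of 2-way steps, one per input schema."""
    mediated = []
    for schema in schemas:
        merge_schema(mediated, schema, threshold)
    return mediated


if __name__ == "__main__":
    print(incremental_merge([["title", "author"], ["Title", "authors", "price"]]))
```

The result of such a scheme generally depends on the order in which the schemas are merged, which is one motivation for clustering all elements at once.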

The holistic approach has primarily been considered for the use case of matching
and integrating web forms for querying deep web sources (He et al. 2004; He
and Chang 2006; Su et al. 2006). While there are typically many web forms to
integrate in a domain, the respective schemas are mostly small and simple, e.g., a
list of attributes. Hence, the main task is to group together all similar attributes.
Matching is primarily performed on the attribute names (labels) but may also use
additional information such as comments or sample values. A main observation
utilized in holistic schema matching is the correlation of attribute names: similar
names in different schemas are likely matches, but similar names within the same
schema are usually mismatches. For example, the attributes first name and
last name do not match if they co-occur in the same source.
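The observation above can be sketched as a greedy clustering over all attribute occurrences at once. This is an illustrative sketch, not a published algorithm: the difflib-based similarity, the threshold of 0.7, and the name holistic_clusters are assumptions. Candidate pairs are only formed across sources, and two clusters are never merged if that would place two attributes of the same source together.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (assumed label matcher)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def holistic_clusters(schemas, threshold=0.7):
    """Cluster attribute occurrences (source index, name) across all schemas."""
    nodes = [(i, name) for i, schema in enumerate(schemas) for name in schema]
    cluster_of = {node: {node} for node in nodes}

    # Similar names in *different* sources are match candidates.
    pairs = [(similarity(a[1], b[1]), a, b)
             for a, b in combinations(nodes, 2) if a[0] != b[0]]
    pairs = [p for p in pairs if p[0] >= threshold]
    pairs.sort(key=lambda p: p[0], reverse=True)  # most similar pairs first

    for _, a, b in pairs:
        ca, cb = cluster_of[a], cluster_of[b]
        if ca is cb:
            continue
        # Attributes co-occurring in one source are treated as mismatches,
        # so clusters that share a source are never merged.
        if {s for s, _ in ca} & {s for s, _ in cb}:
            continue
        merged = ca | cb
        for node in merged:
            cluster_of[node] = merged

    unique = {id(c): c for c in cluster_of.values()}
    return list(unique.values())


if __name__ == "__main__":
    forms = [["first name", "last name"],
             ["First Name", "Last Name"],
             ["last name", "email"]]
    for cluster in holistic_clusters(forms):
        print(sorted(cluster))
```

In this example, first name and last name are similar enough as strings to pass the threshold, yet they end up in separate clusters because every merge that would join them involves two attributes from one source.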

The dual correlation mining (DCM) approach of He and Chang (2006) utilizes
these positive and negative attribute correlations for matching. It also utilizes
negative correlations to derive complex relationships, e.g., that the attribute name
matches the combination of first name and last name. The HSM approach of Su et al.
(2006) extends the DCM scheme for improved accuracy and efficiency. HSM also
exploits the observations that the vocabulary of web forms in a domain tends to be
relatively small and that terms are usually unambiguous within a domain (e.g., title
in a book domain). A main idea is to first identify such shared attributes (and their
synonyms) in the input schemas and to exclude them when matching the remaining
attributes, for improved efficiency and accuracy.
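How correlations can yield a complex match may be easier to see in a small sketch. It is a strong simplification and does not reproduce the correlation measures or selection steps of DCM or HSM: the Jaccard-style co_occurrence score, the thresholds min_support, pos, and neg, and the function names are all assumptions. Attributes that usually appear together are grouped, and frequent candidates that almost never appear together are reported as alternative ways to express the same concept.

```python
from itertools import combinations


def support(forms, group):
    """Number of forms that contain every attribute of `group`."""
    return sum(1 for f in forms if group <= f)


def co_occurrence(forms, g1, g2):
    """Jaccard-style score: forms containing both groups / forms containing either."""
    both = sum(1 for f in forms if g1 <= f and g2 <= f)
    either = sum(1 for f in forms if g1 <= f or g2 <= f)
    return both / either if either else 0.0


def mine_correlations(forms, min_support=2, pos=0.6, neg=0.1):
    attrs = sorted(set().union(*forms))
    singles = [frozenset([a]) for a in attrs
               if support(forms, frozenset([a])) >= min_support]

    # Positive correlation: attributes that usually co-occur form a group.
    groups = [a | b for a, b in combinations(singles, 2)
              if co_occurrence(forms, a, b) >= pos]

    # Grouped attributes are only considered as part of their group (a simplification).
    candidates = [s for s in singles if not any(s <= g for g in groups)] + groups

    # Negative correlation: frequent candidates that (almost) never co-occur
    # are reported as matches, possibly 1:n.
    matches = [(g1, g2) for g1, g2 in combinations(candidates, 2)
               if not (g1 & g2) and co_occurrence(forms, g1, g2) <= neg]
    return groups, matches


if __name__ == "__main__":
    forms = [{"first name", "last name", "email"},
             {"first name", "last name", "email"},
             {"name", "email"},
             {"name", "email"}]
    groups, matches = mine_correlations(forms)
    print(groups)
    print(matches)
```

On these four toy forms, first name and last name form a group, and name is matched to that group because the two never occur in the same form. On real data, a negative correlation alone produces many false candidates, so a further selection step would be needed.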

Das Sarma et al. (2008) propose to determine a so-called probabilistic mediated
schema from n input schemas, which is in effect a ranked list of several mediated
schemas. The approach acknowledges the inherent uncertainty of match decisions but
avoids any manual intervention by considering not only one but several reasonable
mappings. The resulting set of mediated schemas was shown to provide queries with
potentially more complete results than a single mediated schema. The proposed
approach only considers the more frequently occurring attributes for determining
the different mediated schemas, i.e., sets of disjoint attribute clusters. Clustering is
based on the pairwise similarity between any of the remaining attributes exceeding
a certain threshold as well as on the co-occurrence of attributes in the same source.
The similarity between attributes can also be considered uncertain within some error
margin, which leads to different possibilities for clustering such attributes within
different mediated schemas. The probabilistic mapping approach is further described
in the companion book chapter (Das Sarma et al. 2011).
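The steps described above can be sketched as follows. This is a rough illustration under stated assumptions, not the algorithm of Das Sarma et al.: the parameter names theta (frequency threshold), tau (similarity threshold), and eps (error margin), the difflib-based default similarity, and the ranking rule are assumptions of the sketch. In particular, ranking each candidate by the share of sources it is consistent with (no two attributes of one source in the same cluster) is one plausible way to use co-occurrence and is labeled as such in the code.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def components(nodes, edges):
    """Connected components of an undirected graph, as a frozenset of clusters."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return frozenset(frozenset(g) for g in groups.values())


def candidate_mediated_schemas(sources, theta=0.5, tau=0.8, eps=0.1, sim=similarity):
    # 1. Keep only attributes that occur in at least a fraction theta of the sources.
    counts = {}
    for source in sources:
        for attr in set(source):
            counts[attr] = counts.get(attr, 0) + 1
    frequent = sorted(a for a, c in counts.items() if c / len(sources) >= theta)

    # 2. Split attribute pairs into certain and uncertain edges around tau.
    certain, uncertain = [], []
    for a, b in combinations(frequent, 2):
        score = sim(a, b)
        if score >= tau + eps:
            certain.append((a, b))
        elif score >= tau - eps:
            uncertain.append((a, b))

    # 3. Each subset of uncertain edges yields one clustering (exponential in the
    #    number of uncertain edges, so only suitable for small inputs).
    schemas = set()
    for k in range(len(uncertain) + 1):
        for chosen in combinations(uncertain, k):
            schemas.add(components(frequent, certain + list(chosen)))

    # 4. Assumed ranking: weight by the number of sources in which no two
    #    attributes fall into the same cluster.
    def consistent(schema, source):
        return all(len(cluster & set(source)) <= 1 for cluster in schema)

    scored = [(sum(consistent(m, s) for s in sources), m) for m in schemas]
    total = sum(c for c, _ in scored)
    ranked = [(c / total if total else 0.0, m) for c, m in scored]
    ranked.sort(key=lambda t: t[0], reverse=True)
    return ranked
```

A caller would pass the attribute lists of all sources and receive (weight, mediated schema) pairs in descending order of weight; the sim parameter allows a domain-specific similarity function to replace the string-based default.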
