Towards Large-Scale Schema and Ontology Matching - Schema Matching and Mapping

as surveyed in (Rahm and Bernstein 2001; Euzenat and Shvaiko 2007), particularly

metadata-based and instance-based matchers. Metadata-based matchers are most

common and exploit characteristics of schema or ontology elements such as their

names, comments, data types, as well as structural properties. Instance-based matchers

determine the similarity between schema elements from the similarity of their

instances; this class of matchers has recently been studied primarily for matching

large ontologies and will be discussed in more detail below.
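
As a minimal sketch of the instance-based idea, the following Python snippet scores every pair of schema elements by the Jaccard overlap of their instance values. The choice of Jaccard overlap, the function names, and the example data are assumptions made for illustration; the matchers surveyed in the cited work may use other instance similarity measures.

```python
def jaccard(a, b):
    """Jaccard overlap of two instance collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def instance_based_match(instances_1, instances_2):
    """Return {(element_1, element_2): similarity} for all element pairs.

    Each argument maps an element name to its instance values.
    """
    return {
        (e1, e2): jaccard(v1, v2)
        for e1, v1 in instances_1.items()
        for e2, v2 in instances_2.items()
    }


# Hypothetical example data: element name -> instance values.
schema_a = {"city": ["Berlin", "Paris", "Rome"], "zip": ["10115", "75001"]}
schema_b = {"town": ["Paris", "Rome", "Madrid"], "postcode": ["75001", "28001"]}

sims = instance_based_match(schema_a, schema_b)
best_for_city = max(schema_b, key=lambda e2: sims[("city", e2)])
```

Here `best_for_city` picks the element of the second schema whose instances overlap most with those of `city`, even though the element names themselves share no tokens.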

Further matching techniques exploit different kinds of auxiliary (background)

information to improve or complement metadata- and instance-based matchers.

For example, name matching for both schema elements and instance values can

be enhanced by general thesauri such as WordNet or, for improved precision,

domain-specific synonym lists and thesauri (e.g., UMLS as a biomedical reference).

Furthermore, search engines can be used to determine the similarity between names,

e.g., by using the relative search result cardinality for different pairs of names as a

similarity indicator (Gligorov et al. 2007). At the end of this section, we will briefly

discuss a further kind of match technique, the recently proposed consideration of

usage information for matching.
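
One simple way to turn search result cardinalities into a similarity indicator is sketched below in Python. It is an assumption for illustration, not necessarily the measure used in the cited work: the score is the number of results containing both names divided by the number of results containing either name. The `HITS` table and the `hit_count` helper are hypothetical stand-ins for a real search engine query, and the counts are invented.

```python
# Hypothetical hit counts standing in for real search engine results.
HITS = {
    frozenset(["car"]): 1000,
    frozenset(["automobile"]): 400,
    frozenset(["banana"]): 800,
    frozenset(["car", "automobile"]): 300,
    frozenset(["car", "banana"]): 20,
}


def hit_count(*terms):
    """Number of results containing all given terms (0 if unknown)."""
    return HITS.get(frozenset(terms), 0)


def cooccurrence_similarity(a, b):
    """Results containing both names relative to results containing either."""
    both = hit_count(a, b)
    either = hit_count(a) + hit_count(b) - both
    return both / either if either > 0 else 0.0


sim_related = cooccurrence_similarity("car", "automobile")
sim_unrelated = cooccurrence_similarity("car", "banana")
```

With these made-up counts, the related pair scores well above the unrelated pair, which is the behavior a name matcher would rely on.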

Efficiently matching large schemas and ontologies implies that every matcher

should impose minimal CPU and memory requirements. For improving linguistic

matching, many techniques for efficiently computing string similarities can

be exploited, e.g., for tokenization and indexing (Koudas et al. 2004). Structural

matching can be optimized by precollecting the predecessors and children of every

element, e.g., in database tables, instead of repeatedly traversing large graph

structures (Algergawy et al. 2009). Such an approach can also avoid the need to keep

a graph representation of the schemas in memory, which can become a bottleneck with

large schemas. The results of matchers are often stored within similarity matrices

containing a similarity value for every combination of schema elements. With large

schemas, these matrices may require millions of entries and thus several hundred

MB of memory. To avoid a memory bottleneck, a more space-efficient storage of

matcher results becomes necessary, e.g., by using hash tables (Bernstein et al. 2004).

In Sect. 3, we will discuss further performance techniques such as parallel matcher

execution.
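
The hash-table idea for storing matcher results can be sketched as follows in Python. This is an illustrative assumption rather than the design of the cited system: only similarity values above a threshold are kept in a dictionary keyed by element pair, and every other pair is implicitly treated as 0. The class name, the threshold parameter, and the element names are hypothetical.

```python
class SparseSimilarityMatrix:
    """Keeps only similarity values above a threshold, keyed by element pair."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold
        self._values = {}  # (element_1, element_2) -> similarity

    def set(self, elem_1, elem_2, sim):
        key = (elem_1, elem_2)
        if sim > self.threshold:
            self._values[key] = sim
        else:
            # Values at or below the threshold are not stored at all.
            self._values.pop(key, None)

    def get(self, elem_1, elem_2):
        # Pairs without an entry are treated as similarity 0.
        return self._values.get((elem_1, elem_2), 0.0)

    def __len__(self):
        return len(self._values)


matrix = SparseSimilarityMatrix(threshold=0.3)
matrix.set("Customer.name", "Client.fullName", 0.8)
matrix.set("Customer.name", "Client.zip", 0.1)   # below threshold, dropped
matrix.set("Customer.zip", "Client.zip", 0.9)
stored_entries = len(matrix)
```

For two schemas with n and m elements, a dense matrix always needs n × m entries, whereas this structure grows only with the number of candidate correspondences that pass the threshold.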

In the following, we first describe a general workflow-like approach to apply

multiple matchers and to combine their results. We then discuss approaches for

instance-based ontology matching and usage-based matching.

2.1 Match Workflows

Figure 1.1a shows a general workflow for automatic, pairwise schema matching as

used in many current match systems. The schemas are first imported into an

internal processing format. Further preprocessing may be applied such as analysis

of schema features or indexing name tokens to prepare for a faster computation of

name similarities. The main part is a subworkflow to execute several matchers each
