as surveyed in ( Rahm and Bernstein 2001 ; Euzenat and Shvaiko 2007 ), particularly
metadata-based and instance-based matchers. Metadata-based matchers are most
common and exploit characteristics of schema or ontology elements such as their
names, comments, data types, as well as structural properties. Instance-based match-
ers determine the similarity between schema elements from the similarity of their
instances; this class of matchers has recently been studied primarily for matching
large ontologies and will be discussed in more detail below.
Further matching techniques exploit different kinds of auxiliary (background)
information to improve or complement metadata- and instance-based matchers.
For example, name matching for both schema elements and instance values can
be enhanced by general thesauri such as WordNet or, for improved precision,
domain-specific synonym lists and thesauri (e.g., UMLS as a biomedical reference).
Furthermore, search engines can be used to determine the similarity between names,
e.g., by using the relative search result cardinality for different pairs of names as a
similarity indicator ( Gligorov et al. 2007 ). At the end of this section, we will briefly
discuss a further kind of match technique, the recently proposed consideration of
usage information for matching.
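The search-engine idea above can be illustrated with a minimal sketch. It assumes a Jaccard-style ratio over result counts, which is only one possible way to turn "relative search result cardinality" into a score and is not necessarily the measure used by Gligorov et al. The function name and all hit counts below are hypothetical; no real search engine is queried.

```python
def hit_count_similarity(hits_a: int, hits_b: int, hits_both: int) -> float:
    """Jaccard-style similarity from search result counts (an assumed measure).

    hits_a, hits_b: result counts for each name queried alone.
    hits_both: result count for a query containing both names.
    """
    union = hits_a + hits_b - hits_both
    if union <= 0:
        return 0.0
    return hits_both / union


# Hypothetical result counts: (hits for first name, hits for second, hits for both).
hypothetical_hits = {
    ("author", "writer"): (1000, 800, 600),
    ("author", "price"): (1000, 900, 50),
}

scores = {
    pair: hit_count_similarity(*counts)
    for pair, counts in hypothetical_hits.items()
}
```

With these made-up counts, the pair that co-occurs more often relative to its individual frequencies receives the higher score, which is the behavior a similarity indicator of this kind is meant to capture.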
Efficiently matching large schemas and ontologies implies that every matcher
should impose minimal CPU and memory requirements. For improving linguis-
tic matching, many techniques for efficiently computing string similarities can
be exploited, e.g., for tokenization and indexing ( Koudas et al. 2004 ). Structural
matching can be optimized by precollecting the predecessors and children of every
element, e.g., in database tables, instead of repeatedly traversing large graph
structures ( Algergawy et al. 2009 ). Such an approach can also avoid the need to keep
a graph representation of the schemas in memory, which can become a bottleneck with
large schemas. The results of matchers are often stored within similarity matrices
containing a similarity value for every combination of schema elements. With large
schemas, these matrices may require millions of entries and thus several hundred
MB of memory. To avoid a memory bottleneck, a more space-efficient storage of
matcher results becomes necessary, e.g., by using hash tables ( Bernstein et al. 2004 ).
In Sect. 3 , we will discuss further performance techniques such as parallel matcher
execution.
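As a rough illustration of space-efficient result storage, the sketch below keeps matcher results in a hash table keyed by element pairs instead of a dense matrix. The class name, the method names, and the rule of dropping values at or below a threshold are assumptions made for this example; they are not taken from Bernstein et al. (2004).

```python
class SparseSimilarityStore:
    """Stores only element pairs whose similarity exceeds a threshold."""

    def __init__(self, threshold: float = 0.0):
        self.threshold = threshold
        self._cells = {}  # (source_element, target_element) -> similarity

    def set(self, source: str, target: str, value: float) -> None:
        # Keep the entry only if it carries information; otherwise drop it.
        if value > self.threshold:
            self._cells[(source, target)] = value
        else:
            self._cells.pop((source, target), None)

    def get(self, source: str, target: str) -> float:
        # Pairs that were never stored are treated as similarity 0.0.
        return self._cells.get((source, target), 0.0)

    def __len__(self) -> int:
        return len(self._cells)


store = SparseSimilarityStore(threshold=0.3)
store.set("Customer.name", "Client.fullName", 0.8)
store.set("Customer.name", "Client.zipCode", 0.1)  # below threshold, not stored
```

Memory use then grows with the number of retained correspondences rather than with the product of the two schema sizes, at the cost of treating all low-scoring pairs as zero.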
In the following, we first describe a general workflow-like approach to apply
multiple matchers and to combine their results. We then discuss approaches for
instance-based ontology matching and usage-based matching.
2.1 Match Workflows
Figure 1.1a shows a general workflow for automatic, pairwise schema matching as
used in many current match systems. The schemas are first imported into an
internal processing format. Further preprocessing may be applied such as analysis
of schema features or indexing name tokens to prepare for a faster computation of
name similarities. The main part is a subworkflow to execute several matchers each