Towards Large-Scale Schema and Ontology Matching - Schema Matching and Mapping

instances in applications such as matching product catalogs or web directories. On

the other hand, the classification problem becomes inherently more difficult to solve

for increasing numbers of concepts. The GLUE evaluation in Doan et al. (2003)

was restricted to comparatively small match tasks with ontology sizes between 31

and 331 concepts. The SAMBO approach was evaluated for even smaller (sub-)

ontologies (10-112 concepts). The effectiveness and efficiency of machine learning

approaches for large-scale match tasks with thousands of concepts thus remain

open issues.

2.2.2 Usage-Based Matching

Two recent works propose the use of query logs to aid in schema matching. In

Elmeleegy et al. (2008), SQL query logs are analyzed to find attributes with similar

usage characteristics (e.g., within join conditions or aggregations) and occurrence

frequencies as possible match candidates for relational database schemas. The

Hamster approach (Nandi and Bernstein 2009) uses the click log for keyword

queries of an entity search engine to determine the search terms that lead to instances

of specific categories of a taxonomy (e.g., a product catalog or web directory).

Categories of different taxonomies sharing similar search queries are then considered as

match candidates. Different search terms referring to the same categories are also

potential synonyms that can be utilized not only for matching but also for other

purposes such as the improvement of user queries.
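
The click-log idea can be illustrated with a minimal sketch. It is not the Hamster implementation: the log format (pairs of search query and clicked category), the function names, the Jaccard measure, and the threshold of 0.5 are all assumptions chosen for illustration. The sketch collects the search terms leading to each category and proposes category pairs from two taxonomies whose term sets overlap strongly.

```python
from collections import defaultdict


def terms_per_category(click_log):
    """Group the search queries of a click log by the category they led to."""
    terms = defaultdict(set)
    for query, category in click_log:
        terms[category].add(query.strip().lower())
    return terms


def jaccard(a, b):
    """Jaccard similarity of two sets (an assumed, illustrative measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_candidates(log_a, log_b, threshold=0.5):
    """Return (category_a, category_b, score) for pairs sharing similar queries."""
    terms_a = terms_per_category(log_a)
    terms_b = terms_per_category(log_b)
    candidates = []
    for cat_a, queries_a in terms_a.items():
        for cat_b, queries_b in terms_b.items():
            score = jaccard(queries_a, queries_b)
            if score >= threshold:
                candidates.append((cat_a, cat_b, score))
    # Best-scoring pairs first.
    return sorted(candidates, key=lambda c: -c[2])


# Hypothetical click logs for two taxonomies: (search query, clicked category).
log_a = [
    ("laptop", "Computers/Notebooks"),
    ("notebook", "Computers/Notebooks"),
    ("ultrabook", "Computers/Notebooks"),
    ("dslr", "Cameras"),
]
log_b = [
    ("laptop", "Electronics/Laptops"),
    ("notebook", "Electronics/Laptops"),
    ("dslr", "Photo/Digital Cameras"),
    ("mirrorless", "Photo/Digital Cameras"),
]

candidates = match_candidates(log_a, log_b)
for cat_a, cat_b, score in candidates:
    print(f"{cat_a} <-> {cat_b}: {score:.2f}")
```

In this toy data, the terms "laptop", "notebook", and "ultrabook" all lead to the same category and would also surface as potential synonyms. Categories that receive few queries yield small term sets and therefore unreliable scores.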

A main problem of usage-based matching is the difficulty of obtaining suitable usage

data, which is likely even harder than obtaining instance data. For example,

the click logs for the Hamster approach are only available to the providers of search

engines. Furthermore, matching support can primarily be obtained for categories or

schema elements receiving many queries.

3 Techniques for Large-Scale Matching

In this section, we provide an overview of recent approaches for large-scale pairwise

matching that go beyond specific matchers and address entire match strategies.

In particular, we discuss approaches in four areas that we consider especially

promising and important:

- Reduction of search space for matching (early pruning of dissimilar element pairs, partition-based matching)

- Parallel matching

- Self-tuning match workflows

- Reuse of previous match results

We also discuss proposed approaches for holistically matching n schemas.
