instances in applications such as matching product catalogs or web directories. On
the other hand, the classification problem becomes inherently more difficult to solve
for increasing numbers of concepts. The GLUE evaluation in Doan et al. (2003)
was restricted to comparatively small match tasks with ontology sizes between 31
and 331 concepts. The SAMBO approach was evaluated for even smaller
(sub-)ontologies (10-112 concepts). The effectiveness and efficiency of machine
learning approaches for large-scale match tasks with thousands of concepts are
thus an open issue.
2.2.2 Usage-Based Matching
Two recent works propose the use of query logs to aid in schema matching. In
Elmeleegy et al. (2008), SQL query logs are analyzed to find attributes with similar
usage characteristics (e.g., within join conditions or aggregations) and occurrence
frequencies as possible match candidates for relational database schemas. The
Hamster approach (Nandi and Bernstein 2009) uses the click log for keyword
queries of an entity search engine to determine the search terms that lead to
instances of specific categories of a taxonomy (e.g., product catalog or web
directory). Categories of different taxonomies sharing similar search queries are
then considered as match candidates. Different search terms referring to the same
categories are also potential synonyms that can be utilized not only for matching
but also for other purposes such as the improvement of user queries.
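
The idea of treating categories with similar search queries as match candidates can be illustrated with a minimal sketch. It is not the Hamster algorithm itself: the click-log format (pairs of query and clicked category), the Jaccard overlap measure, the threshold of 0.5, and all function names are assumptions chosen for illustration only.

```python
from collections import defaultdict
from itertools import product


def queries_per_category(click_log):
    """Group normalized search queries by the category of the clicked instance."""
    queries = defaultdict(set)
    for query, category in click_log:
        queries[category].add(query.strip().lower())
    return queries


def jaccard(a, b):
    """Jaccard overlap of two query sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_candidates(log_a, log_b, threshold=0.5):
    """Return category pairs whose query sets overlap at least `threshold`."""
    qa, qb = queries_per_category(log_a), queries_per_category(log_b)
    pairs = []
    for (cat_a, set_a), (cat_b, set_b) in product(qa.items(), qb.items()):
        score = jaccard(set_a, set_b)
        if score >= threshold:
            pairs.append((cat_a, cat_b, score))
    # Best-scoring candidates first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Hypothetical click logs for two taxonomies: (search query, clicked category).
log_a = [("digital camera", "Cameras"), ("dslr", "Cameras"),
         ("laptop", "Computers")]
log_b = [("dslr", "Photo"), ("digital camera", "Photo"),
         ("notebook", "PCs"), ("laptop", "PCs")]

candidates = match_candidates(log_a, log_b)
for cat_a, cat_b, score in candidates:
    print(f"{cat_a} <-> {cat_b}: {score:.2f}")
```

In this toy data, "notebook" and "laptop" both lead to the category "PCs", which also hints at how such logs could suggest synonyms. Comparing all category pairs is quadratic and is only meant for small inputs.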
A main problem of usage-based matching is the difficulty of obtaining suitable usage
data, which is likely more severe than the difficulty of obtaining instance data. For
example, the click logs for the Hamster approach are only available to the providers
of search engines. Furthermore, matching support can primarily be obtained for
categories or schema elements receiving many queries.
3 Techniques for Large-Scale Matching
In this section, we provide an overview of recent approaches to large-scale
pairwise matching that go beyond specific matchers and address entire match
strategies. In particular, we discuss approaches in four areas that we consider
especially promising and important:
- Reduction of search space for matching (early pruning of dissimilar element
pairs, partition-based matching)
- Parallel matching
- Self-tuning match workflows
- Reuse of previous match results
We also discuss proposed approaches for holistically matching n schemas.