8.6.2 Automatic Discovery and Creation
We use the term matching to mean seeking out equivalences, at either the class or
the instance level, and the term linking to mean relating two classes or instances
by some other relationship. Both cases are commonly referred to as linking, but we
feel the distinction is worth making. In some ways, matching is the simpler case,
so we address it first.
Currently, people use top-down matching rules, such as string matching or hierarchical
correspondence, to match two instances, along with various, usually heuristically
derived, similarity measures. For example, when matching two places, we could
combine the result of string matching their two place names with some distance measure
between their latitudes and longitudes. Of course, this is made a lot easier if the two
datasets use the same set of properties, so ontology matching is also required. An
alternative is bootstrapping, a bottom-up approach that uses a small set of manually
matched or linked instances to derive a more general matching/linking rule for
similar cases. So, for example, if I have stated that mm:Mereashire owl:sameAs
dbpedia:MereaCounty, and so on for several other counties, I could derive (a) that
mm:County owl:equivalentClass dbpedia:County, and (b) that it would be
worth doing string matching on other counties in the two sets of counties.
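As a rough illustration of the place-matching example above, the following Python sketch combines a string similarity on place names with a distance measure on coordinates. The dictionary keys (`name`, `lat`, `lon`), the 25 km cut-off, and the 0.6/0.4 weighting are assumptions chosen for illustration only; in practice they would be tuned heuristically for the datasets at hand.

```python
import math
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (standard-library SequenceMatcher)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two latitude/longitude points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))


def place_score(a: dict, b: dict, max_km: float = 25.0, name_weight: float = 0.6) -> float:
    """Weighted mix of name similarity and geographic closeness, both in [0, 1].

    max_km and name_weight are illustrative assumptions, not recommended values.
    """
    dist = haversine_km(a["lat"], a["lon"], b["lat"], b["lon"])
    closeness = max(0.0, 1.0 - dist / max_km)  # 1 at the same spot, 0 beyond max_km
    return name_weight * name_similarity(a["name"], b["name"]) + (1 - name_weight) * closeness
```

A pair of places would then be treated as a candidate match when `place_score` exceeds some chosen threshold, with the threshold itself being one more heuristic to validate against manually checked examples.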
There are several tools that can assist with link discovery, for example, Silk, 12
which is a graphical tool for identifying links between one RDF dataset and
another on the Linked Data Web. Another tool that may be of use is the Linked
Data Integration Framework, 13 which works with a Silk link-mapping specification
and handles the disparities that can occur when some datasets are RDF/XML
dumps only, while others are offered via SPARQL endpoints. The LIMES 14 (Link
Discovery for Metric Spaces) tool has both a stand-alone option and a Web interface
that works with SPARQL endpoints. LIMES works by selecting a set of examples in
the target dataset and assigning each instance in the target dataset to its
nearest example. Next, the distance between each target example and all the source
instances is calculated, and any obvious mismatches (those with a large distance)
are filtered out. Then, the actual distances between the source instances and the
most likely target instances are calculated. This approach reduces the search space
and the number of similarity calculations that have to be carried out. Finally, the
source and target instances with the highest similarity are output in N-Triples format.
Another approach to link discovery is to use Bayesian belief networks, for example,
the RiMOM 15 (Risk Minimization-Based Ontology Mapping) tool; however, this has
so far been demonstrated only on a benchmark dataset.
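The example-based filtering described for LIMES above can be sketched in a few lines of Python. This is not the LIMES implementation: it assumes instances are points in a metric space (here tuples of floats under Euclidean distance), that the examples have already been chosen from the target dataset, and that pruning uses the triangle inequality as the test for an "obvious mismatch". The function name and parameters are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance; any true metric works, since the filter relies on the triangle inequality."""
    return math.dist(a, b)


def exemplar_filter_links(sources, targets, exemplars, threshold):
    """Return (links, comparisons) where links are (source, target, distance) within threshold."""
    # 1. Assign every target instance to its nearest example, remembering that distance.
    clusters = {e: [] for e in exemplars}
    for t in targets:
        nearest = min(exemplars, key=lambda ex: dist(ex, t))
        clusters[nearest].append((t, dist(nearest, t)))

    links = []
    comparisons = 0  # number of exact source-to-target distances actually computed
    for s in sources:
        for e, members in clusters.items():
            d_se = dist(s, e)  # 2. distance from the source instance to the example
            for t, d_et in members:
                # 3. Triangle inequality: d(s, t) >= |d(s, e) - d(e, t)|.
                #    If even this lower bound exceeds the threshold, skip the pair.
                if abs(d_se - d_et) > threshold:
                    continue
                # 4. Only the surviving candidates get an exact distance calculation.
                comparisons += 1
                d_st = dist(s, t)
                if d_st <= threshold:
                    links.append((s, t, d_st))
    return links, comparisons
```

With two well-separated groups of points and one example per group, the filter computes exact distances only within each group, which is the reduction in search space the text describes. Writing the resulting pairs out as N-Triples would be a separate final step.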
At the time of writing, automated link discovery was still a very immature area and
the subject of ongoing research. Most of the tools described have significant
limitations in accuracy, scale, or robustness and are for the most part still emerging
from the universities where they were developed. They are therefore not yet mature
enough to offer commercial-quality solutions to the problem of link creation.
Nevertheless, they are indicative of how the technology is developing. For specific
datasets, the advice is still to write one's own link discovery scripts based on
knowledge of the datasets, as this will produce higher accuracy than these more
general tools.
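To give a flavour of what such a dataset-specific script might look like, the sketch below reuses the county example from earlier in this section. It assumes, purely for illustration, that one dataset names counties with a "shire" suffix and the other with a "County" suffix, and that both namespaces are the placeholder URIs shown; neither assumption comes from a real dataset.

```python
import re

# Hypothetical namespaces standing in for the mm: and dbpedia: prefixes used in the text.
MM = "http://example.org/mm/"
DBP = "http://example.org/dbpedia/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def county_key(local_name: str) -> str:
    """Strip the assumed dataset-specific 'shire' / 'County' suffix and lower-case the rest."""
    return re.sub(r"(shire|county)$", "", local_name, flags=re.IGNORECASE).lower()


def same_as_triples(mm_names, dbp_names):
    """Yield one N-Triples owl:sameAs statement per pair of names sharing a normalised key."""
    by_key = {county_key(name): name for name in dbp_names}
    for name in mm_names:
        match = by_key.get(county_key(name))
        if match is not None:
            yield f"<{MM}{name}> <{OWL_SAME_AS}> <{DBP}{match}> ."


for line in same_as_triples(["Mereashire", "Otherland"], ["MereaCounty", "ElsewhereCounty"]):
    print(line)
```

The normalisation rule encodes knowledge about how each dataset names its counties, which is exactly the kind of information a general-purpose tool does not have.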