When training examples are available, the methods of choice are borrowed from
supervised machine learning. Approaches such as Hidden Markov Models [168], Maximum
Entropy Models [35] and Conditional Random Fields [45] have been applied
to the NER task. Because machine learning approaches require large training corpora,
which are scarce, semi-supervised [125,105] and unsupervised machine learning
approaches [107,41] have also been used for extracting named entities from text.
[105] gives an exhaustive overview of approaches for NER.
Keyphrase Extraction. Keyphrases or keywords are multi-word units (MWUs) that
capture the main topics of a document. The automatic detection of such MWUs has
been an important task of NLP for decades, but because the definition of what
constitutes an appropriate keyword is very ambiguous, current approaches to the
extraction of keyphrases still display low F-scores [75]. From the point of view of
the Semantic Web, the extraction of keyphrases is very similar to the task of finding
tags for a given document. Several categories of approaches have been adapted to
enable keyphrase extraction (KE), some of which originate from research areas such as
summarization and information retrieval (IR). Still, according to [74], the majority
of the approaches to KE implement combinations of statistical, rule-based or heuristic
methods [48,120] that operate mostly on document [97], keyphrase [149] or term
cohesion features [124]. [75] gives an overview of current tools for KE.
Relation Extraction. The extraction of relations from unstructured data builds upon
work for NER and KE to determine the entities between which relations might exist.
Most tools for relation extraction (RE) rely on pattern-based approaches. Some early
work on pattern extraction relied on supervised machine learning [51]. Yet, such
approaches demanded large amounts of training data, making them difficult to adapt to
new relations. The subsequent generation of approaches to RE aimed at bootstrapping
patterns based on a small number of input patterns and instances. For example, [28]
presents the Dual Iterative Pattern Relation Expansion (DIPRE) and applies it to the
detection of relations between authors and titles of books. This approach relies on a
small set of seed patterns to maximize the precision of the patterns for a given
relation while minimizing their error rate. Snowball [3] extends DIPRE by a new
approach to the generation of seed tuples. Newer approaches aim either to collect
redundancy information from the whole Web [123] or Wikipedia [158,164] in an
unsupervised manner or to use linguistic analysis [53,119] to harvest generic
patterns for relations.
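
The bootstrapping idea behind DIPRE-style approaches can be illustrated with a minimal Python sketch. It is a strong simplification and not the algorithm of [28]: a "pattern" here is only the text found between an author and a title, the author is assumed to precede the title, names are assumed to be runs of capitalized words, and no precision filtering of patterns is performed. The toy corpus, the seed pair and all function names are hypothetical.

```python
import re

# Assumption: an entity name is a run of capitalized words (toy heuristic).
NAME = r"([A-Z][\w.]*(?: [A-Z][\w.]*)*)"

def find_middles(corpus, pairs):
    """Collect the text found between the two members of each known pair."""
    middles = set()
    for author, title in pairs:
        regex = re.compile(re.escape(author) + r"(.{1,30}?)" + re.escape(title))
        for sentence in corpus:
            match = regex.search(sentence)
            if match:
                middles.add(match.group(1))
    return middles

def apply_middles(corpus, middles):
    """Use each collected middle as a pattern to find new (author, title) pairs."""
    pairs = set()
    for middle in middles:
        regex = re.compile(NAME + re.escape(middle) + NAME)
        for sentence in corpus:
            pairs.update(regex.findall(sentence))
    return pairs

def bootstrap(corpus, seeds, rounds=2):
    """Alternate between pattern induction and pattern application."""
    pairs = set(seeds)
    for _ in range(rounds):
        pairs |= apply_middles(corpus, find_middles(corpus, pairs))
    return pairs

# Hypothetical toy corpus and seed pair, for illustration only.
CORPUS = [
    "Isaac Asimov, author of Foundation, won many awards.",
    "Frank Herbert, author of Dune, was born in 1920.",
    "Frank Herbert wrote Dune in 1965.",
    "Mary Shelley wrote Frankenstein in 1818.",
]
SEEDS = {("Isaac Asimov", "Foundation")}

if __name__ == "__main__":
    for author, title in sorted(bootstrap(CORPUS, SEEDS)):
        print(author, "->", title)
```

In this toy setting, the first round learns the pattern ", author of " from the seed and finds a second pair; the second round learns " wrote " from that new pair and finds a third. Without a check on pattern precision, a single overly generic pattern would quickly introduce wrong pairs, which is the kind of error the approaches above try to keep low.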
URI Disambiguation. One important problem for the integration of NER tools for
Linked Data is the retrieval of IRIs for the entities to be manipulated. In most cases,
the URIs can be extracted from generic knowledge bases such as DBpedia [104,83]
by comparing the label found in the input data with the rdfs:label or dc:title of
the entities found in the knowledge base. Furthermore, information such as the type of
NEs can be used to filter the retrieved IRIs via a comparison of the rdfs:label of
the rdf:type of the URIs with the name of the class of the NEs. Still, in many cases (e.g.,
Leipzig, Paris), several entities might bear the same label.
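
The label comparison and type filter described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the knowledge base is a small in-memory list rather than a real endpoint, the `example.org` URIs and their types are made up, and matching is a plain case-insensitive string comparison. The names `Entity`, `KB` and `candidate_uris` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    uri: str
    label: str       # stands in for the entity's rdfs:label or dc:title
    type_label: str  # stands in for the rdfs:label of the entity's rdf:type

# Hypothetical knowledge base entries, not real DBpedia data.
KB = [
    Entity("http://example.org/resource/Paris", "Paris", "City"),
    Entity("http://example.org/resource/Paris_Texas", "Paris", "City"),
    Entity("http://example.org/resource/Paris_mythology", "Paris", "Person"),
    Entity("http://example.org/resource/Leipzig", "Leipzig", "City"),
]

def candidate_uris(mention, kb, ne_class=None):
    """Return the URIs whose label equals the mention, optionally filtered by class."""
    key = mention.strip().casefold()
    hits = [e for e in kb if e.label.casefold() == key]
    if ne_class is not None:
        # Keep only candidates whose type label matches the class assigned by NER.
        wanted = ne_class.casefold()
        hits = [e for e in hits if e.type_label.casefold() == wanted]
    return [e.uri for e in hits]

if __name__ == "__main__":
    print(candidate_uris("Paris", KB))          # all entities labelled "Paris"
    print(candidate_uris("Paris", KB, "City"))  # type filter narrows the list
```

In this made-up data, the type filter removes the person named Paris but still leaves two cities with the same label, which is the remaining ambiguity the paragraph points out.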