further split by the types of relationship they are able to provide (e.g., non-named,
taxonomic or typed). Some approaches are dedicated to relationship naming
(which is relevant, for example, when structuring lightweight semantics). In
resource description acquisition, the approaches can be classified by the type of
resource they aim to describe: texts, web pages, images, music, videos, etc.
2.4.1 Entity Recognition, Instance Population
When extracting facts from natural language text corpora, the first task for the
extraction method is to identify the relevant entities, which will take part as subjects
and objects of the triplets. These entities are typically mentioned directly in the
analyzed text. However, the list of entities (concepts or instances) related to
a certain textual resource may be wider than the list of “meaningful” lexical units
actually present in the text. For example, an article may be about the American
president without containing the lexical units “American” or “president”.
Before entity recognition, the text is usually preprocessed by tokenization and
lemmatization (or stemming), producing base word forms (lemmas or stems). Then,
the stopwords are removed. These are usually auxiliary words carrying no semantic
meaning (e.g., prepositions). Stopword lists sometimes contain meaningful words
too, in cases where they are very frequent in the corpus and thus cannot be used to
distinguish between resources.
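The preprocessing steps above can be sketched as follows. This is a minimal illustration only: the stopword list, the suffix-stripping `crude_stem` function and the `max_doc_ratio` threshold are assumptions made for the example, and a real system would use a proper stemmer or lemmatizer.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list; real lists are much larger.
BASE_STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "are"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(token):
    """Toy suffix stripper standing in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def corpus_stopwords(docs, max_doc_ratio=0.9):
    """Stems found in more than max_doc_ratio of the documents cannot
    distinguish between resources, so they are treated as stopwords."""
    doc_freq = Counter()
    for doc in docs:
        # Count each stem once per document (document frequency).
        doc_freq.update({crude_stem(t) for t in tokenize(doc)})
    return {term for term, df in doc_freq.items() if df / len(docs) > max_doc_ratio}

def preprocess(text, stopwords):
    """Tokenize, stem, then drop stopwords, in that order."""
    stems = [crude_stem(t) for t in tokenize(text)]
    return [s for s in stems if s not in stopwords]

docs = [
    "The ontology describes game engines",
    "The ontology stores rendering concepts",
    "The ontology links engines",
]
stopwords = BASE_STOPWORDS | corpus_stopwords(docs)
print(preprocess(docs[0], stopwords))
```

Here “ontology” occurs in every document of the toy corpus, so it ends up among the stopwords even though it is a meaningful word.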
After the preprocessing, the entities are selected, depending on the method.
A relatively naive approach is the selection of terms that are nouns
(identified using a dictionary like WordNet). In the case of named entity
recognition, capitalized terms are selected. Some entity recognition algorithms also
resolve the possible polysemy (multiple meanings of the same lexeme) of terms, for
example by exploiting an existing concept collocation database or thesaurus [46].
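The two naive selection strategies can be illustrated with a short sketch. The noun lexicon here is a hypothetical stand-in for a dictionary such as WordNet, and the capitalization heuristic is deliberately simplistic.

```python
import re

# Hypothetical noun lexicon standing in for a dictionary such as WordNet.
NOUN_LEXICON = {"president", "article", "ontology", "corpus"}

def noun_candidates(tokens, lexicon=NOUN_LEXICON):
    """Naive common-noun selection: keep tokens found in the noun lexicon."""
    return [t for t in tokens if t.lower() in lexicon]

def capitalized_candidates(text):
    """Naive named-entity selection: runs of consecutive capitalized words.
    Sentence-initial words match too, so false positives are expected."""
    return re.findall(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", text)

print(noun_candidates(["the", "President", "signed", "article"]))
print(capitalized_candidates("The report mentions John Smith in New York."))
```

The second call also returns the sentence-initial “The”, which shows why capitalization alone is a weak signal.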
Note that named entities require a quite different approach for identification.
They comprise personal and company names, abbreviations, geographical locations, etc.
They are usually not present in dictionaries. Some approaches to named entity
recognition rely on building extensive datasets of such names, which do not
necessarily have to be created manually. As an example, we can take the gazetteer lists
constructed by machine learning in the work of Kozareva [35]. A particular problem with
named entity recognition is meaning disambiguation (caused by homonyms), which is
addressed by approaches working with term contexts [31].
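As a rough illustration of gazetteer lookup combined with context-based disambiguation, consider the sketch below. It is not the method of the cited works: the gazetteer entries, the cue words and the simple overlap score are all assumptions made for the example.

```python
# Hypothetical gazetteer: each name maps to candidate senses, each with
# an entity type and a set of context cue words (entries are illustrative).
GAZETTEER = {
    "paris": [
        {"type": "location", "cues": {"city", "france", "capital"}},
        {"type": "person", "cues": {"singer", "actress", "born"}},
    ],
    "amazon": [
        {"type": "company", "cues": {"retailer", "shares", "online"}},
        {"type": "location", "cues": {"river", "rainforest", "brazil"}},
    ],
}

def disambiguate(name, context_tokens, gazetteer=GAZETTEER):
    """Return the type of the sense whose cue words overlap most with the
    context, or None if the name is not in the gazetteer. On a tie, the
    first listed sense wins."""
    senses = gazetteer.get(name.lower())
    if not senses:
        return None
    context = {t.lower() for t in context_tokens}
    best = max(senses, key=lambda s: len(s["cues"] & context))
    return best["type"]

print(disambiguate("Amazon", "the river flows through Brazil".split()))
print(disambiguate("Amazon", ["shares", "rose"]))
```

The same surface form resolves to a different entity type depending on the surrounding terms, which is the core idea behind context-based disambiguation of homonyms.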
Apart from “document-driven” approaches, which are used for annotating documents,
the “ontology-driven” approaches focus on populating domain (or general)
ontologies by expanding their hierarchical structure. An example of such an approach
is OntoSyphon (created by McDowell and Cafarella), whose purpose is to
find instances (or subclasses) for a given ontology class [46]. Their approach,
designed to work independently of the domain it is used in, takes a domain ontology
and a text corpus (ultimately, the whole Web) as input and outputs a ranked list of
 