State-of-the-Art: Semantics Acquisition and Crowdsourcing - Semantic Acquisition Games: Harnessing Manpower for Creating Semantics

further split by the types of relationship they are able to provide (e.g., non-named,
taxonomic or typed). Some approaches are dedicated to relationship naming
(which is relevant, for example, when structuring lightweight semantics). In
resource description acquisition, the approaches can be classified by the type of
resource they aim to describe: texts, web pages, images, music, videos, etc.

2.4.1 Entity Recognition, Instance Population

When extracting facts from natural language text corpora, the first task of the
extraction method is to identify the relevant entities, which will act as subjects and
objects of the triplets. These entities are directly mentioned in the analyzed text.
However, the list of entities (concepts or instances) related to a certain textual
resource may be wider than the list of “meaningful” lexical units actually present
in the text. For example, an article may be about the American president without
the lexical units “American” and “president” ever appearing in it.

Before entity recognition, the text is usually preprocessed by tokenization and
lemmatization (or stemming), which produces base word forms (lemmas or stems).
Then the stopwords are removed. These are usually function words carrying no
semantic meaning of their own (e.g., prepositions). Stopword lists sometimes contain
meaningful words too, when these are so frequent in the corpus that they cannot be
used to distinguish between resources.
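
The preprocessing steps described above can be sketched in a few lines of Python.
This is a minimal illustration under stated assumptions, not the pipeline of any
cited work: the stopword list, the suffix-stripping rule that stands in for a real
stemmer or lemmatizer, and the names `preprocess`, `corpus_stopwords` and
`max_doc_ratio` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; real lists are much larger.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "and", "or", "is", "are"}

# Crude suffix stripping stands in for a real stemmer or lemmatizer (assumption).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def terms(text):
    """Tokenize and stem, without removing anything yet."""
    return [naive_stem(t) for t in tokenize(text)]


def preprocess(text, extra_stopwords=frozenset()):
    """Tokenize, stem, then drop general and corpus-specific stopwords."""
    stop = STOPWORDS | set(extra_stopwords)
    return [t for t in terms(text) if t not in stop]


def corpus_stopwords(documents, max_doc_ratio=0.8):
    """Return terms that occur in more than max_doc_ratio of the documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(terms(doc)))  # count each term once per document
    limit = max_doc_ratio * len(documents)
    return {term for term, count in doc_freq.items() if count > limit}
```

With these definitions, `preprocess("The players are labeling images in the game")`
yields `["player", "label", "imag", "game"]`. The `corpus_stopwords` helper
illustrates the second kind of stopword: a term such as "game" carries meaning,
but if it appears in most documents of a corpus it no longer helps to tell them
apart and can be passed to `preprocess` as an extra stopword.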

After the preprocessing, the entities are selected, depending on the method.
A relatively naive approach is to select the terms that are nouns (identified
using a dictionary such as WordNet). In the case of named entity recognition,
capitalized terms are selected. Some entity recognition algorithms also resolve
the possible polysemy (multiple meanings of the same lexeme) of terms, for example
by exploiting an existing concept collocation database or thesaurus [46].
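
The two naive selection strategies mentioned above might look roughly as follows.
This is only a sketch: `NOUN_LEXICON` is a hypothetical hand-made set standing in
for a lookup in a dictionary such as WordNet, and the rule of skipping the first
word of each sentence is an added assumption to avoid treating every sentence
opener as a name.

```python
import re

# Hypothetical mini-dictionary standing in for a lexical resource such as WordNet.
NOUN_LEXICON = {"president", "article", "speech", "city", "game"}


def noun_candidates(tokens, lexicon=NOUN_LEXICON):
    """Keep tokens that the dictionary lists as nouns, preserving their order."""
    return [t for t in tokens if t.lower() in lexicon]


def capitalized_candidates(text):
    """Collect runs of capitalized words that do not start a sentence."""
    candidates = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"[A-Za-z]+", sentence)
        run = []
        # Skip the first word: it is capitalized whether or not it is a name.
        for word in words[1:]:
            if word[0].isupper():
                run.append(word)
            elif run:
                candidates.append(" ".join(run))
                run = []
        if run:
            candidates.append(" ".join(run))
    return candidates
```

For the text "The president visited New York on Monday. Later he met Angela Merkel."
the capitalization heuristic returns `["New York", "Monday", "Angela Merkel"]`.
The spurious "Monday" shows why such heuristics are usually combined with name
lists or context-based filtering.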

Note that named entities require a quite different approach to identification.
They comprise personal and company names, abbreviations, geographical locations,
etc., and are usually not present in dictionaries. Some approaches to named entity
recognition rely on building extensive datasets of such names, which do not
necessarily have to be created manually. An example is the gazetteer lists
constructed by machine learning in the work of Kozareva [35]. A particular problem
in named entity recognition is meaning disambiguation (made necessary by homonyms),
which is addressed by approaches working with term contexts [31].
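
To make the idea of gazetteer lookup with context-based disambiguation more
concrete, here is a small sketch. It does not reproduce the methods of [35] or
[31]; it swaps in a simple word-overlap score. The `GAZETTEER` content, the entity
identifiers, and the `window` parameter are all hypothetical.

```python
# Hypothetical gazetteer: surface form -> candidate meanings, each with words
# that typically occur near that meaning.
GAZETTEER = {
    "paris": [
        {"id": "Paris_(city)", "context": {"france", "city", "seine"}},
        {"id": "Paris_(mythology)", "context": {"troy", "helen", "trojan"}},
    ],
    "amazon": [
        {"id": "Amazon_(company)", "context": {"shop", "cloud", "company"}},
        {"id": "Amazon_(river)", "context": {"river", "brazil", "rainforest"}},
    ],
}


def disambiguate(tokens, index, window=5):
    """Pick the candidate sharing the most words with the surrounding tokens."""
    candidates = GAZETTEER.get(tokens[index].lower())
    if not candidates:
        return None  # the term is not a known name
    start = max(0, index - window)
    neighbours = tokens[start:index] + tokens[index + 1:index + 1 + window]
    context = {t.lower() for t in neighbours}
    # max() keeps the first candidate on ties, so list order acts as a prior.
    best = max(candidates, key=lambda c: len(c["context"] & context))
    return best["id"]


def tag_entities(tokens, window=5):
    """Return (position, token, entity id) for every token found in the gazetteer."""
    return [
        (i, tok, disambiguate(tokens, i, window))
        for i, tok in enumerate(tokens)
        if tok.lower() in GAZETTEER
    ]
```

For the tokens of "Helen left with Paris for Troy", the mention "Paris" is resolved
to `Paris_(mythology)` because "Helen" and "Troy" appear in its context window.
When no context word matches, the first candidate in the list wins, which is a
design choice of this sketch rather than a property of the cited approaches.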

Apart from “document-driven” approaches, which are used for annotating documents,
“ontology-driven” approaches focus on populating domain (or general) ontologies
by expanding their hierarchical structure. An example of such an approach is
OntoSyphon (created by McDowell and Cafarella), whose purpose is to find instances
(or subclasses) for a given ontology class [46]. Their approach, designed to work
independently of the domain it is used in, takes a domain ontology and a text
corpus (ultimately, the whole Web) as input and outputs a ranked list of
