Spatial and Temporal Information Retrieval in Textual Corpora - Geographical Information Retrieval in Textual Corpora


Before information can be annotated as described in the previous sections, it first has to

be identified in the texts and interpreted during the extraction phase. This phase marks

up the text iteratively so that indexes can be built later on.

2.3.5. Linguistic processing

The processes of information extraction target all the information within a textual

document (full-text indexing) or only specific elements of information (targeted

indexing). In the first case, the extracted terms are weighted using statistical

approaches, and all the terms within a document are processed [MAN 08b,

Chapters 2-4]. In the second case, by contrast, the extraction of the information is

based on pre-defined linguistic rules in order to target only particular pieces of

information that are generally unweighted [GAI 03, ABO 03].
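
The contrast between the two cases can be sketched as follows. This is a minimal illustration, not the method of the cited works: the TF-IDF weighting stands in for "statistical approaches" in general, and the sample documents, the `DATE_RULE` pattern and all function names are assumptions introduced here.

```python
import math
import re
from collections import Counter

# Hypothetical sample corpus, used only for illustration.
DOCS = [
    "The torrent of Pau flooded in January 2010.",
    "Hikers left Pau at the beginning of January 2010.",
    "The valley road was closed for repairs.",
]


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs as terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(docs):
    """Full-text indexing: every term of every document gets a weight."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


# Targeted indexing: one hypothetical predefined rule (month name + year).
DATE_RULE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{4}\b"
)


def targeted_dates(text):
    """Return only the text spans matched by the rule, without weights."""
    return [m.group(0) for m in DATE_RULE.finditer(text)]


if __name__ == "__main__":
    print(tf_idf(DOCS)[0])
    print(targeted_dates(DOCS[0]))
```

In this sketch a term that occurs in every document (here "the") receives a weight of zero, whereas the targeted rule ignores term frequencies altogether and returns only the matched expressions.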

A spatial and temporal textual content process flow is generally composed of the

three following modules:

- Recognition of NEs. Lexical analysis allows the conversion of a stream of

characters into a stream of words or terms [BAE 99]: tokenization splits the document

into words following a predefined list of delimiters. These words are then transformed

into lexemes during a lexical and morphological analysis. Some less discriminatory

and common words (“at”, “to”, for example) can be eliminated through a so-called

list of stop words (or stoplist). Finally, an NE detector tags the candidate entities.

- Validation of NEs. Knowledge repositories are used for the validation of the

candidate NEs.

- Interpretation of NEs. Syntactic analysis, based on the rules of grammar,

detects relations between lexemes. A semantic analysis is then applied to the resulting

groups of lexemes (“south of Pau”, “torrent of Pau”, “beginning of January 2010”,

for example) so that suitable interpretation rules can be applied. Knowledge

repositories are used to disambiguate the entities and then associate representations

with them.
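
The three modules above can be sketched as a small pipeline. This is a deliberately simplified illustration under several assumptions: the delimiter list, the stop list, the capitalization heuristic used as NE detector, the one-entry gazetteer (with approximate coordinates for Pau) and the single "direction of place" rule are all hypothetical, and morphological analysis into lexemes is omitted. Stop words are only used to discard candidates here, so that token positions stay available to the interpretation step.

```python
import re

# Hypothetical resources, introduced only for this sketch.
DELIMITERS = r"[ \t\n.,;:!?()\"']+"
STOP_WORDS = {"at", "to", "the", "a"}
GAZETTEER = {"Pau": (43.30, -0.37)}  # approximate latitude, longitude
DIRECTIONS = {"north", "south", "east", "west"}


def tokenize(text):
    """Split the text into words using the predefined delimiters."""
    return [t for t in re.split(DELIMITERS, text) if t]


def recognize(tokens):
    """Recognition: tag capitalized non-stop-words as candidate NEs."""
    return [
        i
        for i, t in enumerate(tokens)
        if t[0].isupper() and t.lower() not in STOP_WORDS
    ]


def validate(tokens, candidates):
    """Validation: keep only candidates known to the knowledge repository."""
    return [i for i in candidates if tokens[i] in GAZETTEER]


def interpret(tokens, validated):
    """Interpretation: detect a '<direction> of <NE>' relation and attach
    a representation (here, the gazetteer coordinates) to the entity."""
    results = []
    for i in validated:
        relation = None
        if (
            i >= 2
            and tokens[i - 1].lower() == "of"
            and tokens[i - 2].lower() in DIRECTIONS
        ):
            relation = tokens[i - 2].lower()
        results.append(
            {"entity": tokens[i], "relation": relation, "anchor": GAZETTEER[tokens[i]]}
        )
    return results


def process(text):
    """Run recognition, validation and interpretation in sequence."""
    tokens = tokenize(text)
    candidates = recognize(tokens)
    validated = validate(tokens, candidates)
    return interpret(tokens, validated)


if __name__ == "__main__":
    print(process("We walked to a village south of Pau."))
```

In the example sentence, "We" is tagged as a candidate by the naive detector and then rejected during validation because it is absent from the gazetteer, which is the role the text assigns to knowledge repositories.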

NLP platforms such as GATE [CUN 02], LINGUASTREAM [BIL 06b],

MIRACLE [LIU 06], SXPIPE [SAG 08] and UIMA [FER 04] are dedicated to

linguistic processing, and are therefore well adapted to these particular processes. A

comparative study is presented in [SAG 08].

Note that we address these processes of extracting the information contained

within textual documents in the particular framework of IR.

Spatial and temporal indexes must therefore be established, which will serve as a

support for the IR.
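
As a rough illustration of what such indexes could look like, the sketch below builds two inverted indexes from already annotated documents and intersects them at query time. The annotation format (`("place", ...)` and `("time", ...)` pairs), the document identifiers and the function names are assumptions made for this example, not structures described in the text.

```python
from collections import defaultdict


def build_indexes(annotations):
    """Build one inverted index per dimension: value -> set of document ids."""
    spatial = defaultdict(set)
    temporal = defaultdict(set)
    for doc_id, entities in annotations.items():
        for kind, value in entities:
            if kind == "place":
                spatial[value].add(doc_id)
            elif kind == "time":
                temporal[value].add(doc_id)
    return spatial, temporal


def search(spatial, temporal, place=None, time=None):
    """Return the documents matching every criterion that was supplied."""
    results = None
    for index, key in ((spatial, place), (temporal, time)):
        if key is not None:
            hits = index.get(key, set())
            results = hits if results is None else results & hits
    return results or set()


if __name__ == "__main__":
    # Hypothetical output of the extraction phase for two documents.
    annotations = {
        "d1": [("place", "Pau"), ("time", "2010-01")],
        "d2": [("place", "Pau")],
    }
    spatial, temporal = build_indexes(annotations)
    print(search(spatial, temporal, place="Pau", time="2010-01"))
```

A real spatial index would typically store geometries rather than place names, and a temporal index would store intervals; exact-match keys are used here only to keep the sketch short.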
