Geoscience Reference
To annotate the information as we have seen in the previous sections, it first has to
be identified in the texts and interpreted during the extraction phase. This phase allows
a markup of the text in an iterative way in order to constitute indexes later on.
2.3.5. Linguistic processing
Information extraction processes target either all the information within a textual
document (full-text indexing) or only specific elements of information (targeted
indexing). In the first case, the extracted terms are weighted via so-called statistical
approaches: all the terms within a document are processed [MAN 08b,
Chapters 2-4]. In the second case, by contrast, the extraction of the information is
based on predefined linguistic rules in order to target only particular pieces of
information, which are generally unweighted [GAI 03, ABO 03].
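The contrast between the two approaches can be sketched as follows. This is a minimal illustration, not the method of the cited works: TF-IDF is assumed here as one common statistical weighting, and the names `tf_idf`, `DATE_RULE` and `targeted_dates` are hypothetical, as is the single date pattern standing in for a set of linguistic rules.

```python
import math
import re
from collections import Counter

def tf_idf(docs):
    """Full-text indexing: weight every term of every document (TF-IDF assumed)."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights

# Targeted indexing: a hypothetical predefined rule that matches only
# "<month> <year>" expressions; the matches carry no weight.
DATE_RULE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}\b"
)

def targeted_dates(text):
    """Return only the pieces of text matched by the predefined rule."""
    return DATE_RULE.findall(text)

docs = ["the torrent of Pau", "the beginning of January 2010"]
weights = tf_idf(docs)
dates = targeted_dates(docs[1])
```

In this toy corpus, a term that appears in every document (such as "the") receives a weight of zero, whereas the rule-based extraction returns only the date expression and ignores everything else.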
A processing chain for spatial and temporal textual content is generally composed of
the following three modules:
- Recognition of NEs. Lexical analysis allows the conversion of a stream of
characters into a stream of words or terms [BAE 99]: tokenization splits the document
into words following a predefined list of delimiters. These words are then transformed
into lexemes during a lexical and morphological analysis. Some common, less
discriminating words (“at”, “to”, for example) can be eliminated through a so-called
list of stop words (or stoplist). Finally, an NE detector tags the candidate entities.
- Validation of NEs. Knowledge repositories are used for the validation of the
candidate NEs.
- Interpretation of NEs. Syntactical analysis, based on the rules of grammar,
detects relations between lexemes. Finally, a semantic analysis applied to such
groups of lexemes (“south of Pau”, “torrent of Pau”, “beginning of January 2010”,
for example) makes it possible to apply suitable interpretation rules. Knowledge
repositories are used to disambiguate and then associate representations with the
entities.
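The three modules above can be sketched as a small pipeline. This is only an illustration under strong assumptions: the stop list, the gazetteer (standing in for a knowledge repository), the capitalization heuristic used as NE detector and the single "relation of entity" rule are all hypothetical, and the lexical and morphological analysis step is omitted.

```python
import re

# Hypothetical toy resources, for illustration only.
STOP_WORDS = {"at", "to", "the", "a", "of", "in"}
GAZETTEER = {"Pau": "city", "Lourdes": "city"}  # stands in for a knowledge repository
SPATIAL_RELATIONS = {"north", "south", "east", "west"}

def tokenize(text):
    """Split the text into words following a predefined list of delimiters."""
    return [t for t in re.split(r"[ \t\n.,;:!?()]+", text) if t]

def remove_stop_words(tokens):
    """Eliminate common words listed in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def detect_candidates(tokens):
    """Recognition: tag capitalized tokens as candidate NEs (naive heuristic)."""
    return [t for t in tokens if t[0].isupper()]

def validate(candidates):
    """Validation: keep only the candidates known to the gazetteer."""
    return [c for c in candidates if c in GAZETTEER]

def interpret(tokens, entities):
    """Interpretation: apply the rule '<relation> of <entity>' to the full token stream."""
    found = []
    for i in range(len(tokens) - 2):
        relation, link, target = tokens[i].lower(), tokens[i + 1].lower(), tokens[i + 2]
        if relation in SPATIAL_RELATIONS and link == "of" and target in entities:
            found.append((relation, target))
    return found

text = "We walked to a village south of Pau."
tokens = tokenize(text)
candidates = detect_candidates(remove_stop_words(tokens))  # ["We", "Pau"]
entities = validate(candidates)                            # ["Pau"]
relations = interpret(tokens, set(entities))               # [("south", "Pau")]
```

The interpretation step deliberately works on the unfiltered token stream, since the linking word "of" would otherwise be removed by the stop list. The candidate "We" shows why validation is needed: it passes the naive detector but is rejected by the gazetteer.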
NLP platforms such as GATE [CUN 02], LINGUASTREAM [BIL 06b],
MIRACLE [LIU 06], SXPIPE [SAG 08] and UIMA [FER 04] are dedicated to
linguistic processing, and are therefore well adapted to these particular processes. A
comparative study is presented in [SAG 08].
Note that we address these processes of extracting the information contained within
textual documents in the particular framework of IR. Spatial and temporal indexes
must therefore be established to serve as a support for the IR.
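As a rough illustration of what such indexes might look like, the sketch below builds two inverted indexes from already extracted annotations. The annotation format `(doc_id, kind, value)`, the function name `build_indexes` and the sample values are assumptions made for this example; real spatial and temporal indexes would store interpreted representations (geometries, time intervals) rather than plain strings.

```python
from collections import defaultdict

def build_indexes(annotations):
    """Map each spatial or temporal value to the set of documents mentioning it."""
    indexes = {"spatial": defaultdict(set), "temporal": defaultdict(set)}
    for doc_id, kind, value in annotations:
        indexes[kind][value].add(doc_id)
    return indexes

# Hypothetical annotations produced by the extraction phase.
annotations = [
    ("d1", "spatial", "Pau"),
    ("d2", "spatial", "Pau"),
    ("d2", "temporal", "2010-01"),
]
indexes = build_indexes(annotations)
docs_about_pau = indexes["spatial"]["Pau"]
```

A query on a place or a period then reduces to a lookup in the corresponding index.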