Spatial and Temporal Information Retrieval in Textual Corpora
2.1. Introduction
Information retrieval (IR) systems intended for the wider public do not offer
specific processing of spatial or temporal information contained within the corpora or
search criteria. Nevertheless, in numerous cases, these pieces of information could
play an important role in the calculation of the relevance of a document [TEI 11].
Consideration of the semantics of spatial and temporal expressions could enable a
finer processing of expressions such as “musical instruments in the vicinity of Laruns
at the beginning of the 19th century”. Most of the time in IR, however, documents
are processed from the viewpoint of their textual content as mere “bags” of
independent words. Moreover, beyond the textual content, document-specific
information, such as the structuring into sections and paragraphs, could also be
taken into consideration.
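
To make the “bag of words” limitation concrete, the following minimal Python sketch (an illustration only, not the processing used in the systems discussed here) reduces a text to unordered word counts. The function name `bag_of_words`, the simple tokenization rule, and the place name Pau in the second example are assumptions introduced purely for demonstration.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Reduce a text to unordered token counts (hypothetical helper).

    Assumption: tokens are runs of ASCII letters or digits after lowercasing;
    real IR systems typically add stemming, stop-word removal, etc.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


query = ("musical instruments in the vicinity of Laruns "
         "at the beginning of the 19th century")
bag = bag_of_words(query)

# The spatial and temporal expressions survive only as isolated words:
# "vicinity", "laruns", "beginning", "19th" and "century" are counted
# independently, with no link between them.
print(bag["laruns"], bag["vicinity"], bag["19th"])

# Two texts expressing opposite spatial relations yield the same bag,
# because word order is discarded.
same = bag_of_words("Laruns, south of Pau") == bag_of_words("Pau, south of Laruns")
print(same)
```

Since the two last texts produce identical bags, a model based only on such counts cannot distinguish the spatial relations they express, which is the motivation for a dedicated interpretation of spatial and temporal expressions.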
Our context, that of textual corpora with “territorial” denotations, is specific.
On the one hand, spatial and temporal references are frequent and, on the other hand,
the document repositories are sufficiently stable and homogeneous to warrant specific
back-office processing. Our work thus differs from classic IR in that it aims at a
thorough processing of content: specific process flows target the recognition
followed by the interpretation of spatial and temporal information. From a structural
point of view, the documents at our disposal are acquired from basic digitization
efforts integrating only character and paragraph recognition. They are in text format
and are generally composed of several tens or hundreds of pages. For this reason,
we believe that the entry point into the corpus cannot be the document itself, and we
propose working with paragraphs as document units.
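
The choice of paragraphs as document units can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ParagraphUnit` and `split_into_paragraphs` are hypothetical, and paragraphs are assumed to be separated by one or more blank lines in the digitized text.

```python
import re
from dataclasses import dataclass


@dataclass
class ParagraphUnit:
    """One retrievable unit: a paragraph and its position in its source document."""
    doc_id: str   # identifier of the source document (hypothetical naming)
    index: int    # 0-based position of the paragraph within the document
    text: str     # paragraph text with internal line breaks collapsed


def split_into_paragraphs(doc_id: str, raw_text: str) -> list[ParagraphUnit]:
    """Split a plain-text document into paragraph units.

    Assumption: a paragraph boundary is a blank line (possibly containing
    whitespace); the actual output of paragraph recognition may differ.
    """
    units: list[ParagraphUnit] = []
    for block in re.split(r"\n\s*\n", raw_text):
        text = " ".join(block.split())  # collapse line wraps and extra spaces
        if text:                        # skip empty blocks
            units.append(ParagraphUnit(doc_id, len(units), text))
    return units


sample = "First paragraph,\nwrapped over two lines.\n\n\nSecond paragraph.\n"
units = split_into_paragraphs("doc-001", sample)
for unit in units:
    print(unit.doc_id, unit.index, unit.text)
```

Each unit keeps a reference to its source document, so that a paragraph returned by a search can still be presented in the context of the work it comes from.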
As recommended by Clough et al. [CLO 06], we deal independently with spatial
and temporal dimensions: this way, the single-dimension IR and the operation for
