1
Using Layout Data for the Analysis of
Scientific Literature
Brigitte Mathiak, Andreas Kupfer, and Silke Eckstein
Technische Universität Braunschweig, Germany
Institute of Information Systems
mathiak@gmail.com, {kupfer, eckstein}@ifis.cs.tu-bs.de
Summary. It is said that the world's knowledge is on the Internet. Scientific knowledge
is in books, journals and conference proceedings. Yet both repositories are too large
to skim through manually. We need clever algorithms to cope with this huge amount of
information. To filter, sort and ultimately mine the available information, it is vital to
use every source of information we have. A common technique is to mine the text of
publications, but publications are more complex than the text they contain. The position
of the words gives us clues about their meaning. Images either supplement the
text or offer evidence for a proposition. Tables cannot be understood without first
deciphering their rows and columns. To deal with this additional information, classic text
mining techniques have to be coupled with spatial data and image data. In this chapter, we
give some background on the various techniques, explain the necessary pre-processing
steps and present two case studies, one from image mining and one from table
identification.
1.1
Introduction
A scientific document is more complex than it seems. While readers can easily deduce
the structure and semantics of the different characters and pictures on a page,
most of this structural information is not stored and is therefore unavailable when the
publication is accessed automatically. Most text mining applications for scientific
literature try to find facts. In biology, these are facts like gene-to-gene
relationships [1, 2] or gene expression profiling [3], mostly in abstracts [4], as
these are the most easily available. It has been shown that these techniques can give
biologically significant results [5]. Yet more information can be obtained by searching
through full-text papers [6]. In [7] it was shown that different kinds of information
are stored in different kinds of sections. Although the abstract has the highest
information density, other sections contain useful information as well. In [8], it
was observed that analyzing figure captions is of great value. Classifying
the different sections of a paper in order to analyze them separately, as in [9], has been
attempted successfully, but curiously, figure captions and tables have not
been examined.
 