1.3.1 Finding the Literature
The first step in pre-processing is to build a corpus. To find a suitable
corpus for a specific topic, several open search engines and databases, such as
Google, Citeseer or PubMed, can be used. Still, a number of papers are not free
and can only be downloaded from the publisher's website. When downloading
from the original publisher's website, it is often quite difficult to find the specific
paper, as publishers use different structures for storing their publications.
The corpora for our case studies were either provided directly by our collab-
oration partners or had to be acquired using said databases. To build a corpus
from scratch automatically, we developed a downloading tool. It only needs a
broad search query describing the topic. That query is fed into suitable open
databases and then cross-referenced with the available publisher websites. From
there, we should be able to find a link to the full paper. The page is crawled and
every link is weighted according to whether it contains interesting keywords, such
as “pdf”, “reprint”, a volume number or an issue number. We also assign penalties
for less interesting terms like “abstract”, “guide”, “faq”, and so on. The links are
followed in the order of their weights; links with negative weights are never followed.
The first link that leads to a PDF document is downloaded. To prevent the
downloading of unwanted documents, we also compiled a blacklist of terms that
are not allowed to be part of a followed link, like “manual”, “adobe.com” and so
on; a sketch of this scoring scheme is given below.
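The following sketch illustrates this weighting scheme. The keyword lists, weights and scoring details are illustrative assumptions, not the actual configuration of our tool.

```python
import re

# Illustrative keyword weights; the real lists and values in our tool differ.
POSITIVE_TERMS = {"pdf": 10, "reprint": 8, "fulltext": 6}
PENALTY_TERMS = {"abstract": -5, "guide": -5, "faq": -5}
BLACKLIST = ("manual", "adobe.com")  # links containing these are never followed

# Volume or issue numbers hint at a page close to the full paper.
VOLUME_ISSUE = re.compile(r"\b(vol(ume)?|issue)\W*\d+", re.IGNORECASE)

def score_link(url: str, anchor_text: str) -> int | None:
    """Weight a link by its keywords; return None for blacklisted links."""
    text = (url + " " + anchor_text).lower()
    if any(term in text for term in BLACKLIST):
        return None  # blacklisted links are discarded outright
    score = 0
    for term, weight in POSITIVE_TERMS.items():
        if term in text:
            score += weight
    for term, penalty in PENALTY_TERMS.items():
        if term in text:
            score += penalty
    if VOLUME_ISSUE.search(text):
        score += 4
    return score

def rank_links(links):
    """Order candidate (url, anchor_text) pairs by weight;
    blacklisted and negative-weight links are never followed."""
    scored = []
    for url, anchor_text in links:
        s = score_link(url, anchor_text)
        if s is not None and s >= 0:
            scored.append((s, url))
    return [url for s, url in sorted(scored, reverse=True)]
```

The crawler would then request the ranked links in order and stop at the first response that is an actual PDF document.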
When we do not find a suitable link to follow, we assume that we do not have
permission to download the paper. The downloader is able to handle multiple
proxy configurations, so it is possible to use different licenses simultaneously. Thus,
when we do not find a document, we try to open an alternative connection using
another license, or follow an alternative link given by the database.
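A minimal sketch of this fallback behaviour, assuming the hypothetical proxy addresses below and the Python requests library, could look as follows:

```python
import time
import requests

# Hypothetical proxy configurations; each route corresponds to a different
# license (e.g. a library subscription reachable through a dedicated proxy).
PROXY_CONFIGS = [
    None,  # try a direct connection first
    {"http": "http://proxy-a.example.org:8080",
     "https": "http://proxy-a.example.org:8080"},
    {"http": "http://proxy-b.example.org:8080",
     "https": "http://proxy-b.example.org:8080"},
]

def fetch_pdf(url: str) -> bytes | None:
    """Try each connection in turn until one returns a PDF document."""
    for proxies in PROXY_CONFIGS:
        try:
            response = requests.get(url, proxies=proxies, timeout=30)
        except requests.RequestException:
            continue  # this license/route failed; try the next one
        content_type = response.headers.get("Content-Type", "")
        if response.ok and content_type.startswith("application/pdf"):
            return response.content
        time.sleep(5)  # be polite to the source before retrying
    return None  # no license gave access; assume no permission
```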
The process of downloading is relatively slow, because we want to avoid over-
loading the source databases. It takes roughly 30 seconds per document, most of
which is spent in generous timeouts respecting the local rules of the databases
involved. The downloader is quite successful. In a test with the query “gene
expression microarray rat”, PubMed, the database specialising in biomedical
literature, returned 1244 full-paper links. Using a regular library license, we
managed to download 598 of them, with only 34 false positives. Searching for the
same documents using only the free-text links provided by PubMed yielded just
485.
The system's precision can be improved through usage, as we can trace
where the false positives came from and add more terms to the blacklist or the
penalty list; a sketch of this feedback step follows below. Improving recall is
difficult, as a “missed” paper is rarely noticed.
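One simple way to support this feedback step is to mine the logged false-positive links for frequent tokens and propose them as blacklist or penalty candidates. The helper below is a hypothetical illustration, not part of the original tool:

```python
from collections import Counter
from urllib.parse import urlparse

def suggest_blacklist_terms(false_positive_urls, known_terms, top_n=10):
    """Count URL tokens of logged false positives and propose new
    blacklist/penalty candidates for manual review."""
    counts = Counter()
    for url in false_positive_urls:
        parsed = urlparse(url)
        raw = (parsed.netloc + parsed.path).replace("-", "/").replace(".", "/")
        tokens = raw.split("/")
        counts.update(t.lower() for t in tokens
                      if t and t.lower() not in known_terms)
    return counts.most_common(top_n)

# Tokens shared by many false positives are good blacklist candidates,
# e.g.: suggest_blacklist_terms(logged_urls, known_terms={"pdf", "www"})
```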
1.3.2 Extracting Layout Information from the PDF Format
A typical vector-based format like PDF or PostScript does not directly give
the position of the text. The description is not pixel-based, but consists
of lines and curves that form the text. In this way, the information about the
document stays more authentic, because scaling does not matter and a lot of the
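One practical way to recover text positions is to let a PDF library interpret these drawing commands and report word bounding boxes. The sketch below uses the open-source PyMuPDF library; this choice is an assumption made for illustration, not necessarily the extractor used here:

```python
import fitz  # PyMuPDF

def extract_words_with_positions(path: str):
    """Return every word together with its bounding box on each page."""
    words = []
    with fitz.open(path) as doc:
        for page_number, page in enumerate(doc):
            # get_text("words") yields (x0, y0, x1, y1, word, block, line, word_no)
            for x0, y0, x1, y1, word, *_ in page.get_text("words"):
                words.append((page_number, word, (x0, y0, x1, y1)))
    return words
```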