1.3.1 Finding the Literature
The first step in pre-processing is to build a corpus. To find a suitable
corpus for a specific topic, several open search engines and databases, such as
Google, Citeseer or PubMed, can be used. Still, a number of papers are not free
and can only be downloaded from the publisher's website. When downloading
from the original publisher's website, it is often quite difficult to find the specific
paper, as publishers use different structures for storing their publications.
The corpora for our case studies were either provided directly by our collab-
oration partners or had to be acquired using said databases. To build a corpus
from scratch automatically, we developed a downloading tool. It only needs a
broad search query describing the topic. That query is fed into suitable open
databases and then cross-referenced with the available publisher websites. From
there, we should be able to find a link to the full paper. The page is crawled and
every link is weighted according to whether it contains interesting keywords, such
as “pdf”, “reprint”, a volume number or an issue number. We also assign penalties
for less interesting terms like “abstract”, “guide”, “faq”, and so on. The links are
followed in the order of their weights; links with negative weights are never followed.
The first link that leads to a PDF document is downloaded. To prevent the
downloading of unwanted documents, we also compiled a blacklist of terms that
are not allowed to be part of a followed link, like “manual”, “adobe.com” and so
on; a sketch of this scoring scheme is given below.
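The following sketch illustrates this weighting scheme. The keyword lists, weights and scoring details are illustrative assumptions, not the actual configuration of our tool.

```python
import re

# Illustrative keyword weights; the real lists and values in our tool differ.
POSITIVE_TERMS = {"pdf": 10, "reprint": 8, "fulltext": 6}
PENALTY_TERMS = {"abstract": -5, "guide": -5, "faq": -5}
BLACKLIST = ("manual", "adobe.com")  # links containing these are never followed

# Volume or issue numbers hint at a page close to the full paper.
VOLUME_ISSUE = re.compile(r"\b(vol(ume)?|issue)\W*\d+", re.IGNORECASE)

def score_link(url: str, anchor_text: str) -> int | None:
    """Weight a link by its keywords; return None for blacklisted links."""
    text = (url + " " + anchor_text).lower()
    if any(term in text for term in BLACKLIST):
        return None  # blacklisted links are discarded outright
    score = 0
    for term, weight in POSITIVE_TERMS.items():
        if term in text:
            score += weight
    for term, penalty in PENALTY_TERMS.items():
        if term in text:
            score += penalty
    if VOLUME_ISSUE.search(text):
        score += 4
    return score

def rank_links(links):
    """Order candidate (url, anchor_text) pairs by weight;
    blacklisted and negative-weight links are never followed."""
    scored = []
    for url, anchor_text in links:
        s = score_link(url, anchor_text)
        if s is not None and s >= 0:
            scored.append((s, url))
    return [url for s, url in sorted(scored, reverse=True)]
```

The crawler would then request the ranked links in order and stop at the first response that is an actual PDF document.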
When we do not find a suitable link to follow, we assume that we do not have
permission to download the paper. The downloader is able to handle multiple
proxy configurations, so it is possible to use different licenses simultaneously. Thus,
when we do not find a document, we try to open an alternative connection using
another license, or follow an alternative link given by the database.
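A minimal sketch of this fallback behaviour, assuming the hypothetical proxy addresses below and the Python requests library, could look as follows:

```python
import time
import requests

# Hypothetical proxy configurations; each route corresponds to a different
# license (e.g. a library subscription reachable through a dedicated proxy).
PROXY_CONFIGS = [
    None,  # try a direct connection first
    {"http": "http://proxy-a.example.org:8080",
     "https": "http://proxy-a.example.org:8080"},
    {"http": "http://proxy-b.example.org:8080",
     "https": "http://proxy-b.example.org:8080"},
]

def fetch_pdf(url: str) -> bytes | None:
    """Try each connection in turn until one returns a PDF document."""
    for proxies in PROXY_CONFIGS:
        try:
            response = requests.get(url, proxies=proxies, timeout=30)
        except requests.RequestException:
            continue  # this license/route failed; try the next one
        content_type = response.headers.get("Content-Type", "")
        if response.ok and content_type.startswith("application/pdf"):
            return response.content
        time.sleep(5)  # be polite to the source before retrying
    return None  # no license gave access; assume no permission
```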
The process of downloading is relatively slow, because we want to avoid over-
loading the source databases. It takes roughly 30 seconds per document, most of
which is spent in generous timeouts respecting the local rules of the databases
involved. The downloader is quite successful. In a test with the query “gene
expression microarray rat”, PubMed, the database specialising in biomedical
literature, returned 1244 full-paper links. Using a regular library license, we
managed to download 598 of them, with only 34 false positives. Searching for the
same documents using only the free-text links provided by PubMed yielded just
485.
The system's precision can be improved through usage, as we can trace
where the false positives came from and add more terms to the blacklist or the
penalty list; a sketch of this feedback step follows below. Improving recall is
difficult, as a “missed” paper is rarely noticed.
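One simple way to support this feedback step is to mine the logged false-positive links for frequent tokens and propose them as blacklist or penalty candidates. The helper below is a hypothetical illustration, not part of the original tool:

```python
from collections import Counter
from urllib.parse import urlparse

def suggest_blacklist_terms(false_positive_urls, known_terms, top_n=10):
    """Count URL tokens of logged false positives and propose new
    blacklist/penalty candidates for manual review."""
    counts = Counter()
    for url in false_positive_urls:
        parsed = urlparse(url)
        raw = (parsed.netloc + parsed.path).replace("-", "/").replace(".", "/")
        tokens = raw.split("/")
        counts.update(t.lower() for t in tokens
                      if t and t.lower() not in known_terms)
    return counts.most_common(top_n)

# Tokens shared by many false positives are good blacklist candidates,
# e.g.: suggest_blacklist_terms(logged_urls, known_terms={"pdf", "www"})
```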
1.3.2 Extracting Layout Information from the PDF Format
A typical vector-based format like PDF or PostScript does not directly give
the position of the text. The description is not pixel-based, but consists
of lines and curves that form the text. In this way, the information about the
document stays more authentic, because scaling does not matter and a lot of the
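One practical way to recover text positions is to let a PDF library interpret these drawing commands and report word bounding boxes. The sketch below uses the open-source PyMuPDF library; this choice is an assumption made for illustration, not necessarily the extractor used here:

```python
import fitz  # PyMuPDF

def extract_words_with_positions(path: str):
    """Return every word together with its bounding box on each page."""
    words = []
    with fitz.open(path) as doc:
        for page_number, page in enumerate(doc):
            # get_text("words") yields (x0, y0, x1, y1, word, block, line, word_no)
            for x0, y0, x1, y1, word, *_ in page.get_text("words"):
                words.append((page_number, word, (x0, y0, x1, y1)))
    return words
```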