Vector-based drawings are inserted into the text stream as commands such as m for
move or l for line. Converting them to pixel images is done by replaying all
commands on a blank canvas. To avoid extracting small decorative objects such as
footnote rules or table separators, we require a certain density of commands
before accepting a drawing as a full-fledged image.
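The density check described above can be sketched as follows. The command format and the threshold are illustrative assumptions, not the actual parameters used in the extraction pipeline:

```python
# Minimal sketch: accept a vector drawing as a full-fledged image only if it
# contains enough drawing commands. MIN_COMMANDS is an assumed threshold.

MIN_COMMANDS = 20

def is_full_image(commands):
    """commands: list of (op, args) tuples, e.g. ('m', (x, y)) or ('l', (x, y))."""
    drawing_ops = [op for op, _ in commands if op in ('m', 'l', 'c', 're')]
    return len(drawing_ops) >= MIN_COMMANDS

# A footnote rule is just a single move and a single line, so it is rejected:
assert not is_full_image([('m', (0, 0)), ('l', (100, 0))])
```

A real implementation would also consider the bounding box of the drawing, since many commands crowded into a tiny area may still be decoration.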
1.4.2 Caption Identification
By convention, images or figures in scientific literature are accompanied by captions
that explain their meaning. This is very useful for the mining process, as
the figures can then be found by a simple term search. When identifying the
captions, the first step is to look for paragraphs starting with “Fig” in any
style of writing. This feature is highly distinctive: so far, we have found only
one paragraph starting with “Fig” that was not an actual figure caption.
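The “Fig” prefix test can be expressed as a short case-insensitive match. The regular expression below is an illustrative sketch of this check, not the exact rule from the extraction pipeline:

```python
import re

# Match paragraphs that begin with "Fig" in any capitalisation,
# e.g. "Fig. 3: ...", "FIGURE 1. ...", "Figure 2 shows ...".
CAPTION_RE = re.compile(r'^\s*fig', re.IGNORECASE)

def is_caption_candidate(paragraph):
    """Return True if the paragraph looks like a figure caption."""
    return bool(CAPTION_RE.match(paragraph))
```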
If there is only one caption candidate on the same page, the assignment to the
picture is clear. Even several pictures for one caption candidate is not a problem;
in fact, this happens quite often when the figure is composed of more than one
picture (e.g., before and after). Two problems may occur, though: first, there
might be more than one caption candidate (because, for instance, there is more
than one figure). Second, there might be no caption candidate, either because
the caption does not start with “Fig” or because there is none at all (the image
might be a logo or another non-captioned image).
We solve both problems the same way. Using the layout information gained
during the extraction process, we look for likely positions of a caption. The
general goal is to pick the candidate closest to the image, preferably below it.
To do so, we assign scores for different kinds of proximity.
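A scoring scheme of this kind could look as follows. The bonus for candidates below the image and the distance weights are illustrative assumptions; the coordinate convention (y grows downward) is the usual one for page layouts:

```python
# Hedged sketch of proximity scoring between an image and caption candidates.
# Boxes are (left, top, right, bottom) in page coordinates, y growing downward.

BELOW_BONUS = 50  # assumed bonus for the preferred position below the image

def score_candidate(image_box, caption_box):
    """Higher score means a more likely caption for this image."""
    il, it, ir, ib = image_box
    cl, ct, cr, cb = caption_box
    if ct >= ib:                 # candidate starts below the image (preferred)
        gap, bonus = ct - ib, BELOW_BONUS
    elif cb <= it:               # candidate ends above the image
        gap, bonus = it - cb, 0
    else:                        # vertical overlap (e.g. side by side)
        gap, bonus = 0, 0
    horiz = abs((il + ir) / 2 - (cl + cr) / 2)  # horizontal centre offset
    return bonus - gap - horiz

def pick_caption(image_box, candidates):
    """Choose the highest-scoring caption candidate for the image."""
    return max(candidates, key=lambda c: score_candidate(image_box, c))
```

With an image at (0, 0, 100, 100), a candidate just below it outscores an equally distant candidate above it, which matches the stated preference.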
We considered finding the parameters for this algorithm by machine learning,
but it is hard to assemble a good training set: we need scientific papers
with two or preferably more figures on at least one page, already annotated, and
not uniformly produced. We have found 68 so far, all of which performed well with
the initial parameters, except for two rather special cases that were hard to
interpret even for specialists, who could only deduce the connection by visually
matching caption and image. It therefore seems doubtful that machine learning
would produce anything but an algorithm overfitting the training set.
Candidate figure captions that cannot be connected to an image are instead
linked to the whole-page picture, if one exists, or to the first picture on the
page, or otherwise to a dummy image, so that they can still be indexed by the
search engine and then tracked by looking directly at the PDF. This occurs very
rarely, though.
1.4.3 Case Studies for Image Extraction
To evaluate the usefulness of this new search method, we set up a case study with
colleagues from biology. There were two things to observe: how interesting
this method is for the scientists, and how well it works in terms of efficiency.
The PRODORIC database [25] contains highly specific data such as DNA binding
sites of prokaryotic transcriptional regulators. This data is generated via specific