Vector-based drawings are inserted into the text stream as commands such as m for
move or l for line. Converting them to pixel images is done by replaying all
commands on a blank canvas. To avoid extracting small decorative objects such as
footnote rules or table separators, we require a certain density of commands
before accepting a drawing as a full-fledged image.
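The density check described above can be sketched as follows. The command format and the threshold are illustrative assumptions, not the actual parameters used in the extraction pipeline:

```python
# Minimal sketch: accept a vector drawing as a full-fledged image only if it
# contains enough drawing commands. MIN_COMMANDS is an assumed threshold.

MIN_COMMANDS = 20

def is_full_image(commands):
    """commands: list of (op, args) tuples, e.g. ('m', (x, y)) or ('l', (x, y))."""
    drawing_ops = [op for op, _ in commands if op in ('m', 'l', 'c', 're')]
    return len(drawing_ops) >= MIN_COMMANDS

# A footnote rule is just a single move and a single line, so it is rejected:
assert not is_full_image([('m', (0, 0)), ('l', (100, 0))])
```

A real implementation would also consider the bounding box of the drawing, since many commands crowded into a tiny area may still be decoration.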
1.4.2 Caption Identification
By convention, images or figures in scientific literature are accompanied by captions
that explain their meaning. This is very useful for the mining process, as
the figures can then be found by a simple term search. When identifying the
captions, the first step is to look for paragraphs starting with “Fig” in any
style of writing. This feature is highly distinctive: so far, we have found only
one paragraph starting with “Fig” that was not an actual figure caption.
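The “Fig” prefix test can be expressed as a short case-insensitive match. The regular expression below is an illustrative sketch of this check, not the exact rule from the extraction pipeline:

```python
import re

# Match paragraphs that begin with "Fig" in any capitalisation,
# e.g. "Fig. 3: ...", "FIGURE 1. ...", "Figure 2 shows ...".
CAPTION_RE = re.compile(r'^\s*fig', re.IGNORECASE)

def is_caption_candidate(paragraph):
    """Return True if the paragraph looks like a figure caption."""
    return bool(CAPTION_RE.match(paragraph))
```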
If there is only one caption candidate on the same page, the assignment to the
picture is clear. Even several pictures for one caption candidate is not a problem;
in fact, this happens quite often when the figure is composed of more than one
picture (e.g., before and after). Two problems may occur, though: first, there
might be more than one caption candidate (because, for instance, there is more
than one figure). Second, there might be no caption candidate, either because
the caption does not start with “Fig” or because there is none at all (the image
might be a logo or another non-captioned image).
We solve both problems the same way. Using the layout information gained
during the extraction process, we look for likely positions of a caption. The
general goal is to pick the candidate closest to the image, preferably below it.
To do so, we assign scores for different kinds of proximity.
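A scoring scheme of this kind could look as follows. The bonus for candidates below the image and the distance weights are illustrative assumptions; the coordinate convention (y grows downward) is the usual one for page layouts:

```python
# Hedged sketch of proximity scoring between an image and caption candidates.
# Boxes are (left, top, right, bottom) in page coordinates, y growing downward.

BELOW_BONUS = 50  # assumed bonus for the preferred position below the image

def score_candidate(image_box, caption_box):
    """Higher score means a more likely caption for this image."""
    il, it, ir, ib = image_box
    cl, ct, cr, cb = caption_box
    if ct >= ib:                 # candidate starts below the image (preferred)
        gap, bonus = ct - ib, BELOW_BONUS
    elif cb <= it:               # candidate ends above the image
        gap, bonus = it - cb, 0
    else:                        # vertical overlap (e.g. side by side)
        gap, bonus = 0, 0
    horiz = abs((il + ir) / 2 - (cl + cr) / 2)  # horizontal centre offset
    return bonus - gap - horiz

def pick_caption(image_box, candidates):
    """Choose the highest-scoring caption candidate for the image."""
    return max(candidates, key=lambda c: score_candidate(image_box, c))
```

With an image at (0, 0, 100, 100), a candidate just below it outscores an equally distant candidate above it, which matches the stated preference.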
We considered finding the parameters for this algorithm by machine learning,
but it is hard to assemble a good training set: we need scientific papers
with two or preferably more figures on at least one page, already annotated, and
not uniformly produced. We have found 68 so far, all of which performed well with
the initial parameters, except for two rather special cases that were hard to
interpret even for specialists, who could only deduce the connection by visually
matching caption and image. It therefore seems doubtful that machine learning
would produce anything but an algorithm overfitting the training set.
Candidate figure captions that cannot be connected to an image are instead
linked to the whole-page picture, if one exists, or to the first picture on the
page, or otherwise to a dummy image, so that they can still be indexed by the
search engine and then tracked by looking directly at the PDF. This occurs very
rarely, though.
1.4.3 Case Studies for Image Extraction
To evaluate the usefulness of this new search method, we set up a case study with
colleagues from biology. There were two things to observe: how interesting
this method is for the scientists, and how well it works in terms of efficiency.
The PRODORIC database [25] contains highly specific data such as DNA binding
sites of prokaryotic transcriptional regulators. This data is generated via specific