Using Layout Data for the Analysis of Scientific Literature - Mining Complex Data

To make the information available on the web, we set up a Tomcat web server
[22] and use Java servlets [23] to generate the website and to present the query
results. Every query is executed by a servlet that uses Lucene to fetch the results
from the index files and then builds a new web page that displays them according
to the pre-selected schema.

1.4 Image Extraction

While the location and size of an image in a PDF document are clearly described,
the actual image data can be stored in many ways: as JPEG, in a bitmap-like
format, as a PostScript description, in a fax format, and in many others.
Implementing every one of these possibilities would be tedious, especially as
third-party products (e.g. www.pdfgrabber.de) are available. Those products,
however, do not reveal the position of the pictures on the page, so the extracted
pictures must be matched to the layout data we have already collected. In
addition, some graphics are not caught by image extractors because they are
drawn by graphical commands in the PDF page description language, which is
very similar to PostScript.
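The text does not say how the extracted pictures are matched to the layout data. Purely as an illustration, the sketch below pairs each extracted picture with the layout bounding box whose aspect ratio is closest, using a simple greedy assignment. The function name, the file names, and the box coordinates are hypothetical, and the heuristic assumes non-degenerate boxes and pictures that are not stretched on the page.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


def match_images_to_boxes(image_sizes: Dict[str, Tuple[int, int]],
                          boxes: List[Box]) -> Dict[str, int]:
    """Greedily pair each picture with the unused box of closest aspect ratio.

    This is an assumed heuristic for illustration, not the method of the text.
    """
    candidates = []
    for name, (width, height) in image_sizes.items():
        for index, (x0, y0, x1, y1) in enumerate(boxes):
            box_ratio = (x1 - x0) / (y1 - y0)
            candidates.append((abs(width / height - box_ratio), name, index))

    candidates.sort()  # best-fitting pairs first

    assigned: Dict[str, int] = {}
    used_boxes = set()
    for _, name, index in candidates:
        if name in assigned or index in used_boxes:
            continue
        assigned[name] = index
        used_boxes.add(index)
    return assigned


# Hypothetical extractor output (pixel sizes) and layout boxes.
pictures = {"pic_a.jpg": (580, 651), "pic_b.jpg": (800, 400)}
layout_boxes = [(100, 100, 300, 200), (72, 500, 272, 724)]
print(sorted(match_images_to_boxes(pictures, layout_boxes).items()))
# prints [('pic_a.jpg', 1), ('pic_b.jpg', 0)]
```

In practice such a heuristic would probably need further cues, for example the page number or the order in which the pictures occur, to separate pictures with similar proportions.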

1.4.1 Images in PDF

There are two ways to represent images in PDF documents. The text stream
(the page's content stream in PDF terminology) can include graphical commands
that draw lines or other geometrical objects. Alternatively, an image can be
stored as an external object called an XObject. An XObject is invoked from the
text stream with the command Do (execute the named XObject).
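As a minimal sketch of how XObject invocations could be located, the following snippet scans an already decoded content stream for a name operand followed by the Do operator. The function name and the sample stream are assumptions made for illustration. A plain regular expression is used instead of a real PDF tokenizer, so it would also match such a sequence inside a string literal.

```python
import re
from typing import List

# A PDF name ("/Im3") followed by whitespace and the Do operator.
# Names end at whitespace or at one of the PDF delimiter characters.
DO_PATTERN = re.compile(rb"/([^\s/\[\]<>(){}%]+)\s+Do\b")


def find_xobject_calls(content_stream: bytes) -> List[str]:
    """Return the names of all XObjects invoked with Do, in stream order."""
    return [match.group(1).decode("latin-1")
            for match in DO_PATTERN.finditer(content_stream)]


# Hypothetical, uncompressed content stream.
sample = b"q 200 0 0 224 72 500 cm /Im3 Do Q BT /F1 12 Tf (Text) Tj ET q /Fm1 Do Q"
print(find_xobject_calls(sample))  # prints ['Im3', 'Fm1']
```

Note that Do executes any kind of XObject, so a name found this way is only a candidate; whether it is an image is decided by the /Subtype entry of the object it refers to.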

The example object below represents an image that can be called by entering
/Im3 Do into the text stream. The object named Im3 is then identified and
executed. From the object dictionary we can obtain information such as the
width (line 5) and the height (line 6), although this information might not
reflect the size of the picture on the page. The true height and width are
calculated and, together with the current position, give the bounding box of
the picture. To actually extract the picture, we need the filter (given in
line 10), in this case DCTDecode, which is the PDF name for JPEG encoding [24].
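One way the calculation of the bounding box could look is sketched below. It relies on the PDF imaging model, in which an image XObject is painted into the unit square of user space and the current transformation matrix (CTM) [a b c d e f] maps that square onto the page. This is not necessarily the procedure used in the text, and the function name and the matrix values are hypothetical.

```python
from typing import Tuple

Matrix = Tuple[float, float, float, float, float, float]  # PDF matrix [a b c d e f]


def image_bounding_box(ctm: Matrix) -> Tuple[float, float, float, float]:
    """Bounding box (x_min, y_min, x_max, y_max) of the transformed unit square.

    A point (x, y) is mapped to (a*x + c*y + e, b*x + d*y + f).
    """
    a, b, c, d, e, f = ctm
    corners = [(0, 0), (1, 0), (0, 1), (1, 1)]
    xs = [a * x + c * y + e for x, y in corners]
    ys = [b * x + d * y + f for x, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical CTM: scale to 200 x 224 units and move to position (72, 500).
print(image_bounding_box((200, 0, 0, 224, 72, 500)))  # prints (72, 500, 272, 724)
```

Taking the minimum and maximum over all four corners keeps the result correct for rotated or mirrored images, where the entries a and d alone no longer give the width and height.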

 1  22 0 obj
 2  << /Type /XObject
 3     /Subtype /Image
 4     /Name /Im3
 5     /Width 580
 6     /Height 651
 7     /BitsPerComponent 8
 8     /ColorSpace /DeviceGray
 9     /Length 31853
10     /Filter /DCTDecode >>
11  stream ... endstream endobj
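To illustrate how the dictionary entries of such an object might be read, the following sketch parses a flat dictionary like the one above and maps the filter name to the encoding of the stream data. The function names are hypothetical, and the parser is deliberately limited: it only handles entries whose value is a name or an integer, not nested dictionaries, arrays, or indirect references. The filter names themselves are the standard ones from the PDF specification.

```python
import re
from typing import Dict, Union

# Standard PDF stream filters and the encoding they indicate.
FILTER_TO_FORMAT = {
    "DCTDecode": "JPEG",
    "JPXDecode": "JPEG 2000",
    "CCITTFaxDecode": "CCITT fax (Group 3/4)",
    "JBIG2Decode": "JBIG2",
    "FlateDecode": "zlib/deflate-compressed samples",
    "LZWDecode": "LZW-compressed samples",
}

# "/Key /Name" or "/Key 123" pairs only; anything else is skipped.
ENTRY = re.compile(r"/(\w+)\s*(/\w+|-?\d+)")


def parse_flat_dictionary(text: str) -> Dict[str, Union[str, int]]:
    """Parse name and integer entries of a flat PDF dictionary."""
    entries: Dict[str, Union[str, int]] = {}
    for key, value in ENTRY.findall(text):
        entries[key] = value[1:] if value.startswith("/") else int(value)
    return entries


def image_format(entries: Dict[str, Union[str, int]]) -> str:
    """Describe the encoding of the image stream from its /Filter entry."""
    if "Filter" not in entries:
        return "raw samples (no filter)"
    return FILTER_TO_FORMAT.get(str(entries["Filter"]), "unknown")


dictionary = """<< /Type /XObject /Subtype /Image /Name /Im3
   /Width 580 /Height 651 /BitsPerComponent 8
   /ColorSpace /DeviceGray /Length 31853 /Filter /DCTDecode >>"""
info = parse_flat_dictionary(dictionary)
print(info["Width"], info["Height"], image_format(info))  # prints 580 651 JPEG
```

For a DCTDecode image the stream bytes already form JPEG data, whereas most other filters yield data that must be decoded and wrapped in a suitable image file format before it can be viewed.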
