To make the information available on the web, we set up a Tomcat web server
[22] and use Java servlets [23] to generate the website and present the query
results. Every query is executed by a servlet that uses Lucene to fetch the
results from the index files and builds a new web page that displays them
according to the pre-selected schema.
1.4 Image Extraction
While the location and size of an image in a PDF document are clearly described,
the actual image data can be stored in many ways. Images can be encoded as
JPEG, in a bitmap-like format, as a PostScript description, in a fax format,
and in many other ways. Implementing every one of these possibilities would be
tedious, especially as third-party products (e.g. www.pdfgrabber.de) are
available. However, those products do not report the position of the pictures
on the page, so the extracted pictures must be matched to the layout data we
have already collected. In addition, some graphics are not caught by image
extractors because they are produced by graphical commands in the PDF script
language, which is very similar to PostScript.
1.4.1 Images in PDF
There are two ways to represent images in PDF documents. The text stream
can include graphical commands that draw lines or other geometrical objects.
Alternatively, an image can be stored as an external object called an XObject.
An XObject is invoked from the text stream with the command Do (execute the
named XObject).
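As an illustration, the Do invocations in a page's stream can be located with a
simple pattern match. This is only a minimal sketch, not the extraction code
used in this work: the function name find_xobject_calls and the sample stream
are hypothetical, and the sketch assumes that the stream has already been
decompressed and contains no string literals that merely look like operators.

```python
import re

# A PDF name followed by the Do operator, e.g. "/Im3 Do".
# Name characters exclude whitespace and the PDF delimiter characters.
DO_CALL = re.compile(rb"/([^\s/\[\]<>(){}%]+)\s+Do\b")

def find_xobject_calls(stream: bytes) -> list[str]:
    """Return the names of all XObjects invoked with Do, in stream order."""
    return [m.group(1).decode("latin-1") for m in DO_CALL.finditer(stream)]

# Hypothetical, already-decoded stream: one image, some text, one form XObject.
sample = (
    b"q 290 0 0 325.5 72 100 cm /Im3 Do Q\n"
    b"BT /F1 12 Tf (Hello) Tj ET\n"
    b"q 1 0 0 1 0 0 cm /Fm1 Do Q"
)
print(find_xobject_calls(sample))
```

Each returned name still has to be resolved to its object through the page's
resources, for instance to an object like the one in the listing at the end of
this section.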
The example object below represents an image that can be invoked by entering
/Im3 Do into the text stream. The object called Im3 is then identified and
executed. From the object dictionary we can obtain some information, such as
the width (line 5) and height (line 6), although this information might not
be accurate. The true height and width are calculated and, together with
the current position, give the bounding box of the picture. To actually extract
the picture, we need the filter (given in line 10), in this case DCTDecode,
which is the PDF name for JPEG encoding [24].
1 22 0 obj
2 << /Type /XObject
3 /Subtype /Image
4 /Name /Im3
5 /Width 580
6 /Height 651
7 /BitsPerComponent 8
8 /ColorSpace /DeviceGray
9 /Length 31853
10 /Filter /DCTDecode>>
11 stream ... endstream endobj
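
The two steps described above, reading the dictionary entries and deriving the
bounding box, can be sketched as follows. This is a hedged illustration rather
than the implementation used here. The names parse_image_dict, image_bbox and
FILTER_FORMATS are hypothetical, the parser only handles flat dictionaries
whose entries are whitespace-separated names or integers as in the listing, and
the transformation matrix values are invented for the example. The bounding box
relies on the PDF imaging model, in which an image is painted into the unit
square transformed by the current transformation matrix [a b c d e f].

```python
import re

# Filter names from the PDF specification and the encoding each one denotes.
FILTER_FORMATS = {
    "DCTDecode": "JPEG",
    "JPXDecode": "JPEG 2000",
    "CCITTFaxDecode": "CCITT fax",
    "FlateDecode": "zlib/deflate-compressed samples",
}

def parse_image_dict(obj_text: str) -> dict:
    """Collect '/Key value' pairs from a flat dictionary (names and integers only)."""
    body = obj_text[obj_text.index("<<") + 2 : obj_text.index(">>")]
    entries = {}
    for key, value in re.findall(r"/(\w+)\s+(/\w+|-?\d+)", body):
        # Name values keep their text without the slash; numbers become ints.
        entries[key] = value[1:] if value.startswith("/") else int(value)
    return entries

def image_bbox(ctm):
    """Bounding box (x_min, y_min, x_max, y_max) of the transformed unit square."""
    a, b, c, d, e, f = ctm
    corners = [(0, 0), (1, 0), (0, 1), (1, 1)]
    xs = [a * x + c * y + e for x, y in corners]
    ys = [b * x + d * y + f for x, y in corners]
    return min(xs), min(ys), max(xs), max(ys)

# The object from the listing, without the listing's line numbers.
sample_obj = """22 0 obj
<< /Type /XObject
/Subtype /Image
/Name /Im3
/Width 580
/Height 651
/BitsPerComponent 8
/ColorSpace /DeviceGray
/Length 31853
/Filter /DCTDecode>>
stream ... endstream endobj"""

info = parse_image_dict(sample_obj)
print(info["Width"], info["Height"], FILTER_FORMATS[info["Filter"]])

# Hypothetical matrix: the image is drawn 290 x 325.5 units large at (72, 100).
print(image_bbox((290, 0, 0, 325.5, 72, 100)))
```

In this hypothetical case the size on the page differs from the /Width and
/Height entries, which give only the pixel dimensions of the stored samples.
This is one reason the dictionary values alone do not yield the bounding box.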
 