Finding and replacing image and font streams (iText 5)

When you create an image using the Image class, or a font using the Font or BaseFont class, you don’t have to worry about the way these objects are stored in the finished document. For example, when you use a standard Type 1 font, iText will add a font dictionary to the PDF file. When you use a font that is embedded, the font dictionary will also refer to a stream with a full or partial font program that is copied into the PDF file.

In this section, we’ll look at advanced techniques that address the lowest level of PDF creation and manipulation with iText. The examples that follow were inspired by questions that were posted to the mailing list.

Adding a special ID to an Image

In the previous chapter, you learned how to extract all the images from a page, but what if you want to pick one specific image programmatically?

An image is stored in a stream object. Each stream consists of a dictionary followed by zero or more bytes bracketed between the keywords stream and endstream (see table 13.2). The entries of the stream dictionary are filled in by iText. In the case of images, you’ll have at least entries for the width and the height of the image, and a value defining the compression filter, but there’s no reference to the original filename. The original bits and bytes of the image may have been changed completely.

One of the mailing-list subscribers wanted to solve the problem of retrieving specific images by adding an extra entry to the image stream dictionary. Listing 16.1 was written in answer to his question.


Listing 16.1 SpecialId.java


You create an instance of the high-level Image object, and set some properties, as described in chapter 2.

You use this Image object to create a low-level PdfImage object. This object extends the PdfStream class. With the second parameter, you can pass a name for the image; the third parameter can be used for the reference to a mask image.

PdfStream extends PdfDictionary. Just like with plain dictionaries, you can add key-value pairs. In this case, you choose a name for the key using the prefix reserved for iText (ITXT): ITXT_SpecialId. The value of the entry is also a name of your choice, in this case /123456789.

You add the stream object to the body of the file that is written by the PdfWriter object. The addToBody() method returns a PdfIndirectObject. Because it’s the first element that’s added to the writer in this example, the reference of this object will be 1 0 R.
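To make the result of these steps more concrete, the following sketch assembles the text of an indirect stream object by hand, without iText, roughly as it could appear in the file. The class and method names (StreamObjectSketch, render) are hypothetical, the width and height values are illustrative, and the data is a placeholder; an image stream written by iText contains more entries and real binary data.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of iText: renders the syntax of an indirect stream object.
class StreamObjectSketch {

    // Builds "<number> 0 obj", the stream dictionary, and the bytes between stream and endstream.
    static String render(int objectNumber, Map<String, String> entries, byte[] data) {
        StringBuilder sb = new StringBuilder();
        sb.append(objectNumber).append(" 0 obj\n<<");
        for (Map.Entry<String, String> e : entries.entrySet()) {
            sb.append('/').append(e.getKey()).append(' ').append(e.getValue());
        }
        // /Length tells a reader how many bytes sit between stream and endstream.
        sb.append("/Length ").append(data.length).append(">>\nstream\n");
        sb.append(new String(data, StandardCharsets.ISO_8859_1));
        sb.append("\nendstream\nendobj\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> entries = new LinkedHashMap<>();
        entries.put("Type", "/XObject");
        entries.put("Subtype", "/Image");
        entries.put("Width", "400");   // illustrative value
        entries.put("Height", "300");  // illustrative value
        // The custom entry: a key with the ITXT prefix and a name as its value.
        entries.put("ITXT_SpecialId", "/123456789");
        byte[] placeholder = "...image bytes...".getBytes(StandardCharsets.ISO_8859_1);
        System.out.print(render(1, entries, placeholder));
    }
}
```

Other objects can then point to this object with the indirect reference 1 0 R, which is what the page resources do once the image has been added to the document.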

You tell the Image object that it has already been added to the writer with the method setDirectReference().

Finally, you add the image to the document. The image bytes have already been written to the OutputStream by the addToBody() call. Adding the image writes the Do operator and its operands to the content stream of the page, and adds the correct reference to the image bytes to the page dictionary.

This example unveils the mechanism that’s used by iText internally to add streams.

You’ll use the PDF file that was created by listing 16.1 in the next example. You’ll search for an image with the special ID /123456789, and you’ll replace it with another image that has a lower resolution.

Resizing an image in an existing document

Here’s another question that is often posted to the mailing list: "How do I reduce the size of an existing PDF containing lots of images?" There are many different answers to this question, depending on the nature of the PDF file. Maybe the same image is added multiple times, in which case passing the PDF through PdfSmartCopy could result in a serious file size reduction. Maybe the PDF wasn’t compressed, or maybe there are plenty of unused objects. You could try to see if the PdfReader’s removeUnusedObjects() method has any effect.

It’s more likely that the PDF contains high-resolution images, in which case the original question should be rephrased as, "How do I reduce the resolution of the images inside my PDF?" To achieve this, you should extract the image from the PDF, downsample it, then put it back into the PDF, replacing the high-resolution image.

The next example uses brute force instead of the PdfReaderContentParser to find images. With the getXrefSize() method, you get the highest object number in the PDF document, and you loop over every object, searching for a stream that has the special ID you’re looking for.

Listing 16.2 ResizeImage.java


Once you’ve found the stream you need, you create a PdfImageObject that will create a java.awt.image.BufferedImage named bi; you’ll create a second BufferedImage named img that is smaller by a given factor. In this example, the value of FACTOR is 0.5. You draw the image bi to the Graphics2D object of the image img using an affine transformation that scales the image down by a factor of FACTOR.
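The downsampling step only needs classes from the JDK, so it can be tried out separately from iText. The following is a minimal sketch of that step, not the original listing: the class and method names (DownsampleSketch, scale, toJpeg) are hypothetical, and the source image is a blank stand-in for the BufferedImage you would obtain from the PDF.

```java
import java.awt.Graphics2D;
import java.awt.geom.AffineTransform;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper: scales a BufferedImage down and encodes the result as JPEG bytes.
class DownsampleSketch {
    static final double FACTOR = 0.5;

    // Draws bi into a new, smaller RGB image using a scaling affine transformation.
    static BufferedImage scale(BufferedImage bi, double factor) {
        // Guard against a zero-sized target for very small source images.
        int width = Math.max(1, (int) (bi.getWidth() * factor));
        int height = Math.max(1, (int) (bi.getHeight() * factor));
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        AffineTransform at = AffineTransform.getScaleInstance(factor, factor);
        Graphics2D g = img.createGraphics();
        g.drawRenderedImage(bi, at);
        g.dispose();
        return img;
    }

    // Encodes the image as a JPEG in memory; these bytes would become the new stream data.
    static byte[] toJpeg(BufferedImage img) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            if (!ImageIO.write(img, "jpg", out)) {
                throw new IllegalStateException("No JPEG writer available");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        BufferedImage bi = new BufferedImage(400, 300, BufferedImage.TYPE_INT_RGB); // stand-in image
        BufferedImage img = scale(bi, FACTOR);
        System.out.println(img.getWidth() + " x " + img.getHeight() + ", "
                + toJpeg(img).length + " JPEG bytes");
    }
}
```

Drawing into a TYPE_INT_RGB image is a deliberate simplification: it discards any transparency, which suits JPEG output but would not suit every image found in a PDF.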

You write the image as a JPEG to a ByteArrayOutputStream, and use the bytes from this OutputStream as the new data for the stream object you’ve retrieved from PdfReader. You reset all the entries in the image dictionary and add all the keys that are necessary for a PDF viewer to interpret the image bytes correctly. After changing the PRStream object in the reader, you use PdfStamper to write the altered file to a FileOutputStream. Again, you get a look at the way iText works internally. When you add a JPEG to a document the normal way, iText selects all the entries for the image dictionary for you.
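As a rough illustration of the keys a viewer needs for such a JPEG, the sketch below collects them in a plain map instead of an iText dictionary. It assumes an 8-bit RGB JPEG like the one produced from a TYPE_INT_RGB BufferedImage; the class and method names (JpegImageDictionarySketch, entriesFor) are hypothetical, and the key names are the standard ones for image XObjects in the PDF specification.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: lists the dictionary entries for a JPEG image stream (assumes 8-bit RGB).
class JpegImageDictionarySketch {

    static Map<String, String> entriesFor(int width, int height) {
        Map<String, String> d = new LinkedHashMap<>();
        d.put("Type", "/XObject");
        d.put("Subtype", "/Image");
        // Keep the custom marker so the image can still be found later.
        d.put("ITXT_SpecialId", "/123456789");
        // DCTDecode is the filter name for JPEG-compressed data.
        d.put("Filter", "/DCTDecode");
        d.put("Width", Integer.toString(width));
        d.put("Height", Integer.toString(height));
        d.put("BitsPerComponent", "8");
        d.put("ColorSpace", "/DeviceRGB");
        return d;
    }
}
```

Width and Height must be the pixel dimensions of the downsampled image, not of the original; a mismatch between these entries and the JPEG data is a typical way to end up with a damaged PDF.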

Working at the lowest level is fun and gives you a lot of power, but you really have to know what you’re doing, or you can seriously damage a PDF file. Because of the high complexity, some requirements are close to impossible. For instance, it’s very hard to replace a font. Let’s start by finding a way to list the fonts that are used in a PDF document.

Listing the fonts used

In listing 11.1, you created a PDF document demonstrating different font types. You can now use listing 16.3 to inspect this document and create a set containing all the fonts that were used. This time you won’t look at every object in the PDF, including the irrelevant ones, as you did in the previous listing. Instead, you’ll process the resources of every page in the document.

Listing 16.3 ListUsedFonts.java


In this listing, you check for a series of keys in the font descriptor dictionary to determine the font type. Table 16.1 explains which key corresponds with which font type.

Table 16.1 Stream references in the font descriptor

Key | Description
/FontFile | The value for this key (if present) is a stream containing a Type 1 font program.
/FontFile2 | The value for this key (if present) is a stream containing a TrueType font program.
/FontFile3 | The value for this key (if present) is a stream containing a font program whose format is specified by the /Subtype entry in the stream dictionary. It can be /Type1C, /CIDFontType0C, or /OpenType.
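The check described above comes down to testing which of three keys is present. The sketch below shows that decision on a plain set of key names rather than on an iText dictionary; the class and method names (FontProgramKindSketch, describe) are hypothetical, and FontFile, FontFile2, and FontFile3 are the key names defined for font descriptors in the PDF specification.

```java
import java.util.Set;

// Hypothetical helper: derives the kind of embedded font program from the keys
// found in a font descriptor dictionary.
class FontProgramKindSketch {

    static String describe(Set<String> descriptorKeys) {
        if (descriptorKeys.contains("FontFile")) {
            return "Type 1";
        }
        if (descriptorKeys.contains("FontFile2")) {
            return "TrueType";
        }
        if (descriptorKeys.contains("FontFile3")) {
            // The exact format is stored in the /Subtype entry of the referenced stream.
            return "see /Subtype of the stream (Type1C, CIDFontType0C, or OpenType)";
        }
        // None of the three keys: no font program is referenced from this descriptor.
        return "no embedded font program";
    }
}
```

A font without any font descriptor, such as a standard Type 1 font, never reaches this check and has to be reported separately.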

If you try this example on the file created in chapter 11, you’ll get the following result:

[Output listing not reproduced: the names of the fonts found in the document, with their font types]

The standard Helvetica Type 1 font isn’t embedded, and there’s no font descriptor. The same goes for the KozMinPro-Regular CJK font. Embedded Type 1 fonts are always fully embedded by iText. TrueType and OpenType fonts are subsetted unless you changed the default behavior with the setSubset() method. This was explained in chapter 11.

Observe that there are two entries of ArialMT. This is caused by the use of two variations of the Arial font: one using WinAnsi encoding and one using Identity-H. You can’t store both types of the font in the same font dictionary and stream; two different font objects with different names will be created. In this case, the font names are WTBBZY+ArialMT and XKYIQK+ArialMT. The six-letter code is chosen at random and will change every time you execute the example.
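If you want to group such entries by their underlying font, you can strip the subset tag from the name. The sketch below assumes the usual convention of six uppercase letters followed by a plus sign; the class and method names (SubsetPrefixSketch, isSubset, baseName) are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper: recognizes and removes the subset tag of a font name.
class SubsetPrefixSketch {
    // Six uppercase letters, a plus sign, then the actual font name.
    private static final Pattern SUBSET = Pattern.compile("^[A-Z]{6}\\+.+");

    static boolean isSubset(String fontName) {
        return SUBSET.matcher(fontName).matches();
    }

    // Returns the name without the seven-character tag, or the name unchanged if there is no tag.
    static String baseName(String fontName) {
        return isSubset(fontName) ? fontName.substring(7) : fontName;
    }
}
```

With this helper, baseName("XKYIQK+ArialMT") returns ArialMT, while a name such as Helvetica is returned unchanged. Two names that share a base name are still different font objects in the PDF, each with its own subset of glyphs.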

FAQ Can I combine different subsetted fonts into one font? The easy answer is "no." The not-so-easy answer is that merging subsets is really hard. It may require the page content of all the pages to be rewritten.

In the next example, you’ll replace a font that isn’t embedded with a fully embedded font. This will give you an idea of the difficulties you can expect if you ever try to combine different subsetted fonts into one.

Replacing a font

Figure 16.1 shows two PDF files that were created in the very same way, except for one difference: in the upper PDF, the font (Walt Disney Script v4.1) wasn’t embedded. It’s a font I downloaded from a site with plenty of free fonts. The font isn’t installed on my OS, so Adobe Reader doesn’t find it, and the words "iText in Action" are shown in Adobe Sans MM, which is quite different from the font shown in the PDF that has the font embedded.


Figure 16.1 Non-embedded versus embedded fonts

Suppose you have the upper PDF as well as the font file for the Walt Disney Script font. You could use listing 16.4 to embed that font after the fact.

Listing 16.4 EmbedFontPostFacto.java


In this listing, you’re adding the complete font file. You add the reference to the stream using the FONTFILE2 key because you know in advance that the font has TrueType outlines. That’s not the only assumption you make. You also assume that the metrics of the font that is used in the PDF correspond to the metrics of the new font you’re embedding.

When we talked about parsing PDFs, I explained that we could only make a fair attempt, but that the functionality could fail for PDFs using exotic encodings. Several warnings that were mentioned in section 15.3.1 also apply here. In real-world examples, replacing one font with another can be very difficult.

Now that you know what a PDF looks like on the inside, these examples complement your knowledge about images (discussed in chapter 10) and fonts (chapter 11). In the sections that follow, we’ll take a close look at annotations (chapter 7) that are associated with a PDF stream.
