Copying pages with PdfCopy (iText 5)

In the previous section, each PdfStamper object was associated with one and only one PdfReader object. As soon as you want to assemble pages from more than one document, you should use another PDF manipulation class: PdfCopy.

PdfCopy extends PdfWriter, and you’ll immediately recognize the five steps in the PDF creation process:

Listing 6.20 SelectPages.java


The main difference between these five steps and the ones from topic 1 is that you’re now using PdfCopy instead of PdfWriter in step 2. You can only add content using addPage(). Listing 6.20 is a variation on listing 6.11, with only one document being involved in this example. Let’s extend the example and concatenate two PDFs.
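The five steps with PdfCopy can be sketched as follows. This is a minimal illustration and not the book's listing: the class name SelectPagesSketch and its helper methods are hypothetical, iText 5 is assumed to be on the classpath, and a small sample PDF is built in memory so the example doesn't depend on files from earlier sections.

```java
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Hypothetical sketch class; not part of iText or of the book's examples.
public class SelectPagesSketch {

    /** Builds a throwaway PDF in memory; a stand-in for a real source file. */
    public static byte[] createSample(int pages) throws DocumentException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Document document = new Document();
        PdfWriter.getInstance(document, out);
        document.open();
        for (int i = 1; i <= pages; i++) {
            document.add(new Paragraph("Sample page " + i));
            document.newPage(); // ignored when the current page is still empty
        }
        document.close();
        return out.toByteArray();
    }

    /** Copies the pages in the given range (for example "4-8") to a new PDF. */
    public static byte[] copyPages(byte[] source, String range)
            throws IOException, DocumentException {
        PdfReader reader = new PdfReader(source);
        reader.selectPages(range);
        int n = reader.getNumberOfPages();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // step 1: create the Document
        Document document = new Document();
        // step 2: create a PdfCopy instead of a PdfWriter
        PdfCopy copy = new PdfCopy(document, out);
        // step 3: open the document
        document.open();
        // step 4: the only way to add content is addPage()
        for (int i = 1; i <= n; i++) {
            copy.addPage(copy.getImportedPage(reader, i));
        }
        // step 5: close the document
        document.close();
        reader.close();
        return out.toByteArray();
    }

    /** Returns the number of pages in a PDF held in memory. */
    public static int pageCount(byte[] pdf) throws IOException {
        PdfReader reader = new PdfReader(pdf);
        try {
            return reader.getNumberOfPages();
        } finally {
            reader.close();
        }
    }
}
```

Calling copyPages(createSample(10), "4-8") should produce a document that holds only the five selected pages.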


Concatenating and splitting PDF documents

In topic 2, we created a list with movies containing links to the Internet Movie Database (IMDB). We also created a historical overview of these movies with bookmarks that were generated automatically. Now let’s combine those two PDFs into one new document.

Listing 6.21 Concatenate.java


MovieLinks1.RESULT is a document with 34 pages. MovieHistory.RESULT has 26 pages. The page count of the concatenated file is 60.
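A concatenation along these lines might look like the sketch below. It is an illustration under stated assumptions rather than the book's Concatenate.java: the class ConcatenateSketch and its methods are hypothetical, iText 5 is assumed to be available, and two in-memory sample documents stand in for MovieLinks1.RESULT and MovieHistory.RESULT.

```java
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Hypothetical sketch class; not part of iText or of the book's examples.
public class ConcatenateSketch {

    /** Builds a throwaway PDF in memory; a stand-in for a real source file. */
    public static byte[] createSample(int pages) throws DocumentException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Document document = new Document();
        PdfWriter.getInstance(document, out);
        document.open();
        for (int i = 1; i <= pages; i++) {
            document.add(new Paragraph("Sample page " + i));
            document.newPage(); // ignored when the current page is still empty
        }
        document.close();
        return out.toByteArray();
    }

    /** Appends all pages of every source, in order, to one new PDF. */
    public static byte[] concatenate(byte[]... sources)
            throws IOException, DocumentException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Document document = new Document();
        PdfCopy copy = new PdfCopy(document, out);
        document.open();
        for (byte[] source : sources) {
            PdfReader reader = new PdfReader(source);
            int n = reader.getNumberOfPages();
            for (int page = 1; page <= n; page++) {
                copy.addPage(copy.getImportedPage(reader, page));
            }
            // The pages have been copied, so the reader can be released
            copy.freeReader(reader);
            reader.close();
        }
        document.close();
        return out.toByteArray();
    }

    /** Returns the number of pages in a PDF held in memory. */
    public static int pageCount(byte[] pdf) throws IOException {
        PdfReader reader = new PdfReader(pdf);
        try {
            return reader.getNumberOfPages();
        } finally {
            reader.close();
        }
    }
}
```

The page count of the result is simply the sum of the page counts of the sources, just as 34 and 26 pages add up to 60.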

FAQ After merging two PDFs, I’m seeing unnecessary white space. Why are there so many blank areas? Sometimes people expect that a document with one page concatenated with another document counting one page will result in a document with only one page. They expect that, when the pages of the original document are only half full, the new document will put both halves on one page. That’s not how PDF works! In PDF, you work with complete pages; it’s not possible to reflow the content on those pages.

There are two different versions of the addPage() method. You can add blank pages if you use a Rectangle and a rotation value as parameters, or you can add a PdfImportedPage obtained from the same PdfCopy instance using getImportedPage().

PRESERVATION OF INTERACTIVE FEATURES

You’ve used imported pages with PdfWriter in section 6.2 and with PdfStamper in section 6.3. You’ve scaled these imported pages, rotated them, and so on. All of this isn’t possible with the PdfImportedPage objects obtained from PdfCopy. You can only add them to a new document in their original form and size.

This limitation comes with a major advantage: most of the interactive features of the page are preserved. The links that are present in MovieLinks1.RESULT are lost if you import a page using PdfWriter or PdfStamper, but they still work if you import the same page with PdfCopy. Links are a special type of annotation, and we’ll discuss the different types of annotations in topic 7. For now, it’s sufficient to know that all annotations are kept with PdfCopy. The bookmarks of MovieHistory.RESULT, on the other hand, are lost.

We’ll find a way to work around this in the next topic.

ADDING CONTENT WITH PDFCOPY

In previous sections, I explained that PdfImportedPage is a read-only subclass of PdfTemplate. You can’t add any content to an imported page. This wasn’t a big deal when using imported pages with PdfWriter and PdfStamper because we could easily add content over or under the imported page. When using PdfCopy, it would be interesting if we could somehow add extra content too.

For instance, we might want to add a "page X of Y" footer that reflects the new page numbers.

Listing 6.22 ConcatenateStamp.java


With PdfCopy, we can add content to a PdfImportedPage using a PdfCopy.PageStamp object. Such an object can be obtained with the createPageStamp() method. This object has two methods for getting a direct content layer: getUnderContent() and getOverContent(). These methods return a PdfCopy.StampContent object. PdfCopy.StampContent extends PdfContentByte, and you can use it just as you’d use any other PdfContentByte object. In listing 6.22, you use it to add text at an absolute position. There’s one caveat: you mustn’t forget to invoke the alterContents() method.
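The PageStamp mechanism can be sketched as follows. This is not the book's ConcatenateStamp.java but an illustration in the same spirit: the class PageStampSketch and its methods are hypothetical, iText 5 is assumed to be on the classpath, the sources are in-memory sample files, and the footer coordinates assume the default A4 page size.

```java
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Element;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.Phrase;
import com.itextpdf.text.pdf.ColumnText;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch class; not part of iText or of the book's examples.
public class PageStampSketch {

    /** Builds a throwaway A4 PDF in memory; a stand-in for a real source file. */
    public static byte[] createSample(int pages) throws DocumentException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Document document = new Document();
        PdfWriter.getInstance(document, out);
        document.open();
        for (int i = 1; i <= pages; i++) {
            document.add(new Paragraph("Sample page " + i));
            document.newPage(); // ignored when the current page is still empty
        }
        document.close();
        return out.toByteArray();
    }

    /** Concatenates the sources and stamps "page X of Y" on every copied page. */
    public static byte[] concatenateWithFooter(byte[]... sources)
            throws IOException, DocumentException {
        // First pass: open all readers to learn the total page count (the "Y")
        List<PdfReader> readers = new ArrayList<PdfReader>();
        int total = 0;
        for (byte[] source : sources) {
            PdfReader reader = new PdfReader(source);
            readers.add(reader);
            total += reader.getNumberOfPages();
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Document document = new Document();
        PdfCopy copy = new PdfCopy(document, out);
        document.open();
        int pageNumber = 0;
        for (PdfReader reader : readers) {
            for (int i = 1; i <= reader.getNumberOfPages(); i++) {
                PdfImportedPage page = copy.getImportedPage(reader, i);
                PdfCopy.PageStamp stamp = copy.createPageStamp(page);
                // Centered near the bottom of an A4 page (595 pt wide)
                ColumnText.showTextAligned(stamp.getUnderContent(),
                        Element.ALIGN_CENTER,
                        new Phrase(String.format("page %d of %d", ++pageNumber, total)),
                        297.5f, 28, 0);
                // Without this call the stamped content never reaches the page
                stamp.alterContents();
                copy.addPage(page);
            }
        }
        document.close();
        for (PdfReader reader : readers) {
            reader.close();
        }
        return out.toByteArray();
    }
}
```

Note the order of the calls: the stamp is created and altered before the page is passed to addPage().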

SPLITTING A PDF

Using a PdfReader instance with PdfCopy doesn’t tamper with the reader the way PdfStamper does. You can reuse the same reader object for different PdfCopy objects. You can, for instance, construct one reader instance that reads the timetable PDF from topic 3, and create a new PdfCopy instance for every page to split the document into individual pages. In PDF terminology, this process is often called PDF bursting.

Listing 6.23 Burst.java


The original file representing the timetable contained 8 pages, and its size was about 15 KB. Bursting this file results in 8 different single-page documents, each with a file size of about 4 KB. 8 times 4 KB is 32 KB, which is more than the original 15 KB, because resources that were shared among pages in the original document are now copied into each separate document. So you might wonder what would happen if you concatenated PDF documents containing duplicate content.
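Bursting can be sketched as shown below. Again, this is an illustration and not the book's Burst.java: the class BurstSketch and its methods are hypothetical, iText 5 is assumed to be available, and the single-page results are kept in memory instead of being written to numbered files.

```java
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch class; not part of iText or of the book's examples.
public class BurstSketch {

    /** Builds a throwaway PDF in memory; a stand-in for the timetable PDF. */
    public static byte[] createSample(int pages) throws DocumentException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Document document = new Document();
        PdfWriter.getInstance(document, out);
        document.open();
        for (int i = 1; i <= pages; i++) {
            document.add(new Paragraph("Sample page " + i));
            document.newPage(); // ignored when the current page is still empty
        }
        document.close();
        return out.toByteArray();
    }

    /** Splits a PDF into single-page documents, reusing one reader for all of them. */
    public static List<byte[]> burst(byte[] source)
            throws IOException, DocumentException {
        PdfReader reader = new PdfReader(source);
        int n = reader.getNumberOfPages();
        List<byte[]> result = new ArrayList<byte[]>();
        for (int i = 1; i <= n; i++) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            // A new Document and a new PdfCopy for every page, but the same reader
            Document document = new Document();
            PdfCopy copy = new PdfCopy(document, out);
            document.open();
            copy.addPage(copy.getImportedPage(reader, i));
            document.close();
            result.add(out.toByteArray());
        }
        reader.close();
        return result;
    }

    /** Returns the number of pages in a PDF held in memory. */
    public static int pageCount(byte[] pdf) throws IOException {
        PdfReader reader = new PdfReader(pdf);
        try {
            return reader.getNumberOfPages();
        } finally {
            reader.close();
        }
    }
}
```

Every resulting document carries its own copy of the resources its page needs, which is why the sizes of the parts can add up to more than the size of the original.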

PdfCopy versus PdfSmartCopy

In section 6.3.5, you filled out and flattened the film data sheet form to create a separate file for movies made in the year 2007. Wouldn’t it be nice to create one single document that contains the data sheets for all the movies in the database?

Here you’ll fill the data sheet using PdfStamper. The resulting PDF files will be kept in memory just long enough to copy the page into a new document with PdfCopy.

Listing 6.24 DataSheets1.java


This example works perfectly, and at first sight you won’t find anything wrong with the resulting PDF when you open it in Adobe Reader. Only when you look at the file size will you have doubts. The original datasheet.pdf was less than 60 KB, but the resulting PDF is almost 5 MB.

This document has 120 pages that are almost identical. Only the specific movie information differs from page to page; the form template is repeated over and over again. But PdfCopy isn’t aware of that: it takes every page you add, including its resources, and copies everything to the writer. The code in listing 6.24 adds the same bits and bytes representing the original form to the same document 120 times. The resulting PDF is full of redundant information.

This can be avoided by using PdfSmartCopy instead of PdfCopy in step 2.

Listing 6.25 DataSheets2.java


Now the size of the resulting PDF file is only about 300 KB; that’s a much better result.

PdfSmartCopy extends PdfCopy. It inherits the same functionality, but it checks every page that’s added for redundant objects, so it can save plenty of disk space or bandwidth. There’s a price to pay for this extra "intelligence." PdfSmartCopy needs more memory and more time to concatenate files than PdfCopy. It will be up to you to decide what’s more important: file size and bandwidth, or memory and time. It will also depend on the nature of the documents you want to concatenate. If there is little resemblance between the pages, you might as well use PdfCopy. If different documents all have the same company logo on every page, you might want to consider using PdfSmartCopy to detect that logo.
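The difference between the two classes can be tried out with a sketch like the following. It is a simplified stand-in for the data sheet examples and not the book's code: the class SmartCopySketch and its methods are hypothetical, iText 5 is assumed to be on the classpath, and one in-memory template page is copied repeatedly, each time through a fresh PdfReader to mimic separately generated documents. The sizes you get depend entirely on the content, so no figures are claimed here.

```java
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfSmartCopy;
import com.itextpdf.text.pdf.PdfWriter;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Hypothetical sketch class; not part of iText or of the book's examples.
public class SmartCopySketch {

    /** Builds a one-page PDF in memory; a stand-in for a filled-out data sheet. */
    public static byte[] createTemplate() throws DocumentException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Document document = new Document();
        PdfWriter.getInstance(document, out);
        document.open();
        for (int i = 1; i <= 30; i++) {
            document.add(new Paragraph("Form label " + i + ": ____________"));
        }
        document.close();
        return out.toByteArray();
    }

    /** Copies page 1 of the template the given number of times. */
    public static byte[] repeat(byte[] template, int times, boolean smart)
            throws IOException, DocumentException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Document document = new Document();
        // The only difference between the two variants is the class used in step 2
        PdfCopy copy = smart
                ? new PdfSmartCopy(document, out)
                : new PdfCopy(document, out);
        document.open();
        for (int i = 0; i < times; i++) {
            // A fresh reader each time: PdfCopy can't tell the content is identical
            PdfReader reader = new PdfReader(template);
            copy.addPage(copy.getImportedPage(reader, 1));
            reader.close();
        }
        document.close();
        return out.toByteArray();
    }

    /** Returns the number of pages in a PDF held in memory. */
    public static int pageCount(byte[] pdf) throws IOException {
        PdfReader reader = new PdfReader(pdf);
        try {
            return reader.getNumberOfPages();
        } finally {
            reader.close();
        }
    }
}
```

Comparing repeat(template, 20, false).length with repeat(template, 20, true).length gives a feel for how much redundant content PdfSmartCopy detects in your own documents.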

In this example, you’ve concatenated flattened forms. But what happens if you concatenate the original forms? You don’t have to try this: it won’t work. Although PdfCopy (and PdfSmartCopy) preserve the annotations used to visualize a form, the form functionality will be broken if you try to concatenate two or more documents containing forms using PdfCopy. Your best chance to achieve this is to use PdfCopyFields.

Concatenating forms

Suppose you want to create a film data sheet form with two or more pages. This can easily be done with only four lines of code.

NOTE These examples will only work if your forms are created using AcroForm technology. It’s not possible to concatenate XFA forms using iText.

Listing 6.26 ConcatenateForms1.java


DATASHEET refers to the file datasheet.pdf. RESULT refers to a new form with two identical pages. This form probably won’t work the way you expect it to. You probably want to be able to enter the information about one movie on the first page, and about another movie on the second page. That’s impossible with this form. Although the field "title" is physically present in two different locations in the same document, there’s only one logical field with the name "title" in the form. This single field can only have one value. If you enter a title on page one, you’ll see the same title appear on page two. That may not be your intention; you probably want to create a form with two pages that can be used to enter information about two different movies.

Listing 6.27 ConcatenateForms2.java


This code snippet renames fields such as "title" to "title_1" (on page 1) and "title_2" (on page 2). Now there’s no longer a conflict between the field names on the different pages.
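Both variants, with and without renaming, can be sketched as follows. This is an illustration and not the book's ConcatenateForms code: the class FormsSketch and its methods are hypothetical, iText 5 is assumed to be available, and a tiny AcroForm with a single "title" field stands in for datasheet.pdf. As far as I know, later iText 5 releases mark PdfCopyFields as deprecated, so expect a compiler warning there.

```java
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.AcroFields;
import com.itextpdf.text.pdf.PdfCopyFields;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.text.pdf.TextField;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch class; not part of iText or of the book's examples.
public class FormsSketch {

    /** Builds a one-page AcroForm with a single text field named "title". */
    public static byte[] createForm() throws IOException, DocumentException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Document document = new Document();
        PdfWriter writer = PdfWriter.getInstance(document, out);
        document.open();
        document.add(new Paragraph("Film data sheet"));
        TextField title = new TextField(writer, new Rectangle(36, 700, 300, 730), "title");
        writer.addAnnotation(title.getTextField());
        document.close();
        return out.toByteArray();
    }

    /** Appends "_suffix" to every field name, for example "title" to "title_1". */
    public static byte[] renameFields(byte[] form, int suffix)
            throws IOException, DocumentException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        PdfReader reader = new PdfReader(form);
        PdfStamper stamper = new PdfStamper(reader, out);
        AcroFields fields = stamper.getAcroFields();
        // Copy the names first: renameField() changes the underlying map
        Set<String> names = new HashSet<String>(fields.getFields().keySet());
        for (String name : names) {
            fields.renameField(name, name + "_" + suffix);
        }
        stamper.close();
        reader.close();
        return out.toByteArray();
    }

    /** Concatenates forms with PdfCopyFields so that the result is one form. */
    public static byte[] concatenateForms(byte[]... forms)
            throws IOException, DocumentException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        PdfCopyFields copy = new PdfCopyFields(out);
        for (byte[] form : forms) {
            copy.addDocument(new PdfReader(form));
        }
        copy.close();
        return out.toByteArray();
    }

    /** Returns the sorted names of the logical fields in a form. */
    public static Set<String> fieldNames(byte[] pdf) throws IOException {
        PdfReader reader = new PdfReader(pdf);
        try {
            return new TreeSet<String>(reader.getAcroFields().getFields().keySet());
        } finally {
            reader.close();
        }
    }
}
```

Concatenating the same form twice should leave a single logical "title" field shared by both pages, whereas concatenating renameFields(form, 1) and renameFields(form, 2) should yield two independent fields.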

NOTE Don’t use PdfCopyFields to concatenate PDF documents without form fields. As opposed to concatenating documents using PdfCopy, PdfCopyFields needs to keep all the documents in memory to update the combined form. This can become problematic if you’re trying to concatenate large documents.

The PdfCopyFields example completes this topic on the different PDF manipulation classes. It’s high time for a summary with an overview that will help you pick the right class for the job.

Summary

In this topic, you’ve been introduced to the different PDF manipulation classes available in iText. You’ve used these classes to solve a series of common problems: N-up copying and tiling PDF documents, using a PDF as company stationery, adding headers, footers, watermarks, and "page X of Y" to existing documents, concatenating and splitting PDFs, and so on.

Every class had its specific specialties and limitations. Table 6.1 gives an overview of the classes that were discussed in this topic.

Note that entering different values on the different pages of a concatenated form is only possible if you use forms with different field names, or if you rename the fields.

Table 6.1 An overview of the PDF manipulation classes

iText class | Usage
PdfReader | Reads PDF files. You pass an instance of this class to one of the other PDF manipulation classes.
PdfImportedPage | A read-only subclass of PdfTemplate. Can be obtained from a PDF manipulation class using the method getImportedPage().
PdfWriter | Generates PDF documents from scratch. Can import pages from other PDF documents. The major downside is that all interactive features of the imported page (annotations, bookmarks, fields, and so forth) are lost in the process.
PdfStamper | Manipulates one (and only one) PDF document. Can be used to add content at absolute positions, to add extra pages, or to fill out fields. All interactive features are preserved, except when you explicitly remove them (for instance, by flattening a form).
PdfCopy | Copies pages from one or more existing PDF documents. Major downsides: PdfCopy doesn’t detect redundant content, and it fails when concatenating forms.
PdfSmartCopy | Copies pages from one or more existing PDF documents. PdfSmartCopy is able to detect redundant content, but it needs more memory and CPU than PdfCopy.
PdfCopyFields | Puts the fields of the different forms into one form. Can be used to avoid the problems encountered with form fields when concatenating forms using PdfCopy. Memory use can be an issue.

In the next topic, we’ll focus mainly on PdfStamper. I’ll introduce the concept of annotations, and you’ll learn that form fields are a special type of annotation. You’ll create a form from scratch using iText, and we’ll discuss the different types of interactive forms in PDF.
