Accessing an existing PDF with PdfReader (iText 5)

First, we’ll look at how you can retrieve information about the document you’re going to manipulate. For instance, how many pages does the original document have? Which page size is used? All of this is done with a PdfReader object.

Retrieving information about the document and its pages

In this first example, we’ll inspect some of the PDF documents you created in part 1. You can query a PdfReader instance to get the number of pages in the document, the rectangle defining the media box, the rotation of the page, and so on.

Listing 6.1 PageInformation.java

[Code listing not preserved in the source.]


The following output was obtained while inspecting some of the PDFs from topics 1 (❶ and ❷), 3 (❸), and 5 (❹).

[Output not preserved in the source.]

The most important PdfReader methods you’ll use in this topic are getNumberOfPages() and getPageSizeWithRotation(). The former is used to loop over all the pages of the existing document; the latter combines the results of getPageSize() and getPageRotation().

PAGE SIZE

The first two examples show the difference between creating a document with landscape orientation using

[Code snippet not preserved in the source.]

and a document created using

[Code snippet not preserved in the source.]

This difference will matter when you import a page or when you stamp extra content on the page. Observe that in example ❹ of the earlier output, the coordinates of the lower-left corner are different from (0,0) because that’s how I defined the media box in section 5.3.1.
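To make the difference concrete, here is a minimal plain-Java model of how a stored page size and a page rotation can combine into the size that is actually displayed. It doesn't call iText: the Box class and the withRotation() helper are hypothetical names, and the sketch assumes that every quarter turn swaps the x and y coordinates of the box. The two sample pages (a 612 x 792 box rotated by 90 degrees, and an unrotated 792 x 612 box) are illustrative values, not the ones from the lost snippets.

```java
/** Plain-Java model of combining a page box with a rotation; not iText code. */
class PageBoxSketch {

    /** A rectangle given by its lower-left and upper-right corners, in points. */
    static final class Box {
        final float llx, lly, urx, ury;

        Box(float llx, float lly, float urx, float ury) {
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }

        float width()  { return urx - llx; }
        float height() { return ury - lly; }
    }

    /** Applies a rotation (a multiple of 90 degrees) by swapping x and y once per quarter turn. */
    static Box withRotation(Box box, int rotationDegrees) {
        // Normalize to 0..359 first so that negative angles work too.
        int quarterTurns = ((rotationDegrees % 360) + 360) % 360 / 90;
        Box result = box;
        for (int i = 0; i < quarterTurns; i++) {
            result = new Box(result.lly, result.llx, result.ury, result.urx);
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample pages: a portrait box with a 90-degree rotation,
        // and a box that is already wider than it is tall with no rotation.
        Box rotatedPortrait = withRotation(new Box(0, 0, 612, 792), 90);
        Box plainLandscape = withRotation(new Box(0, 0, 792, 612), 0);
        System.out.println(rotatedPortrait.width() + " x " + rotatedPortrait.height());
        System.out.println(plainLandscape.width() + " x " + plainLandscape.height());
    }
}
```

In this model both pages end up 792 points wide and 612 points high, yet the first one stores a portrait box plus a rotation while the second stores a landscape box without rotation. That stored difference is what you have to account for when you position extra content on the page.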

BROKEN PDFS

When you open a corrupt PDF file in Adobe Reader, you can expect the message, "There was an error opening this document. The file is damaged and could not be repaired." PdfReader will also throw an exception when you try to read such a file. You can get an InvalidPdfException with the following message: "Rebuild failed: trailer not found; original message: PDF startxref not found." If that happens, iText can’t do anything about it: the file is damaged, and it can’t be repaired. You’ll have to contact the person who created the document, and ask him or her to create a version of the document that’s a valid PDF file.
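The "startxref not found" message refers to the keyword that sits near the end of every complete PDF file, just before the %%EOF marker. The following plain-Java sketch shows one way to pre-check a file for it without any PDF library. The class name, the hasStartxref() helper, and the 1024-byte search window are assumptions made for this illustration; finding the keyword doesn't prove that the file is valid, only that it wasn't cut off before the trailer.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Checks whether the tail of a file contains the "startxref" keyword; not iText code. */
class StartxrefCheckSketch {

    /** Assumed size of the window at the end of the file that is searched. */
    static final int TAIL_WINDOW = 1024;

    static boolean hasStartxref(Path file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            int tailLength = (int) Math.min(TAIL_WINDOW, raf.length());
            byte[] tail = new byte[tailLength];
            // Jump straight to the tail instead of reading the whole file.
            raf.seek(raf.length() - tailLength);
            raf.readFully(tail);
            // ISO-8859-1 maps every byte to one char, so binary data can't break the search.
            return new String(tail, StandardCharsets.ISO_8859_1).contains("startxref");
        }
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            boolean found = hasStartxref(Paths.get(name));
            System.out.println(name + ": " + (found ? "startxref found" : "startxref missing"));
        }
    }
}
```

A file that fails this check was most likely truncated during transfer or creation, which is the situation in which asking the producer for a new copy is the only option.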

In other cases, for example if a rogue application added unwanted carriage return characters, Adobe Reader will open the document and either ignore the fact that the PDF isn’t syntactically correct, or very briefly show the warning "The file is damaged but is being repaired." PdfReader can also recover from minor damage like this. No alert box is shown, because iText isn’t necessarily used in an environment with a GUI. You can use the method isRebuilt() to check whether or not a PDF needed repairing. You may also run into difficulties when you try to read encrypted PDF files.

ENCRYPTED PDFS

PDF files can be protected by two passwords: a user password and an owner password. If a PDF is protected with a user password, you’ll have to enter this password before you can open the document in Adobe Reader. If a document has an owner password, you must provide the password along with the constructor when creating a PdfReader instance, or a BadPasswordException will be thrown. More details about the different ways you can encrypt a PDF document, and about the different permissions you can set, will follow in topic 12.

Reducing the memory use of PdfReader

In most of this topic’s examples, you’ll create an instance of PdfReader using a String representing the path to the existing PDF file. Using this constructor will cause PdfReader to load plenty of PDF objects (from the file) into Java objects (in memory). This can be overkill for large documents, especially if you’re only interested in part of the document. If that’s the case, you can choose to read the PDF only partially.

PARTIAL READS

Suppose you have a document with 1000 pages. PdfReader will do a full read of these pages, even if you’re only interested in page 1. You can avoid this by using another constructor. You can compare the memory used by different PdfReader instances created to read the timetable PDF from topic 3:

Listing 6.2 MemoryInfo.java

[Code listing not preserved in the source.]

The file size of the timetable document from topic 3 is 15 KB. The memory used by a full read is about 35 KB, but a partial read needs only 4 KB. This is a significant difference. When reading a file partially, more memory will be used as soon as you start working with the reader object, but PdfReader won’t cache unnecessary objects. That also makes a huge difference, so if you’re dealing with large documents, consider using PdfReader with a RandomAccessFileOrArray parameter constructed with a path to a file.
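A comparison like this one can be made by looking at the used heap before and after an object is created. The sketch below shows that measuring technique in plain Java; it is not the original listing. The class name, the Measured holder, and the measure() helper are hypothetical, and the byte arrays in main() are stand-ins sized after the figures quoted above. With iText on the classpath, you would instead pass a factory that creates the reader, wrapping the checked IOException that its constructors declare.

```java
import java.util.function.Supplier;

/** Measures approximate heap growth caused by creating an object; not iText code. */
class MemoryDeltaSketch {

    /** Holds the created object together with the approximate heap growth in bytes. */
    static final class Measured<T> {
        final T value;
        final long bytes;

        Measured(T value, long bytes) {
            this.value = value;
            this.bytes = bytes;
        }
    }

    /** Returns the heap currently in use; the gc() call is only a hint to the JVM. */
    static long usedHeap() {
        Runtime runtime = Runtime.getRuntime();
        runtime.gc();
        return runtime.totalMemory() - runtime.freeMemory();
    }

    /** Creates an object with the factory and reports how much the used heap grew. */
    static <T> Measured<T> measure(Supplier<T> factory) {
        long before = usedHeap();
        T value = factory.get();
        long after = usedHeap();
        // Keeping a reference to the value prevents it from being collected too early.
        return new Measured<>(value, after - before);
    }

    public static void main(String[] args) {
        // Stand-ins for a full and a partial read; the sizes are illustrative only.
        Measured<byte[]> full = measure(() -> new byte[35 * 1024]);
        Measured<byte[]> partial = measure(() -> new byte[4 * 1024]);
        System.out.println("Stand-in for a full read:    " + full.bytes + " bytes");
        System.out.println("Stand-in for a partial read: " + partial.bytes + " bytes");
    }
}
```

Heap figures obtained this way are approximate: the garbage collector may run at any moment, so treat the numbers as an indication of scale and repeat the measurement a few times before drawing conclusions.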

NOTE In part 4, you’ll see how to manipulate a PDF at the lowest level. You’ll change PDF objects in PdfReader and then save the altered PDF. For this to work, the modified objects need to be cached. Depending on the changes you want to apply, using a PdfReader instance created with a RandomAccessFileOrArray may not be an option.

Another way to reduce the memory usage of PdfReader up front is to reduce the number of pages before you start working with it.

SELECTING PAGES

Next, you’ll read the timetable from topic 3 once again, but you’ll immediately tell PdfReader that you’re only interested in pages 4 to 8.

Listing 6.3 SelectPages.java

[Code listing not preserved in the source.]

The general syntax for the range that’s used in the selectPages() method looks like this:

[!][o][odd][e][even]start-end

You can have multiple ranges separated by commas, and the ! modifier removes pages from what is already selected. The range changes are incremental; numbers are added or deleted as the range appears. The start or the end can be omitted; if you omit both, you need at least o (odd; selects all odd pages) or e (even; selects all even pages).
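The rules above can be modeled in a few lines of plain Java. This is a simplified sketch for illustration, not iText's own parser: the class name and the expand() helper are hypothetical, reversed ranges (start greater than end) simply select nothing, and malformed input isn't handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of the page range syntax; not iText's own parser. */
class PageRangeSketch {

    /** Expands an expression such as "4-8" or "1-8,!e" into a list of page numbers. */
    static List<Integer> expand(String ranges, int pageCount) {
        List<Integer> selected = new ArrayList<>();
        for (String raw : ranges.split(",")) {
            String token = raw.trim().toLowerCase();
            if (token.isEmpty()) {
                continue;
            }
            // A leading ! removes pages from what is already selected.
            boolean remove = token.startsWith("!");
            if (remove) {
                token = token.substring(1);
            }
            // Optional odd/even qualifier, in long or short form.
            boolean oddOnly = false;
            boolean evenOnly = false;
            if (token.startsWith("odd")) {
                oddOnly = true;
                token = token.substring(3);
            } else if (token.startsWith("even")) {
                evenOnly = true;
                token = token.substring(4);
            } else if (token.startsWith("o")) {
                oddOnly = true;
                token = token.substring(1);
            } else if (token.startsWith("e")) {
                evenOnly = true;
                token = token.substring(1);
            }
            // An omitted start defaults to page 1, an omitted end to the last page.
            int start = 1;
            int end = pageCount;
            int dash = token.indexOf('-');
            if (dash >= 0) {
                if (dash > 0) {
                    start = Integer.parseInt(token.substring(0, dash));
                }
                if (dash < token.length() - 1) {
                    end = Integer.parseInt(token.substring(dash + 1));
                }
            } else if (!token.isEmpty()) {
                start = end = Integer.parseInt(token);
            }
            List<Integer> pages = new ArrayList<>();
            for (int page = Math.max(start, 1); page <= Math.min(end, pageCount); page++) {
                boolean odd = page % 2 == 1;
                if ((oddOnly && !odd) || (evenOnly && odd)) {
                    continue;
                }
                pages.add(page);
            }
            // Ranges are applied incrementally, in the order in which they appear.
            if (remove) {
                selected.removeAll(pages);
            } else {
                selected.addAll(pages);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        System.out.println(expand("4-8", 8));    // prints [4, 5, 6, 7, 8]
        System.out.println(expand("1-8,!e", 8)); // prints [1, 3, 5, 7]
        System.out.println(expand("o", 8));      // prints [1, 3, 5, 7]
    }
}
```

In this model, the position of a page in the returned list, counted from 1, corresponds to its page number in the reduced document: for "4-8" the original page 4 comes first.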

If you ask the reader object for the number of pages before selectPages() in listing 6.3, it will tell you that the document has 8 pages. If you do the same after making the page selection, it will tell you that there are only 5 pages: pages 4, 5, 6, 7, and 8. The old page 4 will be the new page 1. Be careful not to try getting information about pages that are outside the new range. Don’t add the following line to listing 6.3:

[Code line not preserved in the source: a call that asks the reader for information about page 6.]

This line will throw a NullPointerException because there are no longer 6 pages in the reader object.

Now that you’ve had a short introduction to PdfReader, you’re ready to start manipulating existing PDF documents.
