Parsing PDFs Part 1 (iText 5)

The first edition of iText in Action had a section named "Why iText doesn’t do text extraction." It was preceded by an example that demonstrated how to retrieve the content stream of a page using the getPageContent() method, just like you did in section 14.1. The simple Hello World example from chapter 1 resulted in the following stream:

[Figure: content stream of the Hello World page]

The PDF string (Hello World!) followed by the text operator Tj is visible in clear text. Surely it must be possible to write some code to extract that string? When the first edition was written, the only way to achieve this was by using the PRTokeniser class (mind the British s in the name, instead of the American z).

In this section, we’ll learn how iText has evolved, and find out how to parse the content of PDF content streams to retrieve text and images.

Examining the content stream with PRTokeniser

With PRTokeniser, you can split a PDF content stream into its most elementary parts. Each part has a specific type. The possible types, shown in table 15.2, are enumerated in the enum named TokenType.


The TaggedPdfReaderTool class fetches the /StructTreeRoot object from the catalog. Then it recursively inspects all the children of the tree:

[Figure not available]

Table 15.2 Overview of the token types

TokenType   | Symbol | Description
NUMBER      | n/a    | The current token is a number.
STRING      | ( )    | The current token is a string.
NAME        | /      | The current token is a name.
COMMENT     | %      | The current token is a comment.
START_ARRAY | [      | The current token starts an array.
END_ARRAY   | ]      | The current token ends an array.
START_DIC   | <<     | The current token starts a dictionary.
END_DIC     | >>     | The current token ends a dictionary.
REF         | R      | The current token ends a reference.
OTHER       | n/a    | The current token is probably an operator.
ENDOFFILE   | n/a    | There are no more tokens.

This listing shows the simplest PDF parser one could write. It gets the page content of page 1, passes the content to a PRTokeniser object, and writes all the tokens with TokenType.STRING to a PrintWriter.

Listing 15.20 ParsingHelloWorld.java
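
To illustrate the idea without depending on iText, here is a minimal sketch (not the book's listing, and not how PRTokeniser is implemented) that walks through a content stream and keeps only the literal strings, which is roughly what filtering on TokenType.STRING achieves. The class name LiteralStringScanner and the sample stream are assumptions made for this illustration; escape handling is deliberately simplified, and hex strings and comments are ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the STRING-token filter; not iText's PRTokeniser. */
class LiteralStringScanner {

    /** Returns the literal strings, that is the (...) tokens, in stream order. */
    static List<String> extractStrings(String content) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < content.length()) {
            if (content.charAt(i) != '(') {
                i++;                       // not the start of a literal string
                continue;
            }
            StringBuilder sb = new StringBuilder();
            int depth = 1;                 // PDF allows balanced nested parentheses
            i++;                           // skip the opening '('
            while (i < content.length() && depth > 0) {
                char c = content.charAt(i++);
                if (c == '\\' && i < content.length()) {
                    // Simplification: keep the escaped character as-is.
                    // Real PDF escapes such as \n or octal codes need more work.
                    sb.append(content.charAt(i++));
                } else if (c == '(') {
                    depth++;
                    sb.append(c);
                } else if (c == ')') {
                    depth--;
                    if (depth > 0) sb.append(c);
                } else {
                    sb.append(c);
                }
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample stream, similar in spirit to a Hello World page.
        String stream = "BT\n/F1 12 Tf\n36 788 Td\n(Hello World!) Tj\nET";
        System.out.println(extractStrings(stream));   // prints [Hello World!]
    }
}
```

Like the simple parser described above, this sketch reports strings in the order in which they occur in the stream and knows nothing about fonts, positions, or external objects.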


If you try this example with your first Hello World example, you’ll have a very good result:

[Figure: output of the simple parser for the first Hello World example]

But as soon as you have more complex PDF files, this simple parser won’t work. Listing 15.21 creates a PDF file with the text "Hello World", but those words are added in different parts: first "ld", then "Wor", then "llo", and finally "He". Because of the choice of coordinates, the text reads "Hello World" when opened in a PDF viewer. It also adds the text "Hello People" as a form XObject.

PRTokeniser offers the strings in the order they appear in the content stream, not in the order they are shown on the screen. Moreover, the text "Hello People" is missing because it’s not part of the content stream. It’s inside an external object that is referred to from the page dictionary.
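
The difference between stream order and display order can be sketched with a few lines of plain Java. The Chunk class and the coordinates below are hypothetical; they only mimic the situation in which "ld", "Wor", "llo", and "He" are written in that order but positioned from right to left on the same line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ReadingOrderSketch {

    /** A text snippet plus the position where it is drawn (hypothetical values). */
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** The four snippets in the order in which they occur in the content stream. */
    static List<Chunk> sample() {
        return Arrays.asList(
            new Chunk("ld", 88f, 788f),
            new Chunk("Wor", 62f, 788f),
            new Chunk("llo", 46f, 788f),
            new Chunk("He", 36f, 788f));
    }

    /** Concatenates the snippets exactly as they were encountered. */
    static String inStreamOrder(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            sb.append(c.text);
        }
        return sb.toString();
    }

    /** Concatenates the snippets top to bottom, then left to right. */
    static String inReadingOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // By default y grows upward in PDF, so a larger y is higher on the page.
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                              .thenComparingDouble(c -> c.x));
        return inStreamOrder(sorted);
    }

    public static void main(String[] args) {
        System.out.println(inStreamOrder(sample()));    // prints ldWorlloHe
        System.out.println(inReadingOrder(sample()));   // prints HelloWorld
    }
}
```

Sorting by position restores the letters, but the space between the two words is still missing: it exists only as a gap between coordinates, not as a character in any of the strings.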

Even if all the characters are in the right order, there may be kerning information between substrings, adjusting the space between letters so they look better (for instance between the two letter ls of the word "Hello"). However, the spacing can also be used instead of a whitespace character. That’s one aspect that should be considered and that makes it difficult to extract text from a content stream.
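
A small sketch may help to show why such spacing is hard to interpret. In an array passed to the TJ operator, strings alternate with numbers expressed in thousandths of a text space unit; each number is subtracted from the current position, so a strongly negative value opens a gap. The threshold of 200 used below, the sample array, and the class name KerningGapSketch are assumptions made for this illustration; a real extractor would compare the gap with the width of a space in the current font.

```java
import java.util.Arrays;
import java.util.List;

class KerningGapSketch {

    /**
     * Joins the strings of a TJ-style array. A number at or below
     * -spaceThreshold is treated as a word gap; smaller adjustments are
     * treated as kerning and ignored.
     */
    static String join(List<?> tjArray, double spaceThreshold) {
        StringBuilder sb = new StringBuilder();
        for (Object item : tjArray) {
            if (item instanceof String) {
                sb.append((String) item);
            } else if (item instanceof Number) {
                double adjustment = ((Number) item).doubleValue();
                if (adjustment <= -spaceThreshold) {
                    sb.append(' ');        // large gap: assume a word boundary
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed sample, equivalent to [(He) 15 (l) -10 (lo) -280 (W) 60 (orld)] TJ
        List<?> tj = Arrays.asList("He", 15, "l", -10, "lo", -280, "W", 60, "orld");
        System.out.println(join(tj, 200));   // prints Hello World
    }
}
```

Any fixed threshold is a guess: set it too low and kerned letters are split into separate words, set it too high and real word gaps are swallowed.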

Another aspect is the encoding. It’s possible for a PDF to have a font containing characters that appear in a content stream as a, b, c, and so on, but for which the shapes drawn in the PDF file show completely different glyphs, such as a, P, y, and so on. An application can create a different encoding for each specific PDF document, for example in an attempt to obfuscate. More likely, the PDF-generating software does this deliberately, such as when a font with many characters is used but all the text can be shown using only 256 different glyphs. In this case, the software picks character names at random according to the glyphs that are used. Another possibility is that the content stream consists of raw glyph indexes; you then have to write code that goes through the character mappings and finds the right letters.
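
The following sketch shows the consequence for text extraction: the codes found in the content stream are meaningless until they are looked up in a mapping that belongs to the font. The mapping below is hypothetical and reuses the a, b, c to a, P, y example; in a real PDF this information has to be derived from the font's encoding or character mappings.

```java
import java.util.HashMap;
import java.util.Map;

class CustomEncodingSketch {

    /** Hypothetical document-specific mapping from stream codes to the text that is shown. */
    static Map<Character, String> sampleMapping() {
        Map<Character, String> toUnicode = new HashMap<>();
        toUnicode.put('a', "a");
        toUnicode.put('b', "P");
        toUnicode.put('c', "y");
        return toUnicode;
    }

    /** Translates the codes of a content-stream string into readable text. */
    static String decode(String codes, Map<Character, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (char code : codes.toCharArray()) {
            // Unmapped codes become U+FFFD instead of being guessed.
            sb.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("abc", sampleMapping()));   // prints aPy
    }
}
```

A parser that prints the raw string would report "abc" here, although a viewer displays "aPy".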

Listing 15.21 ParsingHelloWorld.java

When you use the simple parser from listing 15.20, you’ll get the following output:

[Figure: output of the simple parser for the PDF from listing 15.21]

You’ll also encounter PDF files that were created from scanned images. The content stream of each of the pages in such a document contains a reference to an image XObject. There will be no PDF strings in the stream. In the previous chapter, you created PDF documents with the glyphs drawn by the Java TextLayout object, and you wouldn’t find any strings in this case either. Optical character recognition (OCR) will be your only recourse if you want to extract text from such a PDF document.

The section about text extraction in the first edition was followed by a section entitled "Why you shouldn’t use PDF as a format for editing." Again, an example and a list of reasons were given for why it’s extremely difficult and not very wise to edit a content stream. But that was then, and this is now. It’s still true that you shouldn’t edit a PDF, but with regard to text extraction, we’ve welcomed a new iText developer, Kevin Day, who has contributed a package (com.itextpdf.text.pdf.parser) containing classes that are able to parse and interpret PDF content.

WARNING The API of this package is subject to change, because other developers—including myself—are still experimenting with it, adding new features, and fixing bugs.

Given the different obstacles I’ve outlined, not every PDF document that can be found in the wild can be parsed effectively, but the functionality does make a good effort at trying to find words and sentences, even if they’re drawn on a page in random order, as was the case with our second "Hello World" example.

Processing content streams with PdfContentStreamProcessor

If you look at the com.itextpdf.text.pdf.parser package, you’ll find utility classes such as ContentByteUtils with static methods to extract byte arrays from a PDF file, and tools such as PdfContentReaderTool with methods to create a String representation of objects and to output lists of objects and contents. For instance,

[Figure: code snippet using PdfContentReaderTool]

This code snippet will write all the information that is needed to extract the content of a page, including the extracted text.

The next listing gives an idea of what to expect. Note that the content streams are replaced by ellipses (…).

Listing 15.22 calendar_info.txt generated with InspectPageContent.java

This is the first step toward text extraction: collecting all the resources. Now you need to process the information. Listing 15.23 shows a new version of parsePdf() from listing 15.20. The PRTokeniser class is still used, but its complexity is hidden by the PdfContentStreamProcessor class.

Listing 15.23 ParsingHelloWorld.java

The output of this listing depends on the listener. This is an instance of the RenderListener interface to which the processor passes information about the text and images in the page. The following listing is an experimental implementation that will help you understand the mechanism.

Listing 15.24 MyTextRenderListener.java

You’re not concerned with images yet. Angle brackets are placed at the start and end of text blocks, and every text segment is enclosed in angle brackets. If you use this method on the PDF created with listing 15.21, you’ll get the following results:

[Figure: output of MyTextRenderListener for the PDF from listing 15.21]

The words "Hello World" are still mangled, but the text "Hello People" is picked up correctly.
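
The callback mechanism itself can be sketched in plain Java. Everything in the following example is hypothetical: TextListener is a simplified stand-in for the listener idea, and the toy processor only understands the BT, ET, and Tj operators in a pre-split token list. It is not the implementation of PdfContentStreamProcessor or of listing 15.24; it only shows how a processor can drive a listener that wraps text blocks and text segments in angle brackets.

```java
import java.util.Arrays;
import java.util.List;

class RenderListenerSketch {

    /** Hypothetical callback interface; the real one has a different signature. */
    interface TextListener {
        void beginTextBlock();
        void renderText(String text);
        void endTextBlock();
    }

    /** Encloses every text block and every text segment in angle brackets. */
    static class BracketListener implements TextListener {
        private final StringBuilder out = new StringBuilder();

        public void beginTextBlock() { out.append('<'); }
        public void renderText(String text) { out.append('<').append(text).append('>'); }
        public void endTextBlock() { out.append('>').append('\n'); }

        String result() { return out.toString(); }
    }

    /** Toy processor: reacts to BT, ET, and Tj and skips every other operator. */
    static void process(List<String> tokens, TextListener listener) {
        String lastOperand = null;
        for (String token : tokens) {
            switch (token) {
                case "BT":
                    listener.beginTextBlock();
                    break;
                case "ET":
                    listener.endTextBlock();
                    break;
                case "Tj":
                    if (lastOperand != null) listener.renderText(lastOperand);
                    lastOperand = null;
                    break;
                default:
                    lastOperand = token;   // remember the most recent operand
            }
        }
    }

    /** Runs the toy processor with a BracketListener and returns what it collected. */
    static String run(List<String> tokens) {
        BracketListener listener = new BracketListener();
        process(tokens, listener);
        return listener.result();
    }

    public static void main(String[] args) {
        List<String> tokens =
            Arrays.asList("BT", "/F1", "12", "Tf", "ld", "Tj", "Wor", "Tj", "ET");
        System.out.print(run(tokens));   // prints <<ld><Wor>>
    }
}
```

Because the processor only reports what it encounters, any improvement in the extracted text has to come from the listener, which is free to collect, reorder, or merge the segments it receives.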

In listing 15.24, you use the class TextRenderInfo to get a chunk of text with the getText() method, but the render info class also provides methods to get LineSegment objects containing information about the location of the text on the page, to get the font that was used, and so on. With this information, you could write a RenderListener implementation that returns a result that is much better than the output provided by MyTextRenderListener.

Fortunately, this has already been done for you in the form of text-extraction strategies. The TextExtractionStrategy interface extends RenderListener, adding a getResultantText() method. The different implementations of this interface, in combination with the PdfReaderContentParser or PdfTextExtractor, dramatically reduce the number of code lines needed to extract text.
