Parsing PDFs Part 2 (iText 5)

Extracting text with PdfReaderContentParser and PdfTextExtractor

Figure 15.7 shows two pages: the preface from the first edition of iText in Action. The PDF was extracted from the eBook version of the book. It’s a traditional PDF without structure.

Let’s try to convert the content from these two pages to a plain text file.

Figure 15.7 Preface from the first edition

SIMPLE TEXT EXTRACTION

The next example shows how to use SimpleTextExtractionStrategy in combination with PdfReaderContentParser to create a plain text file with the content of the preface.

Listing 15.25 ExtractPageContent.java

The PdfReaderContentParser uses the PdfContentStreamProcessor internally. The processContent() method performs the same actions you did in listing 15.23, saving you a handful of lines of code.

The SimpleTextExtractionStrategy class is a special implementation of the RenderListener. It stores all the TextRenderInfo snippets in the order they occur in the stream, but it’s intelligent enough to detect which snippets should be combined into one word, and which snippets should be separated with a space character.

This TextExtractionStrategy object, containing all the text of a specific page, is returned by the processContent() method. When you get the resulting text of the first page of the Preface, it starts like this:

xix

preface

I have lost count of the number of PCs I have worn out since I started my career as a software developer—but I will never forget my first computer.

I was only 12 years old when I started programming in BASIC. I had to learn English at the same time because there simply weren’t any books on computer programming in my mother tongue (Dutch). This was in 1982. Windows didn’t exist yet; I worked on a TI99/4A home computer from Texas Instruments. When I told my friends at school about it, they looked at me as if I had just been beamed down from the Starship Enterprise.

The first text element in the content stream is "xix", the Roman page number that appears at the bottom of the page. The fact that the rest of the text reads correctly is a coincidence: an application that creates a PDF isn’t required to write the paragraphs to the content stream in reading order.
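The behavior described in this section can be pictured with a small model that uses only the JDK. This is a minimal sketch, not iText's code: the Snippet class, the SPACE_GAP threshold, and the coordinates are assumptions made up for illustration. It joins snippets in the order they arrive, glues together snippets that touch, and inserts a space when there is a visible gap.

```java
import java.util.Arrays;
import java.util.List;

// Conceptual model only; this is NOT iText's SimpleTextExtractionStrategy code.
class StreamOrderJoinSketch {

    // Hypothetical stand-in for a text snippet reported by the parser.
    static final class Snippet {
        final String text;
        final float startX;    // where the snippet starts on its baseline
        final float endX;      // where the snippet ends on its baseline
        final float baselineY; // vertical position of the baseline

        Snippet(String text, float startX, float endX, float baselineY) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.baselineY = baselineY;
        }
    }

    // Assumed threshold: a horizontal gap wider than this counts as a space.
    static final float SPACE_GAP = 2f;

    // Joins the snippets in arrival (stream) order; nothing is sorted.
    static String join(List<Snippet> snippets) {
        StringBuilder sb = new StringBuilder();
        Snippet prev = null;
        for (Snippet s : snippets) {
            if (prev != null) {
                if (s.baselineY != prev.baselineY) {
                    sb.append('\n');            // different baseline: new line
                } else if (s.startX - prev.endX > SPACE_GAP) {
                    sb.append(' ');             // visible gap: separate words
                }                               // otherwise: same word, no separator
            }
            sb.append(s.text);
            prev = s;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The page number comes first because it is first in the stream.
        List<Snippet> page = Arrays.asList(
                new Snippet("xix", 300f, 312f, 40f),
                new Snippet("pre", 72f, 90f, 700f),
                new Snippet("face", 90f, 112f, 700f),
                new Snippet("I", 72f, 76f, 660f),
                new Snippet("have", 80f, 102f, 660f));
        System.out.println(join(page));
    }
}
```

In this model the page number stays at the top of the output, just as in the extracted preface, because the order of the snippets is never changed.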

LOCATION-BASED TEXT EXTRACTION

Let’s change one line in listing 15.25:

Listing 15.26 ExtractPageContentSorted1.java

The LocationTextExtractionStrategy class will accept all the TextRenderInfo objects from the processor, just like the simple text-extraction strategy, but it will sort all the snippets of text based on their position on the page, before creating the resultant text.

The next example makes this code even more compact by using the PdfTextExtractor class.

Listing 15.27 ExtractPageContentSorted2.java

Listings 15.26 and 15.27 have the same output. If you look at the resulting text file, you’ll see that it starts with the word "preface", and that the page number has moved to the middle of the text.

The strings "xix" and "xx" are page numbers; "PREFACE" is a running header. In tagged documents, these elements would have been referred to as artifacts. Screen readers would have ignored these snippets of text because they are not part of the actual content. When parsing our preface, it would be nice to add a filter that removes the page numbers and headers from the resulting text.
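Sorting snippets by their position on the page can be sketched with plain JDK classes. The Chunk class and the coordinates below are hypothetical, and the real LocationTextExtractionStrategy is more elaborate; the sketch only assumes that a larger y value means a higher position on the page, as in the default PDF coordinate system.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Conceptual model only; this is NOT iText's LocationTextExtractionStrategy code.
class LocationSortSketch {

    // Hypothetical stand-in for a snippet of text and its start position.
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Returns the texts ordered top to bottom, then left to right.
    static List<String> readingOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort((a, b) -> {
            int byLine = Float.compare(b.y, a.y);   // larger y first (higher on the page)
            return byLine != 0 ? byLine : Float.compare(a.x, b.x);
        });
        List<String> texts = new ArrayList<>();
        for (Chunk c : sorted) {
            texts.add(c.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        // Stream order: the page number is written before the body text.
        List<Chunk> page = Arrays.asList(
                new Chunk("xix", 300f, 40f),
                new Chunk("preface", 72f, 700f),
                new Chunk("I have lost count", 72f, 660f));
        System.out.println(readingOrder(page));
    }
}
```

Within a single page, the page number now ends up after the body text. When the text of several pages is written to one file, that puts it in the middle of the output.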

USING RENDER FILTERS

The special FilteredTextRenderListener text-extraction strategy combines a normal TextExtractionStrategy implementation with one or more render filters. The next listing uses a subclass of the abstract RenderFilter class, named RegionTextRenderFilter.

Listing 15.28 ExtractPageContentArea.java

In this listing, you create a Rectangle whose dimensions are chosen in such a way that the page numbers and the running headers are outside the rectangle. You then use this rectangle to create a RegionTextRenderFilter. This filter will examine all the text and images that are processed and ignore everything that falls outside the chosen area.

NOTE The rect object is currently not an instance of com.itextpdf.text.Rectangle; it’s a java.awt.Rectangle (internally, a java.awt.geom.Rectangle2D object is used). This may change in the future; the API of the PDF parsing functionality hasn’t been finalized yet.

The filter is combined with a text-extraction strategy in a FilteredTextRenderListener object, and from there on the code is similar to the code in listing 15.27, with the exception that you now pass a custom strategy as a parameter for the getTextFromPage() method. The result is the preface text without page numbers and running headers.
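The idea behind a region filter can be shown with java.awt.geom.Rectangle2D alone. This is a simplified sketch under stated assumptions: the Chunk class, the rectangle dimensions, and the coordinates are hypothetical, and only the start point of each snippet is tested, which may differ from what RegionTextRenderFilter checks.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Conceptual model only; this is NOT iText's RegionTextRenderFilter code.
class RegionFilterSketch {

    // Hypothetical stand-in for a snippet of text and its start position.
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Keeps only the snippets whose start point lies inside the region.
    static List<String> keepInside(List<Chunk> chunks, Rectangle2D region) {
        List<String> kept = new ArrayList<>();
        for (Chunk c : chunks) {
            if (region.contains(c.x, c.y)) {
                kept.add(c.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed area: x from 70 to 490, y from 80 to 580 (header and footer fall outside).
        Rectangle2D region = new Rectangle2D.Float(70f, 80f, 420f, 500f);
        List<Chunk> page = Arrays.asList(
                new Chunk("PREFACE", 400f, 620f),           // running header
                new Chunk("I have lost count", 72f, 500f),  // body text
                new Chunk("my first computer.", 72f, 480f), // body text
                new Chunk("xx", 72f, 40f));                 // page number
        System.out.println(keepInside(page, region));
    }
}
```

Choosing the rectangle is the only page-specific step: it has to be large enough to cover the body text and small enough to exclude the running header and the page number.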

Finding text margins

The goal of parsing the content of a page isn’t always to retrieve text. A frequently asked question involves finding the position where the last line of text ends on a page, so that extra text can be added. This can be done using a special RenderListener implementation.

Figure 15.8 shows the same pages as figure 15.7, but with bounding rectangles for the text added.

The positions needed to draw these rectangles were retrieved using a TextMarginFinder:

Listing 15.29 ShowTextMargins.java

Figure 15.8 Finding the location of text in existing PDFs

Note that only text is taken into account. Graphics, such as the line that is drawn under the title "preface," are ignored by the parser in its current version. The content stream processor only returns objects of type TextRenderInfo and ImageRenderInfo.
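One way to picture what a margin finder computes is as the union of the bounding boxes of all text snippets on a page. The sketch below assumes exactly that and uses only java.awt.geom.Rectangle2D; the boxes and the class name are made up for illustration, and this isn't presented as the TextMarginFinder source.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

// Conceptual model only; the boxes below are hypothetical text bounding boxes.
class TextBoundsSketch {

    // Returns the smallest rectangle that contains every box, or null if there is no text.
    static Rectangle2D boundsOf(List<? extends Rectangle2D> textBoxes) {
        Rectangle2D bounds = null;
        for (Rectangle2D box : textBoxes) {
            if (bounds == null) {
                bounds = (Rectangle2D) box.clone(); // copy, so the input isn't modified
            } else {
                bounds.add(box);                    // grow to the union of both rectangles
            }
        }
        return bounds;
    }

    public static void main(String[] args) {
        List<Rectangle2D> boxes = Arrays.<Rectangle2D>asList(
                new Rectangle2D.Float(72f, 690f, 100f, 20f),  // title near the top
                new Rectangle2D.Float(72f, 120f, 380f, 12f)); // last line of text
        Rectangle2D bounds = boundsOf(boxes);
        // In PDF coordinates the lower edge is the smallest y value.
        System.out.println("Text ends at y = " + bounds.getMinY());
    }
}
```

With the default PDF coordinate system, the lower edge of the resulting rectangle marks where the last line of text ends, so extra content could be added below that y value.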

Extracting images

Just like TextRenderInfo gives you information about a snippet of text, ImageRenderInfo will give you info about an image: the position of the image and an instance of the PdfImageObject class that encapsulates the image XObject dictionary and the raw image bytes. The next listing processes all the pages of a PDF document and uses a custom ImageRenderListener to extract the images to a file.

Listing 15.30 ExtractImages.java

The following example shows a special implementation of the RenderListener to extract images. The methods you implemented in the custom text render listener in listing 15.24 are left empty. In this case, you’re not interested in the text; only the renderImage() method is implemented.

Listing 15.31 MyImageRenderListener.java

In listing 15.31, the filename that is chosen for each image has a reference to the indirect object number of the image stream. The bytes of image streams with the filter /DCTDecode or /JPXDecode will be written to a file as is, resulting in valid JPEG and JPEG2000 files. For the other types of images, you also need to inspect the stream dictionary for values such as the number of bits per component, the color space, the width and the height, and so on. The getBufferedImage() method will attempt to do this in your place, and return an instance of java.awt.image.BufferedImage. But when you try this example on your own system, you’ll notice that not all images are extracted.

Please don’t report this as a bug. Not all types of images are supported yet. This is only a preview of functionality that was added to iText recently. Just like text parsing, image extraction works on a best-effort basis; when more image types will be supported depends on code contributors and paying iText users.
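The decision described above, whether the raw bytes of an image stream can be saved as is, can be sketched as a lookup on the stream's filter name. The class, the method names, and the "Img" filename pattern are hypothetical; only the filter names /DCTDecode and /JPXDecode come from the PDF specification.

```java
// Conceptual model only; names other than the two PDF filter names are hypothetical.
class ImageFilterSketch {

    // Returns a file extension if the raw stream bytes form a valid image file,
    // or null if the image would have to be decoded first.
    static String extensionFor(String filterName) {
        switch (filterName) {
            case "/DCTDecode":
                return "jpg";  // raw bytes are a JPEG file
            case "/JPXDecode":
                return "jp2";  // raw bytes are a JPEG2000 file
            default:
                return null;   // needs bits per component, color space, and so on
        }
    }

    // Builds a filename that refers to the indirect object number of the image stream.
    static String fileNameFor(int objectNumber, String filterName) {
        String extension = extensionFor(filterName);
        return extension == null ? null : "Img" + objectNumber + "." + extension;
    }

    public static void main(String[] args) {
        System.out.println(fileNameFor(12, "/DCTDecode"));
        System.out.println(fileNameFor(7, "/FlateDecode"));
    }
}
```

A null result stands for the images that can't simply be copied to disk; those are the cases where a decoding step such as getBufferedImage() would be needed.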

Summary

This chapter was like a sequel to chapter 14. We continued talking about the content stream of a page, but in the first two sections we added structures that made part of the content optional or that added extra information to the content, like extra properties that belong to objects on the screen, information that improves the accessibility of the document, and structures that allow you to discover elements from the original source, such as paragraphs, lists, and tables.

To demonstrate the power of these structure elements, you’ve seen how to convert an existing PDF document to XML. This only works for PDF documents that are tagged. Other PDF documents can’t be converted to XML, but you can parse them and write the output to a plain text file. We’ve discussed the different strategies that are at play and looked at how you can extract text from a PDF, find margins, and even extract images.

In the next chapter, we’ll start by looking at image and other streams. We won’t return to content streams, but we’ll look at font streams and embedded files, and we’ll even look at how to integrate a Flash application into a PDF document.
