Adding metadata (iText 5)

There are two ways to store metadata inside a PDF document. The original way was to store a limited number of keys and values in a special dictionary; a newer way is to embed the data as an XML stream inside the PDF. Let’s discuss both to find out the difference.

The info dictionary

In figure 12.1, the document properties from the Hello World example you made in topic 1 are compared to a new Hello World example with metadata added.

Figure 12.1 Metadata in PDF files

Listing 12.1 MetadataPdf.java

This code snippet adds the title of the document, its author, the subject, some keywords, and the application that was used to create the PDF as metadata. If you look inside the PDF, you see that this information is stored in a dictionary, named the info dictionary, along with the creation date, modification date, and PDF producer. This is the limited set of metadata key-value pairs that is supported in PDF.
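
As a minimal sketch of the calls described above, assuming iText 5 (the com.itextpdf.text packages) is on the classpath: the class name InfoDictionarySketch, the in-memory output, and the sample subject, keywords, and creator strings are placeholders chosen for this illustration, not the content of the original listing.

```java
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;
import java.io.ByteArrayOutputStream;

final class InfoDictionarySketch {
    // Builds a one-page PDF in memory whose info dictionary holds the given metadata.
    static byte[] createPdf(String title, String author) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Document document = new Document();
        try {
            PdfWriter.getInstance(document, out);
            // The metadata is added before the document is opened.
            document.addTitle(title);
            document.addAuthor(author);
            document.addSubject("Placeholder subject");     // sample value
            document.addKeywords("Metadata, iText, PDF");   // sample value
            document.addCreator("Placeholder application"); // sample value
            document.open();
            document.add(new Paragraph("Hello World"));
            document.close();
        } catch (DocumentException e) {
            throw new IllegalStateException(e);
        }
        return out.toByteArray();
    }
}
```

Writing to a ByteArrayOutputStream keeps the sketch free of file paths; in a real application you would typically pass a FileOutputStream to PdfWriter.getInstance() instead.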


Three metadata entries are filled in automatically by iText (and you can’t change them). If you create a PDF from scratch, iText will use the time on the clock of your local computer as the creation and modification date. If you manipulate a PDF with PdfStamper, only the modification date will be changed. The same goes for the producer name.

Listing 12.2 MetadataPdf.java

With the getInfo() method of PdfReader, you can retrieve the keys and values of the info dictionary as Strings in a HashMap. You can add, remove, or replace entries in that HashMap, and write the altered metadata to the PDF by passing it to the setMoreInfo() method of PdfStamper.
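
The round trip through getInfo() and setMoreInfo() could look roughly like the sketch below. It assumes iText 5 on the classpath; the class name InfoUpdateSketch and the method replaceTitle() are hypothetical, and only the replacement of a single entry is shown.

```java
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.HashMap;

final class InfoUpdateSketch {
    // Returns a copy of the source PDF in which the Title entry is replaced.
    static byte[] replaceTitle(byte[] src, String newTitle) {
        try {
            PdfReader reader = new PdfReader(src);
            ByteArrayOutputStream dest = new ByteArrayOutputStream();
            PdfStamper stamper = new PdfStamper(reader, dest);
            // Keys and values of the existing info dictionary, as Strings.
            HashMap<String, String> info = reader.getInfo();
            info.put("Title", newTitle);
            // Hand the altered map to the stamper; it is written on close().
            stamper.setMoreInfo(info);
            stamper.close();
            reader.close();
            return dest.toByteArray();
        } catch (IOException | DocumentException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The keys in the map are the entry names of the info dictionary, such as Title, Author, Subject, Keywords, and Creator.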

FAQ Can I change the producer info? The value for the PDF producer tells you which version of iText was used to create the document. It’s also a way to tell the end users of the document that iText was used to create it. You can’t change this without breaking the software license that allows you to use iText for free.

A dictionary is a PDF object, and the values that are stored in this dictionary are also PDF objects. PDF viewers such as Adobe Reader don’t have any problem interpreting these objects, but applications that aren’t PDF-aware can’t find or read this meta-information. The Extensible Metadata Platform (XMP) was introduced to solve this problem.

The metadata shown in the window on the right of figure 12.1 was added using the code in listing 12.1.

The Extensible Metadata Platform (XMP)

The Extensible Metadata Platform provides a standard format for the creation, processing, and interchange of metadata. An XMP stream can be embedded in a number of popular file formats (TIFF, JPEG, PNG, GIF, PDF, HTML, and so on) without breaking their readability by non-XMP-aware applications.

The XMP specification defines a model that can be used with any defined set of metadata items. It also defines particular schemas; for instance, the Dublin Core schema provides a set of commonly used properties such as the title of the document, a description, and so on. For PDF files, there’s a PDF schema with information about the keywords, the PDF version, and the PDF producer. This way, an application that can’t interpret PDF syntax can still extract the metadata from the file by detecting and parsing the XML that is embedded inside the PDF. What follows is an example of such an XMP metadata stream.

Listing 12.3 xmp.xml
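
To give a rough idea of the shape of such a stream, the sketch below assembles a minimal XMP packet by hand and reads the title back with a plain XML parser, using only the Java standard library. The class XmpPacketSketch and its methods are invented for this illustration; the packet holds just a Dublin Core title and a PDF-schema keywords entry, so it is much smaller than what iText generates.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XmpPacketSketch {
    // Fixed id that the XMP specification prescribes for the xpacket header.
    static final String PACKET_ID = "W5M0MpCehiHzreSzNTczkc9d";
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Escapes the characters that may not appear literally in element content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Assembles a minimal packet: a Dublin Core title plus PDF-schema keywords.
    static String minimalPacket(String title, String keywords) {
        return "<?xpacket begin=\"\uFEFF\" id=\"" + PACKET_ID + "\"?>\n"
            + "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">\n"
            + " <rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n"
            + "  <rdf:Description rdf:about=\"\" xmlns:dc=\"" + DC_NS + "\">\n"
            + "   <dc:title><rdf:Alt><rdf:li xml:lang=\"x-default\">"
            + escape(title) + "</rdf:li></rdf:Alt></dc:title>\n"
            + "  </rdf:Description>\n"
            + "  <rdf:Description rdf:about=\"\" xmlns:pdf=\"http://ns.adobe.com/pdf/1.3/\">\n"
            + "   <pdf:Keywords>" + escape(keywords) + "</pdf:Keywords>\n"
            + "  </rdf:Description>\n"
            + " </rdf:RDF>\n"
            + "</x:xmpmeta>\n"
            + "<?xpacket end=\"w\"?>";
    }

    // Reads the title back with a generic XML parser; no PDF knowledge is involved.
    static String titleOf(String packet) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(packet)));
            NodeList titles = doc.getElementsByTagNameNS(DC_NS, "title");
            return titles.getLength() == 0 ? null : titles.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("not a well-formed XMP packet", e);
        }
    }
}
```

The xpacket processing instructions at the start and end are the wrapper that lets a tool locate the packet inside an arbitrary file; everything between them is ordinary RDF/XML.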

This stream was created with iText using the XmpWriter class. The following bit of code shows how to add an XMP stream as metadata.

Listing 12.4 MetadataXmp.java

You pass the byte[] created with XmpWriter to the setXmpMetadata() method to add the stream to the PdfWriter. This XMP stream covers the complete document. It’s also possible to define an XMP stream for individual pages. In that case you need to use the setPageXmpMetadata() method.

You can delegate the creation of the XMP stream to iText. Just create the metadata as done in listing 12.1, and add the following line:

writer.createXmpMetadata();

Suppose you have a PDF file that only contains metadata in an info dictionary. In that case, you can use the following to add an XMP stream.

Listing 12.5 MetadataXmp.java
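
A sketch of how this might be done, assuming iText 5 on the classpath and assuming that the XmpWriter constructor taking an output stream and an info map is available in your version. The class XmpFromInfoSketch and the method addXmpFromInfo() are names made up for this illustration.

```java
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;
import com.itextpdf.text.xml.xmp.XmpWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

final class XmpFromInfoSketch {
    // Returns a copy of the source PDF with an XMP stream derived from its info dictionary.
    static byte[] addXmpFromInfo(byte[] src) {
        try {
            PdfReader reader = new PdfReader(src);
            ByteArrayOutputStream dest = new ByteArrayOutputStream();
            PdfStamper stamper = new PdfStamper(reader, dest);
            // Let XmpWriter translate the info dictionary entries into an XMP packet.
            ByteArrayOutputStream xmpBytes = new ByteArrayOutputStream();
            XmpWriter xmp = new XmpWriter(xmpBytes, reader.getInfo());
            xmp.close();
            // Attach the packet as document-level metadata.
            stamper.setXmpMetadata(xmpBytes.toByteArray());
            stamper.close();
            reader.close();
            return dest.toByteArray();
        } catch (IOException | DocumentException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The info dictionary itself is left in place, so PDF-aware viewers and XML-based tools both find the same values.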

Extracting the XMP metadata from an existing PDF is done using the getMetadata() method on a PdfReader instance.
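
For illustration, a small helper along these lines would return the stream as text. It assumes iText 5 on the classpath, assumes the packet is UTF-8 encoded, and relies on getMetadata() returning null when the document has no XMP stream; the class XmpReadSketch and the method readXmp() are hypothetical names.

```java
import com.itextpdf.text.pdf.PdfReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class XmpReadSketch {
    // Returns the document-level XMP packet as a String, or null if there is none.
    static String readXmp(byte[] pdf) {
        try {
            PdfReader reader = new PdfReader(pdf);
            try {
                byte[] xmp = reader.getMetadata();
                // Assumption: the packet is UTF-8 encoded.
                return xmp == null ? null : new String(xmp, StandardCharsets.UTF_8);
            } finally {
                reader.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned String can be handed to any XML parser; no further iText calls are needed to interpret it.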

Tools or applications that aren’t PDF-aware will search through the file for an xpacket with the id shown in listing 12.3, so it’s important that the stream containing the XMP metadata is never compressed.
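
The sketch below imitates such a tool with nothing but the Java standard library: it scans raw bytes for the xpacket header with the standard id and returns everything up to the closing xpacket instruction. The class XpacketScanner and its methods are invented for this example, and a real scanner would need to handle more cases (other encodings, several packets). The deflate() helper is only there to show that the same scan finds nothing once the bytes are compressed.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

final class XpacketScanner {
    private static final byte[] BEGIN = ascii("<?xpacket begin=");
    private static final byte[] ID = ascii("W5M0MpCehiHzreSzNTczkc9d");
    private static final byte[] END = ascii("<?xpacket end=");
    private static final byte[] PI_CLOSE = ascii("?>");

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // Index of the first occurrence of needle at or after from, or -1.
    static int indexOf(byte[] haystack, byte[] needle, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    // Returns the first XMP packet found in the raw bytes, or null if there is none.
    static String findPacket(byte[] file) {
        int start = indexOf(file, BEGIN, 0);
        while (start >= 0) {
            int headerEnd = indexOf(file, PI_CLOSE, start);
            if (headerEnd < 0) return null;
            int idPos = indexOf(file, ID, start);
            if (idPos >= 0 && idPos < headerEnd) {
                // Header carries the expected id: look for the trailer.
                int trailer = indexOf(file, END, headerEnd);
                if (trailer < 0) return null;
                int trailerEnd = indexOf(file, PI_CLOSE, trailer);
                if (trailerEnd < 0) return null;
                int length = trailerEnd + PI_CLOSE.length - start;
                return new String(file, start, length, StandardCharsets.UTF_8);
            }
            start = indexOf(file, BEGIN, headerEnd);
        }
        return null;
    }

    // Compresses the bytes with Deflate, as a compressed PDF stream would be.
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }
}
```

Calling findPacket() on the bytes of a file with an uncompressed XMP stream yields the packet text, whereas calling it on the output of deflate() for the same bytes yields null, because the marker text no longer appears literally in the data.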
