Creating a PDF from a servlet (iText 5)

Up until now, you’ve only worked with standalone examples. You compiled them using the javac command and executed them with java, resulting in one or more PDF documents.

For this topic, you need to install an application server. If you’ve written and deployed Java servlets before, you shouldn’t have any problem setting up the examples. If you don’t have any experience with J2EE applications, please consult a topic about writing web applications in Java, as this is outside the scope of this topic.

I use Tomcat in combination with Eclipse. This allows me to choose Run As > Run on Server instead of Run As > Java Application. Eclipse will start up an instance of Tomcat, and a browser window opens inside my IDE. If I’m pleased with the result, I deploy the application on my web server. See figure 9.1. The window at the lower right in the foreground is Eclipse; the windows in the background are browser windows: Firefox, Google Chrome, Microsoft Internet Explorer (MSIE).

To get to this result, you need to integrate the five steps in the PDF creation process in a servlet.


Figure 9.1 Hello World servlet opened in Eclipse, Firefox, Chrome, and MSIE

The five steps of PDF creation in a web application

When we discussed step 2 in the PDF creation process, writing a simple Hello World example to a FileOutputStream, you learned that we could have used any other OutputStream: for instance, a ServletOutputStream obtained from the HttpServletResponse with the getOutputStream() method.

Listing 9.1 Hello.java

The difference between this and the standalone "Hello World" example from topic 1 is that here you subclass HttpServlet and override the doGet() or doPost() method, or both. You copy and paste the five steps into this method:

■ Step 1: Create the Document.

■ Step 2: Create an instance of PdfWriter and use response.getOutputStream() for the second parameter.

■ Step 3: Open the Document.

■ Step 4: Add content.

■ Step 5: Close the Document.

This is probably the simplest iText servlet you can write.

If you want to deploy it in a web application, you have to adapt the web.xml configuration file of your application. Note that most IDEs have a wizard that updates this XML file for you. I made my web.xml file using a wizard in Eclipse.

Listing 9.2 web.xml

You’ll put all the examples from this topic in a web application named topic. As you move on, you’ll have to add more servlet and servlet-mapping tags to this file.

You’ll use /hello.pdf as the URL pattern for your first servlet. The URL to run the servlet on the localhost will look like this: http://localhost:8080/topic/hello.pdf. You can also see the servlet in action on http://itextpdf.org:8180/topic/hello.pdf; that’s where I deployed the WAR file of the application. You can use the ANT files that come with the examples to create your own WAR file if you want to test this functionality on your own server.

The screenshots in figure 9.1 prove that this servlet works for recent versions of the most common browsers and PDF viewers, but you may experience problems that are not iText-related with specific browser and viewer combinations. How can you determine whether a problem is caused by the browser, by the server, or by (the wrong use of) iText?

Troubleshooting web applications

Let’s start with rules of thumb that can save you from a lot of frustration when trying to get your PDF servlet online. These rules may seem trivial, but they’re very important.

■ Always begin by writing code that runs as a standalone example. If the example doesn’t work in its standalone version, it won’t work in a web application either, but at least you can rule out all problems related to the server or the browser.

■ Start with simple code samples based on the examples in this topic. Gradually add complexity until something goes wrong. Look at the stack trace in the server logs. Most of the time, the error messages will tell you exactly what to do. If not, post the stack trace to the iText mailing list, and don’t forget to mention which application server you’re using, as well as the Java version and the iText release number.

■ Always test your application on different machines, using different browsers, even if there isn’t any problem. Some web applications won’t show any problems when tested on one type of browser, but will fail when using another browser.

■ Create a file on the server’s filesystem if no file appears in the browser. An easy way to find out if a problem is caused by iText or by the browser is to replace the ServletOutputStream in step 2 with a FileOutputStream (for debugging purposes only). If the file is generated correctly on your server, you can rule out iText as the cause of the problem.

By following this last rule, you should be able to determine whether the problem is a client-side or a server-side problem.

SERVER-SIDE PROBLEMS

Over the years, I’ve compiled a list of things that can go wrong on the server side, based on what other users have posted on the mailing list.

■ Bad exception handling The first thing you shouldn’t like about listing 9.1 is the way the DocumentException is handled. If something goes wrong in the try block, an IOException is thrown, resulting in an internal server error. If you’re using Tomcat, an HTML page with the header "HTTP Status 500" is sent to the browser, showing (part of) the stack trace of the exception. That’s not something you want to show to the visitors of your site. You’re probably used to providing error pages that are less technical than the one generated by Tomcat, but remember that you’re creating PDFs. If you send HTML to a PDF viewer, it will throw an error saying "the file doesn’t begin with %PDF."

■ Mixing HTML and PDF syntax Be careful not to mix HTML error messages into a stream of PDF bytes. If a PDF viewer is already opened as a browser plug-in, it will tell you that the PDF is corrupt because it can’t interpret the HTML code. The best way to debug problems like this is by saving the stream that is sent to the browser as a file. First try opening it in Adobe Reader. If it doesn’t open correctly, have a look at it in a text editor that preserves binary characters. Don’t forget to scroll down beyond the %%EOF end-of-file marker (if possible). I’ve seen web applications that were adding a stream of plain HTML to the PDF file. Newer versions of the Adobe Reader plug-in may ignore the HTML, but older versions will complain that the file is corrupt.

■ The blank-page problem If you don’t find HTML syntax, but you see an unusual amount of question marks inside blocks marked with stream and endstream, the problem is server-related. The question marks should be binary characters. You’ll probably be able to open the PDF in the browser plug-in because the page structure of the PDF is OK, but you’ll only see blank pages because the content of the pages is corrupted. This can happen when your server flattens all bytes with a value higher than 127. Consult your web (or application) server manual to find out how to make sure binary data is sent correctly to the browser.

■ Problems with JARs A typical symptom is a ClassNotFoundException. Check whether you have added all the JARs you need to the classpath of your web application. If an iText class is missing, make sure you don’t have more than one version of the iText.jar in the classpath; for instance, one version in the lib directory of your web application, and a different version in the lib directory of the application server. Different versions can lead to conflicts. Finally, check whether the application is compiled with the correct compiler. iText is compiled with Java 5, so you can’t run it on a server that is running an older Java Runtime Environment (JRE).

■ A resource can’t be found Many server-related problems are caused by an image, a font, or another resource that can’t be found. A file that was available for the standalone example might not be available for the web application. Normally, the exception will give you an indication where to look. Maybe the working directory of the servlet is different from what you expected. The problem can also be caused by permission issues, or simply by the fact that a resource isn’t present on the server. If the cause isn’t obvious, try reproducing the problem in a servlet that doesn’t involve iText. For instance, read the bytes of the resource file, and write them to the ServletOutputStream. If this fails, your problem isn’t iText-related.
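The advice to save the stream to a file and inspect it can be partly automated. The following is a minimal sketch, not part of the original examples: the class name PdfStreamCheck and its methods are hypothetical, and it only looks for the two symptoms described above (a response that doesn’t begin with %PDF, and stray bytes such as HTML after the last %%EOF marker).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper; uses only the Java standard library.
final class PdfStreamCheck {
    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the data begins with the %PDF- signature. */
    static boolean startsWithPdfHeader(byte[] data) {
        if (data.length < HEADER.length) {
            return false;
        }
        for (int i = 0; i < HEADER.length; i++) {
            if (data[i] != HEADER[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Counts the non-whitespace bytes that follow the last %%EOF marker.
     * Returns -1 if the data contains no %%EOF marker at all.
     */
    static int strayBytesAfterEof(byte[] data) {
        int eof = lastIndexOf(data, EOF_MARKER);
        if (eof < 0) {
            return -1;
        }
        int stray = 0;
        for (int i = eof + EOF_MARKER.length; i < data.length; i++) {
            byte b = data[i];
            if (b != ' ' && b != '\t' && b != '\r' && b != '\n') {
                stray++;
            }
        }
        return stray;
    }

    /** Finds the last position at which the pattern occurs, or -1. */
    private static int lastIndexOf(byte[] data, byte[] pattern) {
        for (int start = data.length - pattern.length; start >= 0; start--) {
            boolean match = true;
            for (int j = 0; j < pattern.length; j++) {
                if (data[start + j] != pattern[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return start;
            }
        }
        return -1;
    }
}
```

To use it, read the saved response into a byte array (for instance with Files.readAllBytes()) and call both methods. A result greater than zero from strayBytesAfterEof() suggests that something was appended to the PDF. It doesn’t prove the rest of the file is valid.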

If the file generated on the server side is OK, or if none of the situations mentioned so far matches your problem, chances are that your problem is browser-related.

THE BROWSER DOESN’T RECOGNIZE THE FILE AS A PDF

When an end user installs Adobe Reader, the browsers on the user’s OS should be detected and configured automatically. When a browser is installed, it should detect Adobe Reader if it’s present. If there’s no PDF viewer on the end user’s system, or if the PDF viewer isn’t configured correctly, the user will see content that looks like gibberish starting with %PDF-1.4 %âãÏÓ.

If this "gibberish problem" only occurs for a handful of end users, not for all your users, you’ll have to ask these people to install or reinstall their PDF viewer. If all users experience the same problem, the problem is caused on the server side. The viewer receives the PDF syntax, but shows it as if it were plain text. Maybe you didn’t set the content type correctly, in which case you need to add this line to your servlet: response.setContentType("application/pdf");

Old versions of MSIE ignore the content type; they only look at the file extension. URLs ending with .pdf are rendered fine, but if you use a different URL pattern, the browser plug-in isn’t opened. The most elegant way to solve this problem is by using a URL pattern as shown in listing 9.2. If this is not an option, you could add a dummy parameter ending in .pdf to the query string of the URL.

Use this solution as a last resort. A better solution is to set the content disposition in the response header.

Note that not every version of every browser deals with this header correctly.
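Both workarounds come down to building a short string. Here is a minimal sketch, assuming a filename of your own choosing; the class PdfDeliveryHints, its method names, and the parameter name dummy are hypothetical and only serve as an illustration.

```java
// Hypothetical helper; the names are illustrative and not part of iText or the servlet API.
final class PdfDeliveryHints {

    /** Builds a Content-Disposition value that asks the browser to display the PDF inline. */
    static String inlineDisposition(String filename) {
        // Replace characters that would break the quoted filename or start a new header line.
        String safe = filename.replaceAll("[\"\\\\\\r\\n]", "_");
        return "inline; filename=\"" + safe + "\"";
    }

    /** Appends a dummy parameter ending in .pdf for browsers that only look at the extension. */
    static String withPdfSuffix(String url) {
        String separator = url.contains("?") ? "&" : "?";
        return url + separator + "dummy=dummy.pdf";
    }
}
```

In a servlet, you would pass the first value to response.setHeader("Content-Disposition", ...) after calling response.setContentType("application/pdf"), and before writing any bytes to the output stream.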

THE PDF IS CORRUPT FOR ONLY A COUPLE OF BROWSERS

When no content length is specified in the header of your dynamically generated file, the browser reads blocks of bytes sent by the web server. Most browsers detect when the stream is finished and use the correct size of the dynamically generated file. Some browsers are known to have problems truncating the stream to the right size—the real size of the PDF is smaller than the size assumed by the browser. The surplus of bytes can contain gibberish, and this can cause the viewer plug-in to show an error message saying the file is corrupt.

If you can’t ask the end user to upgrade to a more recent browser and reader combination, there’s only one solution. You have to specify the content length of the PDF file in the response header. Setting this header has to be done before any content is sent. Unfortunately, you only know the length of the file after you’ve created it. This means you can’t send the PDF to the ServletOutputStream obtained with response.getOutputStream() right away. Instead, you must create the PDF on your filesystem or in memory first (the next listing), so you can retrieve the length, add it to the response header, and send the PDF. This is also true for some other binary file formats.

Listing 9.3 PdfServlet.java

Mailing list subscribers have shared their experience with the community and told us that it’s also safe to set extra response header values. These headers make sure that the end user always gets the most recent version of the PDF, and not a PDF that is loaded from the cache on the client side. This is important if the content of the PDF changes frequently, which would happen if it reports about real-time data.
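The buffering approach can be sketched without a servlet container. This is an illustration under stated assumptions, not the original listing: BufferedPdfResponse and PdfProducer are hypothetical names, the producer stands in for the five iText steps, and the cache header values are ones commonly seen in servlet examples rather than values confirmed by this text.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical sketch: buffer the PDF in memory so its length is known before sending.
final class BufferedPdfResponse {

    /** Stand-in for the five iText steps: writes a complete PDF to the given stream. */
    interface PdfProducer {
        void writeTo(OutputStream out) throws IOException;
    }

    private final byte[] body;
    private final Map<String, String> headers = new LinkedHashMap<>();

    BufferedPdfResponse(PdfProducer producer) throws IOException {
        // Generate the whole document in memory first.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        producer.writeTo(buffer);
        body = buffer.toByteArray();

        // Assumed cache-related values; they ask the client not to reuse a cached copy.
        headers.put("Expires", "0");
        headers.put("Cache-Control", "must-revalidate, post-check=0, pre-check=0");
        headers.put("Pragma", "public");
        headers.put("Content-Type", "application/pdf");
        // The length is only known now that the document is complete.
        headers.put("Content-Length", String.valueOf(body.length));
    }

    /** Applies every header first, then writes the buffered bytes. */
    void send(BiConsumer<String, String> setHeader, OutputStream out) throws IOException {
        headers.forEach(setHeader);
        out.write(body);
        out.flush();
    }
}
```

Inside a servlet, the call would look like send(response::setHeader, response.getOutputStream()), with the producer writing through PdfWriter.getInstance(document, out). Keeping the header logic apart from the servlet API makes it possible to test it as a standalone example first.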

Setting the content length solves the problem caused by old browser and PDF viewer configurations. Note that there are several serious downsides to this solution. When you need to generate large files, you risk an OutOfMemoryError on the server side, and a timeout on the client side. You can work around the server-side problem by writing the PDF to a temporary file on the server and asking the end user to fetch the file when it’s finished. Don’t forget to delete the file once it’s served to the browser.
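The temporary-file workaround might look like the sketch below. The names TempFilePdf and PdfProducer are hypothetical, and the producer again stands in for the iText code; only java.nio.file calls from the standard library are used.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: write the PDF to disk instead of holding it in memory.
final class TempFilePdf {

    /** Stand-in for the iText code that writes the PDF. */
    interface PdfProducer {
        void writeTo(OutputStream out) throws IOException;
    }

    /** Generates the PDF into a temporary file and returns its path. */
    static Path generate(PdfProducer producer) throws IOException {
        Path file = Files.createTempFile("report-", ".pdf");
        try (OutputStream out = Files.newOutputStream(file)) {
            producer.writeTo(out);
        } catch (IOException | RuntimeException e) {
            // Don't leave a half-written file behind.
            Files.deleteIfExists(file);
            throw e;
        }
        return file;
    }

    /** Streams the file to the client, then deletes it; returns the number of bytes sent. */
    static long serveAndDelete(Path file, OutputStream out) throws IOException {
        try {
            long sent = Files.copy(file, out);
            out.flush();
            return sent;
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Before calling serveAndDelete(), Files.size(file) gives the value for the content length header, so the file never has to be loaded into memory as a whole.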

The second problem, avoiding a browser timeout, can be solved by moving the five steps of the PDF creation process to a separate thread. You can add your Runnable implementation as an attribute to the HttpSession object. As long as the PDF document isn’t ready, send an HTML page to the browser that is refreshed on a regular basis, such as every three seconds. Check the thread with every hit; serve the PDF as soon as the document is closed. Not only does this solution solve the technical timeout problem, it also works on a psychological level. People tend to be impatient. They don’t like to wait for a page to load, not knowing whether the connection got lost, whether they should hit the reload button, or whether the server went down. Give them feedback (if possible, a progress bar showing the percentage of data that has been processed) and time seems to go a lot faster!

Usually, I implement the doPost() method to accept parameters and to set up the thread; then I cause a redirect to trigger the doGet() method that serves the HTML and eventually the finished PDF.
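A Runnable for this pattern could be as small as the following sketch. The class name PdfJob, the PdfProducer interface, and the accessor names are assumptions made for illustration; the servlet wiring is described after the code rather than shown in it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch: generate the PDF on a separate thread and let the servlet poll it.
final class PdfJob implements Runnable {

    /** Stand-in for the five iText steps. */
    interface PdfProducer {
        void writeTo(OutputStream out) throws IOException;
    }

    private final PdfProducer producer;
    // volatile, because the request threads read what the worker thread writes.
    private volatile byte[] result;
    private volatile Exception failure;

    PdfJob(PdfProducer producer) {
        this.producer = producer;
    }

    @Override
    public void run() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            producer.writeTo(buffer);
            result = buffer.toByteArray();
        } catch (IOException | RuntimeException e) {
            failure = e;
        }
    }

    /** True once generation has finished, successfully or not. */
    boolean isDone() {
        return result != null || failure != null;
    }

    boolean hasFailed() {
        return failure != null;
    }

    /** The finished PDF, or null while the job is running or after a failure. */
    byte[] getPdf() {
        return result;
    }
}
```

In doPost() you would create the job, start it with new Thread(job).start(), store it with request.getSession().setAttribute("pdfJob", job) (the attribute name is arbitrary), and call response.sendRedirect(). In doGet() you would fetch the job from the session and check isDone(): if it returns false, send the self-refreshing HTML page; otherwise serve the bytes from getPdf() or show an error page.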

GET VERSUS POST

A trivial problem, but one that is easily overlooked, is what happens when people bookmark pages that are the result of a POST action. When they want to return to that page using the bookmark, they initiate a GET request, getting a result that differs from what they expect. You can do the experiment with the example from listing 9.3.

Figure 9.2 shows the URL http://itextpdf.org:8180/ opened in a Firefox window. This page contains two simple forms: one that uses the GET method, the other using the POST method. Recall that neither doGet() nor doPost() was implemented in listing 9.3. Instead, you overrode the service() method, which works in both cases. We’ll conclude the list of client-side issues with the "multiple-hit" problem.

Figure 9.2 PDFs created with GET and POST actions

PROBLEMS CAUSED BY MULTIPLE HITS

In web analytics, a hit is when an end user requests a page from your web server and this page is sent to the user’s browser directly. For example, when you enter the URI http://itextpdf.org:8180/hello.pdf in the location bar, one PDF file opens in your browser window using a PDF viewer plug-in. If I look in my server logs, I should see one line corresponding with this hit. This is true for most browsers, but some browsers hit the server several times for every dynamically generated binary file. You can’t predict how many hits a single request will generate; it could be two or three hits, or occasionally just one.

If you want to avoid this multiple-hit problem, you can try setting the cache parameters in the response header.

Another way to solve the multiple-hit problem is to embed the PDF in an HTML page using the embed tag.

Listing 9.4 embedded.html

If you skip to section 9.3, you’ll also find an example of how to embed a PDF in an HTML page using the object tag.

Using the tips and tricks summed up in this section, you should be able to tackle all the problems that can occur when writing a servlet that produces a PDF document. Writing a JSP page generating a PDF is another story.

Generating a PDF from a JSP page

It’s a bad idea to use JSP to generate binary content. That’s considered improper use of the technology. JSP wasn’t created to produce images, PDF files, or any other binary file type.

But that doesn’t mean it’s impossible. Go to http://itextpdf.org:8180/topic/helloworld.jsp and you’ll see a JSP page in action.

Listing 9.5 helloworld.jsp

Please take my advice and don’t use this example. I’m only including it because the question, "How can I produce a PDF from a JSP page?" turns up on the mailing list on a regular basis. Let me explain why that is a bad idea, using this (working!) example.

Several things can go wrong if you ignore my advice and deploy the code from listing 9.5 on your server. If you write the bytes of the ByteArrayOutputStream to a file on the server, the PDF will be OK, but this doesn’t mean that the PDF will be OK when you send the same bytes to the browser. These are some potential problems:

■ The blank page problem for JSP pages It’s possible that the PDF opens when served on the client, showing nothing but blank pages. Some servers assume that JSP output isn’t binary, and every byte higher than 127 will show up as a question mark.

■ Whitespace corrupting the binary data JSP pages are compiled to a servlet internally. If you think writing a PDF servlet is more difficult than writing a PDF JSP page, think again. If you copy listing 9.5 and start working from there, you’ll probably add indentation, newlines, spaces, and carriage returns, inside as well as outside the <% and %> marks to make the JSP file more readable. Although this is good practice when you write JSP that produces HTML, it can be deadly if you want to generate binary content. If you look at the code of the servlet that is automatically generated based on the JSP file, you’ll see that the whitespace characters outside these marks are written to the OutputStream. This has the same effect as opening a JPG in a text editor and inserting whitespace characters in arbitrary places.

■ OutputStream opened twice Your JSP code may go wrong even before you get the chance to corrupt your PDF file. If you’ve added whitespace before invoking response.getOutputStream(), an exception will be thrown, saying "getOutputStream() has already been called for this response." Calling this method was done implicitly the moment the first unwanted whitespace characters appeared, and it’s forbidden to call that method a second time.
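The first two problems in this list can be reproduced without a server. The sketch below is a simulation, not what any particular container does internally: the class BinaryFlatteningDemo is hypothetical, and a round trip through a US-ASCII String is used as a stand-in for a server that treats the output as 7-bit text.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of how text handling damages binary PDF data.
final class BinaryFlatteningDemo {

    /** Simulates a server that pushes binary output through a 7-bit text encoding. */
    static byte[] throughAsciiText(byte[] binary) {
        // Bytes above 127 can't be decoded as US-ASCII and are replaced during the round trip.
        String asText = new String(binary, StandardCharsets.US_ASCII);
        return asText.getBytes(StandardCharsets.US_ASCII);
    }

    /** Simulates a carriage return and newline emitted by a JSP before the PDF bytes. */
    static byte[] withLeadingWhitespace(byte[] binary) {
        byte[] out = new byte[binary.length + 2];
        out[0] = '\r';
        out[1] = '\n';
        System.arraycopy(binary, 0, out, 2, binary.length);
        return out;
    }
}
```

After the round trip, every byte with a value above 127 has become a question mark, which is the symptom described for the blank-page problem. With the leading whitespace, the data no longer starts with %PDF and every byte offset in the file is shifted by two.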

If you take all these warnings into consideration, you might be able to write a PDF-producing JSP page, but sooner or later you’ll run into trouble. Maybe a colleague will open that JSP in an IDE that automatically formats the code to make it more readable. While debugging problems like this, you’ll probably end up inspecting the servlet that is generated, and eventually, you may want to replace the JSP page with a servlet. That’s why it’s better to stay away from JSP in the first place if you want to produce a PDF document. Write a servlet, and you’ll save time not only for yourself, but also for your employer. Maybe you can use this argument if using JSP is a requirement in your project.

Enough about JSP already. Let’s continue with servlets that involve PDF forms.
