Using fonts in iText Part 2

USING OTHER CMAPS

The UCS2 in the CMap names listed in table 11.4 stands for Universal Character Set. There’s also a JAR named iTextAsianCmaps.jar with the contents of the com/itextpdf/text/pdf/cmaps/ directory. These CMaps can be used in combination with the PdfEncodings class to convert a String in a specific encoding to a String with 2-byte CIDs.

For example, if you have a char[] encoded in the GB 18030-2000 character set, you need to load the CMap GBK2K-H and convert it to a sequence of Identity-H CIDs.

We’ve discussed Asian languages written vertically. Now let’s find out how to write Semitic languages such as Hebrew and Arabic; these are written from right to left.

Writing from right to left

Figure 11.9 shows an XML file with the text "Say Peace in all languages" in English, Arabic, and Hebrew. The XML is encoded using UTF-8, and in the top left of the figure it’s opened in WordPad, which assumes it’s plain text, hence the strange characters. You’ll use this XML file to create the PDF document shown in the foreground. But you’ll start by creating the PDF showing the movie title Nina’s Tragedies in Hebrew.


Figure 11.9 Writing from right to left

Writing from right to left is only supported when using ColumnText or PdfPCell objects.

RUN DIRECTION IN COLUMNTEXT OBJECTS

If you don’t know Hebrew, you’ll probably try to read the glyphs of the Hebrew movie title from left to right. You see four glyphs, a space, two glyphs, a space, and the rest of the title. Let’s compare this with the original String here.

Listing 11.11 RightToLeftExample.java


The String that is passed to the ColumnText object includes seven 2-byte characters, a space, two characters, a space, and four characters. In reality, the first glyph from the left on the title line in figure 11.9 is \u05d4, followed by \u05e0, and so on: iText has added the characters in reverse order because you changed the run direction with the setRunDirection() method. This method accepts one of the following options:

RUN_DIRECTION_DEFAULT: Uses the default direction

RUN_DIRECTION_LTR: Uses bidirectional reordering with a left-to-right preferential run direction

RUN_DIRECTION_NO_BIDI: Doesn’t use bidirectional reordering

RUN_DIRECTION_RTL: Uses bidirectional reordering with a right-to-left preferential run direction

The best way to understand what bidirectional means is to look at the message of peace in figure 11.9. In this text, the term I18N (Internationalization) is used. If you choose RTL as the run direction, you don’t want this term to be reordered as N81I; you want to preserve the order of the Latin text. Choosing the option RUN_DIRECTION_RTL means that the characters are reordered from right to left by preference, but if Latin text is encountered, the left-to-right order is preserved. The PDF containing the message of peace was created using a PdfPTable.
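The way a right-to-left base direction treats embedded Latin text can be tried out without iText. The following is a minimal sketch, not iText’s own implementation: it assumes the JDK class java.text.Bidi, which implements the Unicode Bidirectional Algorithm, and the class name BidiSketch and the Hebrew sample word are illustrative choices.

```java
import java.text.Bidi;

final class BidiSketch {

    /** Returns the characters of a logical-order String in visual (left-to-right display) order. */
    static String visualOrder(String logical) {
        // Resolve the embedding levels with a right-to-left base direction,
        // comparable in spirit to choosing RUN_DIRECTION_RTL.
        Bidi bidi = new Bidi(logical, Bidi.DIRECTION_RIGHT_TO_LEFT);
        int n = logical.length();
        byte[] levels = new byte[n];
        Character[] chars = new Character[n];
        for (int i = 0; i < n; i++) {
            levels[i] = (byte) bidi.getLevelAt(i); // odd level = RTL run, even level = LTR run
            chars[i] = logical.charAt(i);
        }
        // Permute the characters in place according to their levels.
        Bidi.reorderVisually(levels, 0, chars, 0, n);
        StringBuilder sb = new StringBuilder(n);
        for (Character c : chars) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A Hebrew word followed by the Latin term I18N, in logical (typing) order.
        String visual = visualOrder("\u05e9\u05dc\u05d5\u05dd I18N");
        for (int i = 0; i < visual.length(); i++) {
            System.out.printf("U+%04X ", (int) visual.charAt(i));
        }
        System.out.println();
    }
}
```

In the visual order, the Hebrew letters appear reversed while I18N keeps its left-to-right order, which is the behavior described for the right-to-left preferential run direction.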

RUN DIRECTION IN PDFPCELL OBJECTS

You can use the technique explained in chapter 9 to convert the XML file into a PDF document. Listing 11.12 shows what to do when an opening or a closing tag is encountered.

Listing 11.12 SayPeace.java


The Arabic text produced by this example looks all right, but it’s important to understand that iText has done a lot of work behind the scenes. Not every character in the XML file was rendered as a separate glyph. Some characters or glyphs were combined and replaced.

To understand what happens, we need to talk about diacritics and ligatures.

Advanced typography

If you want to see a Thai cowboy movie about a poor hero who falls in love with a girl from the upper classes, you should buy a ticket for Tears of the Black Tiger at the Foobar Film Festival. Figure 11.10 shows the poster featuring the protagonists. As you can see in the figure, the Thai title is also printed.


Figure 11.10 Using diacritics, ancillary glyphs added to a letter

The String containing the Thai title is written twice, using two different fonts: Angsana New and Arial Unicode MS.

Listing 11.13 Diacritics1.java

If you look at the poster, you’ll see that there’s a special curl above the first character. This curl is a diacritical mark.

DIACRITICAL MARKS

You’ve used diacritical marks before. In figure 11.5, you’ll find a cedilla, a hacek, and so on, but there’s a difference between listing 11.4 and listing 11.13. When you printed c cedilla (ç), you only used one Unicode character (\u00e7). In listing 11.13, the diacritical mark is a separate character: \u0e49 has to be combined with \u0e1f.

That’s unusual for Western languages. Suppose that you want to type the French word être (to be) on a French keyboard (AZERTY instead of QWERTY); you’d need to hit five different keys: ^etre. If you save a text file with this String, only four bytes would be used, because there’s a character value for the e with a circumflex.

In other languages that use diacritics more frequently, it’s common to store both characters separately. For instance: ^etre or e^tre instead of être. That’s what happened in the Thai example. The fonts Angsana New and Arial Unicode MS used a negative character advance for this glyph to create the illusion that the two characters are actually one.
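The difference between a precomposed character and a letter followed by a separate mark can be made visible with plain Java. This is a small sketch that assumes the JDK class java.text.Normalizer; Unicode normalization isn’t part of the iText examples, and the class name CombiningMarksSketch is hypothetical. It shows that ê can be stored as one character or as two, whereas the Thai pair from listing 11.13 has no single-character form.

```java
import java.text.Normalizer;

final class CombiningMarksSketch {

    /** Splits precomposed letters into a base letter plus combining mark (NFD). */
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** Recombines letter and mark where Unicode defines a precomposed character (NFC). */
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** True if the character is a non-spacing mark, drawn on top of another glyph. */
    static boolean isNonSpacingMark(char c) {
        return Character.getType(c) == Character.NON_SPACING_MARK;
    }

    public static void main(String[] args) {
        String precomposed = "\u00eatre";            // four characters, e-circumflex is one of them
        String decomposed = decompose(precomposed);  // five characters: e, combining circumflex, t, r, e
        System.out.println(precomposed.length() + " vs " + decomposed.length());

        String thai = "\u0e1f\u0e49";                // base consonant plus tone mark
        System.out.println(compose(thai).length());  // stays at two characters
        System.out.println(isNonSpacingMark('\u0e49'));
    }
}
```

Because the Thai mark always remains a separate character, it’s up to the font metrics (or your own adjustments) to place it above the preceding letter.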

CHANGING THE CHARACTER ADVANCE

The character advance is stored in the font’s metrics, but you can change this value in the iText BaseFont object. The second example in figure 11.10 is somewhat artificial, but it demonstrates how the mechanism works. The original title of the best Swedish film of 1999 is written like this:

Tomten är far till alla barnen

The title literally means "Santa Claus is the father of all children," but it was translated into "In bed with Santa," probably to prevent parents from bringing young children to the movie.

The next example shows how you can change the character advance of the ¨ character (the dieresis) so that it’s positioned on top of the next character that is added.

Listing 11.14 shows how the title was added to the Document.

Listing 11.14 Diacritics2.java


The width of the umlaut (or dieresis) glyph is 333 units in Arial (glyph space). To get the umlaut or dieresis above the letter that follows the diacritical mark, you change the advance to a negative value.

The value used here is ideal for the letter a, but there’s no guarantee that it will fit perfectly above the other vowels, because they may have a different width and character advance. Arial is a proportional font, meaning that different glyphs have different widths. This problem doesn’t occur when you use a fixed-width or monospaced font such as Courier. Every character in Courier has the same width (600 units in glyph space). To make the diacritical mark fit above the next character, it’s sufficient to set its advance to 0.
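To see why an advance of 0 (or a negative advance) puts a mark above the next letter, it helps to compute where each glyph starts. The following sketch is an illustration only, assuming the usual convention of 1000 glyph-space units per em; the class AdvanceSketch and the method origins() are hypothetical names, not part of iText.

```java
import java.util.Arrays;

final class AdvanceSketch {

    /**
     * Computes the horizontal origin of each glyph, in points, when glyphs are
     * placed one after another. Advances are given in glyph space
     * (assumed here: 1000 units per em).
     */
    static double[] origins(int[] advances, double fontSize) {
        double[] x = new double[advances.length];
        double pen = 0;
        for (int i = 0; i < advances.length; i++) {
            x[i] = pen;                              // glyph i starts at the current pen position
            pen += advances[i] * fontSize / 1000.0;  // then the pen moves by that glyph's advance
        }
        return x;
    }

    public static void main(String[] args) {
        // Monospaced case: a mark whose advance was set to 0, followed by two 600-unit letters.
        System.out.println(Arrays.toString(origins(new int[] {0, 600, 600}, 10)));
    }
}
```

With an advance of 0, the mark and the letter that follows share the same origin. A negative advance moves the pen to the left of the mark’s origin, which is what you need in a proportional font where the mark and the letter don’t have the same width.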

CHANGING THE CHARACTER WIDTH

In listing 11.15, a proportional font is changed into a fixed width font. You use the method getWidths() to get an array containing the widths of every character in the font (measured in glyph space). You then change the width to 600 units for every glyph with a width greater than 0.
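The width change itself is a simple loop over an int array. The sketch below works on a plain array rather than on a real font, so it can be run without iText; the class MonospaceSketch, the method forceFixedWidth(), and the sample widths are assumptions made for illustration.

```java
import java.util.Arrays;

final class MonospaceSketch {

    /** Sets every width greater than 0 to the given fixed width, in place. */
    static void forceFixedWidth(int[] widths, int fixedWidth) {
        for (int i = 0; i < widths.length; i++) {
            if (widths[i] > 0) {
                widths[i] = fixedWidth; // entries with width 0 are left untouched
            }
        }
    }

    public static void main(String[] args) {
        // Made-up widths in glyph space; a real font would provide this array.
        int[] widths = {0, 278, 556, 0, 722};
        forceFixedWidth(widths, 600);
        System.out.println(Arrays.toString(widths));
    }
}
```

Entries with a width of 0 stay at 0, so only glyphs that already take up space are widened or narrowed to 600 units.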

Listing 11.15 Monospace.java

You force iText to use the changed widths with the method setForceWidthsOutput(). This feature is very useful if you want to print a Chinese text where every ideogram needs to have the same width, but it’s not very elegant when you use it for Western fonts. If you only want to give every character more space, you should use the setCharacterSpacing() method instead.

Listing 11.16 ExtraCharSpace.java

Ligatures are another example involving advanced typography.

LIGATURES

A ligature occurs when a combination of two or more characters is considered to be one and only one glyph. A letter with a diacritic isn’t usually called a ligature, but the same principle applies. One of the ligatures we all know—though we may have forgotten it’s a ligature—is the & character. The ampersand sign was originally a ligature for the Latin word et (meaning and). Figure 11.11 shows a movie title with ligatures in Danish and Arabic.

As is the case with diacritics, you usually don’t have to worry about ligatures in languages using Latin text. Usually, you’ll use only one character for the ligature, such as "æ" or "ø".

Figure 11.11 Using ligatures, joining different glyphs into one

If you want to use more than one character, for instance "ae" for "æ" or "/o" for "ø", you’ll need to write code that makes the ligature for you. You need to write your own ligaturize() method, as is done in listing 11.17.

Listing 11.17 Ligatures1.java

The combination "ae" is changed into "æ", and the combination "/o" into "ø". Similar code, but much more complex than this small snippet, is present in iText to make Arabic ligatures.
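A ligaturize() method for these two Danish ligatures can be very short. The following is a minimal sketch and not necessarily the code used in Ligatures1.java: it assumes that a plain String replacement of "ae" and "/o" is sufficient, and the class name LigatureSketch and the sample phrase are illustrative.

```java
final class LigatureSketch {

    /** Replaces the two-character combinations "ae" and "/o" with single ligature characters. */
    static String ligaturize(String s) {
        return s.replace("ae", "\u00e6")   // ae -> ae ligature (U+00E6)
                .replace("/o", "\u00f8");  // /o -> o with stroke (U+00F8)
    }

    public static void main(String[] args) {
        // Illustrative input typed without special characters.
        System.out.println(ligaturize("Kaerlighed ved f/orste hik"));
    }
}
```

A replacement this naive also changes every "ae" that isn’t meant to be a ligature, which hints at why the equivalent code for Arabic has to be far more elaborate.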

WRITING ARABIC

In figure 11.11, the Arabic translation of the movie title Lawrence of Arabia has been added three times. The first version of the title is wrong because the glyphs are added from left to right, whereas Arabic is written from right to left (see section 11.3.4). In the second version, the glyphs are written in reverse order, but a space was added between all characters. No ligatures are made, so the title isn’t rendered correctly. Compare this line with the next one. The omission of the extra spaces is the only difference, but if you look closely, you can see that some character combinations were replaced by another glyph. That was done by the iText class ArabicLigaturizer.

If you study listing 11.18, you can see that you don’t have to do anything special to start up the ArabicLigaturizer. If the run direction is RTL and if iText detects Unicode characters in the Arabic character set, this is done automatically.
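How ArabicLigaturizer detects and replaces characters internally isn’t shown here. As a rough illustration of the two ideas involved, the sketch below uses only JDK classes: Character.UnicodeScript to check whether a String contains Arabic characters, and java.text.Normalizer to show that Unicode has dedicated code points for Arabic ligature glyphs. The class and method names are hypothetical.

```java
import java.text.Normalizer;

final class ArabicSketch {

    /** True if at least one code point in the String belongs to the Arabic script. */
    static boolean containsArabic(String s) {
        return s.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC);
    }

    /** Expands presentation-form ligatures back into their component letters (NFKD). */
    static String expandPresentationForms(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        System.out.println(containsArabic("\u0644\u0627")); // the letters lam and alef
        System.out.println(containsArabic("Lawrence"));

        // U+FEFB is a single ligature glyph that stands for lam followed by alef.
        String expanded = expandPresentationForms("\ufefb");
        System.out.println(expanded.length());
    }
}
```

A ligaturizer goes in the opposite direction of expandPresentationForms(): it starts from the individual letters and picks the glyph that fits the combination and its position in the word.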

Listing 11.18 Ligatures2.java


Note that the setRunDirection() method only exists for the classes PdfPCell and ColumnText. Both classes also have a setArabicOption() method to tell iText how to deal with vowels in Arabic. These are the possible values for the parameter:

ColumnText.AR_NOVOWEL: Eliminates Arabic vowels

ColumnText.AR_COMPOSEDTASHKEEL: Composes the Tashkeel on the ligatures

ColumnText.AR_LIG: Does extra double ligatures

None of these options has any effect on this example, but it can be useful information if you need advanced Arabic support.

This is highly specialized functionality; it’s time to return to everyday use of iText and look at some classes that make working with fonts easier.
