Hindi and Indic Scripts (Digital Library)

Unicode is advertised as a uniform way of representing all the characters used in all the world’s languages. Unicode fonts exist and are installed as standard with most commercial word processors and Web browsers. It is natural for people, particularly those from Western linguistic backgrounds, to assume that all problems associated with representing different languages on computers have been solved. Unfortunately, today’s Unicode-compliant applications fall far short of providing a satisfactory solution for languages with intricate scripts.

We use Hindi and related Indic scripts as an example. As Table 8.1 shows, the Unicode space from U+0900 to U+0DFF is reserved for ten Indic scripts. Although many hundreds of different languages are spoken in India, the principal officially recognized ones are Hindi, Marathi, Sanskrit, Punjabi, Bengali, Gujarati, Oriya, Assamese, Tamil, Telugu, Kannada, Malayalam, Urdu, Sindhi, and Kashmiri. The first 12 of these are written in one of nine writing systems that have evolved from the ancient Brahmi script. The remaining three, Urdu, Sindhi, and Kashmiri, are primarily written in Persian Arabic scripts, but can be written in Devanagari, too (Sindhi is also written in the Gujarati script). The nine scripts are Devanagari, Bengali, Gujarati, Oriya, and Gurmukhi (northern or Aryan scripts), and Tamil, Telugu, Kannada, and Malayalam (southern or Dravidian ones). Figure 8.8 gives some characters in each of these scripts. As you can see, the characters are beautiful, and the scripts differ radically from each other. Unicode also includes a script for Sinhalese, the official language of Sri Lanka.
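The ten-script layout of this region can be made concrete with a small lookup. The sketch below lists the block boundaries as they appear in the published Unicode code charts; the names `INDIC_BLOCKS` and `indic_script` are illustrative and not part of any library, and Unicode calls the Sinhalese block "Sinhala".

```python
# Unicode blocks for the ten Indic scripts in U+0900..U+0DFF.
# Each block is 128 code points wide.
INDIC_BLOCKS = [
    (0x0900, 0x097F, "Devanagari"),
    (0x0980, 0x09FF, "Bengali"),
    (0x0A00, 0x0A7F, "Gurmukhi"),
    (0x0A80, 0x0AFF, "Gujarati"),
    (0x0B00, 0x0B7F, "Oriya"),
    (0x0B80, 0x0BFF, "Tamil"),
    (0x0C00, 0x0C7F, "Telugu"),
    (0x0C80, 0x0CFF, "Kannada"),
    (0x0D00, 0x0D7F, "Malayalam"),
    (0x0D80, 0x0DFF, "Sinhala"),
]


def indic_script(ch):
    """Return the Indic block name for a single character, or None."""
    cp = ord(ch)
    for start, end, name in INDIC_BLOCKS:
        if start <= cp <= end:
            return name
    return None


if __name__ == "__main__":
    for ch in "\u0915\u0995\u0B95a":
        print(f"U+{ord(ch):04X}", indic_script(ch))
```

A lookup like this only identifies the block a code point falls in; it says nothing about which language a text is written in, since one script can serve several languages.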


Hindi, the official language of India, is written in Devanagari (pronounced Dayv’nagri, with the accent on the second a), which is used for writing Marathi and Sanskrit as well. (It is also the official script of Nepal.) The Punjabi language is written in Gurmukhi. Assamese is written in a script that is very similar to Bengali, but it has one additional glyph and another glyph that is different. In Unicode, the two scripts are merged, with distinctive code points for the two Assamese glyphs. Thus the Unicode scripts cover all 12 of the official Indian languages that are not written in Persian Arabic. All these scripts derive from Brahmi, and all are phonetically based. In fact the printing press did not reach the Indian subcontinent until missionaries arrived from Europe. The languages had a long time to evolve before they were fixed in print, which contributes to their diversity.

ISCII: Indian Script Code for Information Interchange

During the 1970s the Indian Department of Official Languages began working on devising codes that catered to all official Indic scripts. A standard keyboard layout was developed that provides a uniform way of entering them all.

Figure 8.8: Examples of characters in Indic scripts

Despite the very different scripts, the alphabets are phonetic and have a common Brahmi root that was used for ancient Sanskrit. The simultaneous availability of multiple Indic languages was intended to accelerate technological development and to facilitate national integration in India.

The result was ISCII, the Indian Script Code for Information Interchange. Announced in 1983 (and revised in 1988), it is an extension of ASCII that places new characters in the upper region of the code space. The code table supplies all the characters required in the Brahmi-based Indic scripts. Figure 8.9a shows the ISCII code table for the Devanagari script. Tables for the other scripts in Figure 8.8 are similar but contain differently shaped characters (and some entries are missing because there is no equivalent character in that script). The code table contains 56 characters, 10 digits (in the last line of Figure 8.9a), and 18 accents and combining characters. There are also three special escape codes, but we will not delve into their meaning here.

Unicode for Indic scripts

The Unicode developers adopted ISCII lock, stock, and barrel—they had to, because of their policy of round-trip compatibility with existing codes. They used different parts of the code space for the various scripts, which means that (in contrast to ISCII) documents containing multiple scripts can easily be represented. However, they also included some extra characters—about 10 of them—that in the original ISCII design were supposed to be formed from combinations of other keystrokes. Figure 8.9b shows the Unicode code table for the Devanagari script.
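One visible consequence of carrying the ISCII layout into separate blocks is that corresponding letters tend to sit at the same offset within each script's block. The sketch below relies on that property of the Unicode code charts; the function name `same_slot` and the `BLOCK_BASE` table are illustrative assumptions, and the result is only a code-chart correspondence, not a linguistically correct transliteration.

```python
import unicodedata

# Starting code point of some Indic blocks (from the Unicode code charts).
BLOCK_BASE = {
    "Devanagari": 0x0900,
    "Bengali": 0x0980,
    "Gujarati": 0x0A80,
    "Tamil": 0x0B80,
}


def same_slot(ch, target, source="Devanagari"):
    """Return the character at the same block offset in another script.

    Returns None when that slot is unassigned in the target script,
    which happens when the script has no equivalent character.
    """
    offset = ord(ch) - BLOCK_BASE[source]
    if not 0 <= offset < 0x80:
        raise ValueError("character is outside the source block")
    candidate = chr(BLOCK_BASE[target] + offset)
    # unicodedata.name() returns the default for unassigned code points.
    if unicodedata.name(candidate, "") == "":
        return None
    return candidate


if __name__ == "__main__":
    ka = "\u0915"  # DEVANAGARI LETTER KA
    for script in ("Bengali", "Gujarati", "Tamil"):
        other = same_slot(ka, script)
        print(script, f"U+{ord(other):04X}", unicodedata.name(other))
    # Tamil has no letter KHA, so that slot is empty.
    print(same_slot("\u0916", "Tamil"))
```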

Most of the extra characters give a shorthand for frequently used characters, and they differ from one language to another. An example in Devanagari is the character Om (ॐ, U+0950), a Hindu religious symbol.

Figure 8.9: Devanagari script: (a) ISCII; (b) Unicode (U+0900-U+0970); (c) code table for the Surekh font

Although Om is not part of the ISCII set, it can be created from the keyboard by typing the sequence of characters (ISCII A8 A1 E9).

The third character (ISCII E9) is a special diacritic sign called the Nukta (which phonetically represents nasalization of the preceding vowel). ISCII defines Nukta as an operator used to derive some little-used Sanskrit characters that are not otherwise available from the keyboard, such as Om. However, Unicode includes these lesser-used characters as part of the character set (U+0950 and U+0958 through U+095F).
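The relationship between these precomposed code points and the Nukta can be inspected with Python's standard `unicodedata` module. This is a minimal sketch: it prints the character name and canonical decomposition for U+0950 and U+0958 through U+095F. The helper name `decomposition` is illustrative.

```python
import unicodedata


def decomposition(ch):
    """Return the canonical (NFD) decomposition of one character."""
    return unicodedata.normalize("NFD", ch)


if __name__ == "__main__":
    for cp in [0x0950] + list(range(0x0958, 0x0960)):
        ch = chr(cp)
        parts = " ".join(f"U+{ord(c):04X}" for c in decomposition(ch))
        print(f"U+{cp:04X} {unicodedata.name(ch)} -> {parts}")
```

In the Unicode character database, U+0958 through U+095F each decompose into a base consonant followed by the Nukta sign (U+093C), whereas Om (U+0950) has no decomposition and stands alone.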

Although the Unicode solution is designed to adequately represent all the Indic scripts, it has not yet found widespread acceptance. A practical problem is that these scripts contain numerous clusters of two to four consonants without any intervening vowels, called conjuncts. Conjuncts are similar to the ligatures discussed earlier, characters represented by a single glyph whose shape differs from the shapes of the constituents. Indic scripts contain far more of these, and there is a greater variation in shape. For example, the conjunct [glyph image not available] is equivalent to the two-character combination [glyph image not available].

In this particular case, the conjunct happens to be defined as a separate code in Unicode (U+090C), just as the ligature fi has its own code (U+FB01). The problem is that this is not always the case. In the ISCII design, all conjuncts are formed by placing a special character between the constituent consonants, in accordance with the design goal of a uniform representation for input of all Indic languages on a single keyboard. In Unicode, some conjuncts are given their own code, like the one above, but others are not.
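To illustrate the "special character between consonants" approach in Unicode terms, the sketch below joins consonants with the Devanagari virama (U+094D, the halant), which is the joining sign Unicode defines for this purpose. The choice of the conjunct KA + SSA and the helper name `make_conjunct` are assumptions made for illustration; how the result is drawn depends entirely on the rendering system and font.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant)


def make_conjunct(*consonants):
    """Join consonants with the virama to request a conjunct form."""
    return VIRAMA.join(consonants)


if __name__ == "__main__":
    ksha = make_conjunct("\u0915", "\u0937")  # KA + SSA
    print(ksha, len(ksha), "code points")
    for ch in ksha:
        print(f"  U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The stored text contains three code points even though a capable renderer shows one glyph, which is exactly the gap between characters and glyphs that a one-code-per-glyph view cannot bridge.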

Problems with the adoption of Unicode

Figure 8.9c shows the code table for a particular commercially available Devanagari font, Surekh (a member of the ISFOC family of fonts). Although there is much overlap, there is certainly not a one-to-one correspondence between the Unicode and Surekh characters, as can be seen in Figures 8.9b and c. Some conjuncts, represented by two to four Unicode codes, correspond to a single glyph that does not have a separate Unicode representation but does have a corresponding entry in the font. The converse is also true: there are single glyphs in the Unicode table that are produced by generating sequences of font characters. For example, the Unicode symbol [glyph image not available] is drawn by specifying a sequence of three codes in the Surekh font.

We cannot give a more detailed explanation of why such choices have been made—it is a controversial subject, and a full discussion would require a book in itself. However, the fact is that the adoption of Unicode in India has been delayed because some people feel that it represents an uncomfortable compromise between the clear but spare design principles of ISCII and the practical requirements of actual fonts. They prefer to represent their texts in the original ISCII, because they regard it as conceptually cleaner.

The problem is compounded by the fact that today’s word processors and Web browsers take a simplistic view of fonts. In reality, combination rules are required—and were foreseen by the designers of Unicode—that take a sequence of Unicode characters and produce the corresponding single glyph from Figure 8.9c. Such rules can be embodied in the "composite fonts" that were described in Section 4.5. But ligatures in English, such as fi, have their own Unicode entry, which makes things much easier. For example, the "insert-symbol" function of word processors implements a one-to-one correspondence between Unicode codes and the glyphs on the page.
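Combination rules of this kind can be sketched as a greedy longest-match substitution over the character sequence. Everything in the rule table below is hypothetical: the glyph names are invented for illustration and do not correspond to entries in the Surekh font or any real font, and real rendering engines apply considerably richer rules.

```python
# Hypothetical rules mapping Unicode sequences to font glyph names.
RULES = {
    "\u0915\u094D\u0937": "glyph_ksha",  # KA + virama + SSA -> one conjunct
    "\u0915": "glyph_ka",
    "\u0937": "glyph_ssa",
    "\u094D": "glyph_virama",
}
MAX_RULE_LEN = max(len(key) for key in RULES)


def shape(text):
    """Convert a string to glyph names, preferring the longest rule."""
    glyphs = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so conjuncts win over parts.
        for n in range(min(MAX_RULE_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in RULES:
                glyphs.append(RULES[chunk])
                i += n
                break
        else:
            # No rule matched: fall back to a per-code-point placeholder.
            glyphs.append(f"U+{ord(text[i]):04X}")
            i += 1
    return glyphs


if __name__ == "__main__":
    print(shape("\u0915\u094D\u0937"))
    print(shape("\u0915\u0937"))
```

The fallback branch corresponds to the simplistic one-code-one-glyph view; the multi-character keys are what a plain glyph table lacks.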

The upshot is that languages like Hindi are not properly supported by current Unicode-compliant applications. A table of glyphs, one for each Unicode value, is insufficient to depict text in Hindi script. To make matters worse, in practice some Hindi documents are represented using ISCII while others are represented using raw font codes like that of Figure 8.9c, which are specific to the particular font manufacturer. Different practices have grown up for different scripts. For example, the majority of documents in the Kannada language on the Web seem to be represented using ISCII codes, whereas in the Malayalam language, diverse font-specific codes are used. Often, to read a new Malayalam newspaper in your Web browser you have to download a new font!

To accommodate Indic documents in a digital library that represents documents internally in Unicode, it is necessary to implement several mappings:

• from ISCII to Unicode, so that ISCII documents can be incorporated;

• from various different font representations (such as ISFOC, used for the Surekh font) to Unicode, so that documents in other formats can be accommodated;

• from Unicode to various different font representations (such as ISFOC), so that the documents can be displayed on computer systems with different fonts.

The first is a simple transliteration, because Unicode was designed for round-trip compatibility. However, both of the other mappings involve translating sequences of codes in one space into corresponding sequences in the other space (although all sequences involved are very short). Figure 8.10 shows a page produced by such a scheme.
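The first mapping can be sketched as a per-byte table with an inverse for the return trip. The handful of byte values below are assumptions taken from the commonly documented ISCII Devanagari layout (only E9 as the Nukta is stated in the text above) and should be checked against the standard before use; a real table would cover the whole upper half of the code space. The font mappings would additionally need multi-code keys on both sides, which this sketch does not attempt.

```python
# Partial, illustrative ISCII (Devanagari) to Unicode table.
# Byte values other than 0xE9 are assumptions to verify against the standard.
ISCII_TO_UNICODE = {
    0xA1: "\u0901",  # candrabindu
    0xA4: "\u0905",  # letter A
    0xB3: "\u0915",  # letter KA
    0xE8: "\u094D",  # halant (virama)
    0xE9: "\u093C",  # nukta
}
UNICODE_TO_ISCII = {u: b for b, u in ISCII_TO_UNICODE.items()}


def iscii_to_unicode(data):
    """Decode ISCII bytes; the lower half is plain ASCII."""
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))
        elif b in ISCII_TO_UNICODE:
            out.append(ISCII_TO_UNICODE[b])
        else:
            raise ValueError(f"no mapping for ISCII byte 0x{b:02X}")
    return "".join(out)


def unicode_to_iscii(text):
    """Encode text back to ISCII bytes using the inverse table."""
    out = bytearray()
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ord(ch))
        elif ch in UNICODE_TO_ISCII:
            out.append(UNICODE_TO_ISCII[ch])
        else:
            raise ValueError(f"no ISCII mapping for U+{ord(ch):04X}")
    return bytes(out)


if __name__ == "__main__":
    raw = bytes([0xB3, 0xE8, 0xB3, 0x20, 0x41])
    text = iscii_to_unicode(raw)
    print(text, unicode_to_iscii(text) == raw)
```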

Figure 8.10: Page produced by a digital library in Devanagari script
