Hardware Reference
In-Depth Information
between ASCII and Unicode easy. To avoid wasting code points, each diacritical
mark has its own code point. It is up to software to combine diacritical marks with
their neighbors to form new characters. While this puts more work on the
software, it saves precious code points.
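As a minimal sketch of what "combining" means for software, the Python snippet below builds an accented letter from a base letter plus a separate diacritical mark and then asks the standard unicodedata module to combine them. NFC normalization is used here as one convenient way to do the combining; it is an illustration, not something the text above prescribes.

```python
import unicodedata

# "e" followed by U+0301 (combining acute accent): two code points,
# but a renderer shows them as a single accented letter.
decomposed = "e\u0301"

# NFC normalization merges a base letter with its marks whenever a
# precomposed character exists for that combination.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))          # 2 1
print(unicodedata.name(decomposed[1]))         # COMBINING ACUTE ACCENT
print(unicodedata.combining(decomposed[1]) != 0)  # True: it attaches to a neighbor
```

The mark on its own never stands alone on the page; its nonzero combining class tells software that it must be attached to the preceding character.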
The code point space is divided up into blocks, each one a multiple of 16 code
points. Each major alphabet in Unicode has a sequence of consecutive zones.
Some examples (and the number of code points allocated) are Latin (336), Greek
(144), Cyrillic (256), Armenian (96), Hebrew (112), Devanagari (128), Gurmukhi
(128), Oriya (128), Telugu (128), and Kannada (128). Note that each of these
alphabets has been allocated more code points than it has letters. This choice was
made in part because many languages have multiple forms for each letter. For
example, each letter in English has two forms: lowercase and UPPERCASE. Some
languages have three or more forms, possibly depending on whether the letter is at
the start, middle, or end of a word.
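The block arithmetic can be checked with a short Python sketch. The ranges below are the block boundaries as I understand them from the current Unicode code charts (they are not given in the text above), and block_size is a hypothetical helper introduced only for this illustration.

```python
# Inclusive (first, last) code points for a few blocks.
# Assumption: boundaries taken from the current Unicode code charts.
BLOCKS = {
    "Greek and Coptic": (0x0370, 0x03FF),
    "Cyrillic": (0x0400, 0x04FF),
    "Armenian": (0x0530, 0x058F),
    "Hebrew": (0x0590, 0x05FF),
    "Devanagari": (0x0900, 0x097F),
}

def block_size(first, last):
    """Number of code points in an inclusive range."""
    return last - first + 1

sizes = {name: block_size(first, last) for name, (first, last) in BLOCKS.items()}

for name, size in sizes.items():
    # Every block size should be a multiple of 16.
    print(f"{name}: {size} code points, multiple of 16: {size % 16 == 0}")
```

With these ranges the sizes come out as 144, 256, 96, 112, and 128, matching the allocations quoted above for Greek, Cyrillic, Armenian, Hebrew, and Devanagari.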
In addition to these alphabets, code points have been allocated for diacritical
marks (112), punctuation marks (112), subscripts and superscripts (48), currency
symbols (48), math symbols (256), geometric shapes (96), and dingbats (192).
After these come the symbols needed for Chinese, Japanese, and Korean. First
are 1024 phonetic symbols (e.g., katakana and bopomofo) and then the unified Han
ideographs (20,992) used in Chinese and Japanese, and the Korean Hangul
syllables (11,156).
To allow users to invent special characters for special purposes, 6400 code
points have been allocated for local use.
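A quick way to see the local-use area from a program is sketched below. It assumes the range U+E000 through U+F8FF, which is the Private Use Area in the current standard (the text above gives only the count), and relies on Python's unicodedata module reporting the general category "Co" for private-use code points.

```python
import unicodedata

# Assumption: the local-use area is U+E000..U+F8FF (Private Use Area).
PRIVATE_USE_FIRST = 0xE000
PRIVATE_USE_LAST = 0xF8FF

count = PRIVATE_USE_LAST - PRIVATE_USE_FIRST + 1

# "Co" is the general category for private-use code points.
all_private = all(
    unicodedata.category(chr(cp)) == "Co"
    for cp in range(PRIVATE_USE_FIRST, PRIVATE_USE_LAST + 1)
)

print(count, all_private)  # 6400 True
```

Because these code points carry no standard meaning, two systems can exchange them only if they have agreed privately on what each one stands for.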
While Unicode solves many problems associated with internationalization, it
does not (attempt to) solve all the world's problems. For example, while the Latin
alphabet is in order, the Han ideographs are not in dictionary order. As a
consequence, an English program can examine "cat" and "dog" and sort them
alphabetically by simply comparing the Unicode value of their first character. A Japanese
program needs external tables to figure out which of two symbols comes before the
other in the dictionary.
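The difference can be sketched in Python, whose default string comparison goes code point by code point. The READINGS table is a tiny, hypothetical stand-in for the external tables mentioned above: it maps two kanji to one common kana reading each, whereas real dictionary ordering has to cope with multiple readings and other rules.

```python
# English: plain code point comparison already gives alphabetical order.
print(sorted(["dog", "cat"]))  # ['cat', 'dog']

# Japanese: two ideographs and their code points.
kanji = ["山", "川"]  # "mountain" (U+5C71) and "river" (U+5DDD)
by_code_point = sorted(kanji)

# Hypothetical external table: one kana reading per ideograph.
# Assumption: a single reading is enough for this illustration.
READINGS = {"山": "やま", "川": "かわ"}  # yama, kawa

# Sorting by reading uses kana order, where "ka" precedes "ya".
by_reading = sorted(kanji, key=READINGS.get)

print(by_code_point)  # ['山', '川']
print(by_reading)     # ['川', '山']
```

The two orders disagree, which is exactly why the code point value alone cannot tell a Japanese program which symbol comes first in the dictionary.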
Another issue is that new words are popping up all the time. Fifty years ago
nobody talked about apps, chatrooms, cyberspace, emoticons, gigabytes, lasers,
modems, smileys, or videotapes. Adding new words in English does not require
new code points. Adding them in Japanese does. In addition to new technical
words, there is a demand for adding at least 20,000 new (mostly Chinese) personal
and place names. Blind people think Braille should be in there, and special interest
groups of all kinds want what they perceive as their rightful code points. The
Unicode Consortium reviews and decides on all new proposals.
Unicode uses the same code point for characters that look almost identical but
have different meanings or are written slightly differently in Japanese and Chinese
(as though English word processors always spelled "blue" as "blew" because they
sound the same). Some people view this as an optimization to save scarce code
points; others see it as Anglo-Saxon cultural imperialism (and you thought