Appendix A
Character Encodings
A character is the basic unit of a writing system, for example, a letter of the English alphabet or an ideograph of an
ideographic writing system such as Chinese or Japanese. In written form, a character is identified
by its shape, also known as a glyph. However, identifying a character by its shape is not precise; it depends on
context. For example, a hyphen is read as a minus sign in a mathematical expression, and some Greek and Latin letters
have the same shape but are considered different characters in the two scripts. Computers understand only
numbers, or more precisely, only the bits 0 and 1. With the advent of computers, it therefore became necessary to convert
characters into codes (bit combinations) in the computer's memory so that text (a sequence of characters)
could be stored and reproduced. However, different computers may represent different characters with the same
bit combination, which may lead to misinterpretation of text stored by one computer system and reproduced by
another. For information to be exchanged correctly between two computer systems, each
system must understand unambiguously the coded form of the characters, represented as bit combinations,
produced by the other. Before we begin our discussion of some widely used character
encodings, it is necessary to understand some commonly used terms.
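The misinterpretation described above can be reproduced in a few lines of Java. The following is a minimal sketch, not part of the original text: the class name EncodingMismatch and the method storeAndReproduce are hypothetical, and the choice of UTF-8 and ISO-8859-1 as the two encodings is an assumption made only for illustration. It stores a string as bytes using one encoding and reads the same bytes back using another.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatch {

    // Stores text as bytes using one encoding, then reads the same bytes
    // back using a (possibly different) encoding.
    static String storeAndReproduce(String text, Charset writer, Charset reader) {
        byte[] stored = text.getBytes(writer);
        return new String(stored, reader);
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the final e; the Unicode escape
        // keeps this source file independent of its own encoding.
        String original = "caf\u00e9";

        String same = storeAndReproduce(original,
                StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String mixed = storeAndReproduce(original,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println(same.equals(original));  // true
        System.out.println(mixed.equals(original)); // false
        System.out.println(mixed.length());         // 5
    }
}
```

When both sides agree on the encoding, the text survives the round trip. When they disagree, the two bytes that UTF-8 uses for the accented letter are read as two separate ISO-8859-1 characters, so the reproduced text has five characters instead of four.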
An abstract character is a unit of textual information, for example, Latin capital letter A ('A').
A character repertoire is defined as the set of characters to be encoded. A character repertoire
can be fixed or open. In a fixed character repertoire, once the set of characters to be encoded is
decided, it is never changed. ASCII and the POSIX portable character repertoire are examples of
fixed character repertoires. In an open character repertoire, a new character may be added at any
time. The Unicode and Windows Western European repertoires are examples of open character
repertoires. The EURO currency sign and the Indian RUPEE sign could be added to Unicode because
it is an open repertoire.
A coded character set is defined as a mapping from a set of non-negative integers (known
individually as code positions, code points, code values, or character numbers, and collectively
as the code space) to a set of abstract characters. The integer that maps to a character is called
the code point for that character, and the character is called an encoded character. A coded
character set is also called a character encoding, coded character repertoire, character set
definition, or code page.
Figure A-1 depicts two different coded character sets. Both have the same character
repertoire, which is the set of three characters (A, B, and C), and the same code points, which are
the three non-negative integers (1, 2, and 3). They differ only in which code point maps to which character.
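The idea of two coded character sets sharing a repertoire and a set of code points can be sketched in Java as two maps. This is an illustration only: the class CodedCharacterSets, the constants SET_ONE and SET_TWO, and the decode method are hypothetical names, and the two specific mappings are assumptions chosen for the example rather than the ones drawn in the figure. Map.of requires Java 9 or later.

```java
import java.util.Map;

public class CodedCharacterSets {

    // Assumed mapping for the first coded character set: 1 -> A, 2 -> B, 3 -> C.
    static final Map<Integer, Character> SET_ONE = Map.of(1, 'A', 2, 'B', 3, 'C');

    // Assumed mapping for the second coded character set: 1 -> C, 2 -> A, 3 -> B.
    static final Map<Integer, Character> SET_TWO = Map.of(1, 'C', 2, 'A', 3, 'B');

    // Translates a sequence of code points into text using the given coded character set.
    static String decode(int[] codePoints, Map<Integer, Character> codedSet) {
        StringBuilder text = new StringBuilder();
        for (int codePoint : codePoints) {
            Character ch = codedSet.get(codePoint);
            if (ch == null) {
                throw new IllegalArgumentException("Unassigned code point: " + codePoint);
            }
            text.append(ch.charValue());
        }
        return text.toString();
    }

    public static void main(String[] args) {
        int[] stored = {1, 2, 3};
        System.out.println(decode(stored, SET_ONE)); // prints ABC
        System.out.println(decode(stored, SET_TWO)); // prints CAB
    }
}
```

The same stored code points yield different text under the two mappings, which is why both parties in an exchange need to agree on the coded character set and not merely on the repertoire.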
 