Character Encodings - Beginning Java 8 Fundamentals


Appendix A

Character Encodings

A character is the basic unit of a writing system, for example, a letter of the English alphabet or an ideograph of an ideographic writing system such as Chinese or Japanese. In written form, a character is identified by its shape, also known as a glyph. Identifying a character by its shape is not precise, however, because it depends on many factors. For example, a hyphen is read as a minus sign in a mathematical expression, and some Greek and Latin letters have the same shape but are considered different characters in the two scripts. Computers understand only numbers or, more precisely, only the bits 0 and 1. With the advent of computers, it therefore became necessary to convert characters into codes (bit combinations) in the computer's memory so that text (a sequence of characters) could be stored and reproduced. However, different computers may represent different characters with the same bit combination, which can lead to misinterpretation of text stored by one computer system and reproduced by another. For correct exchange of information between two computer systems, each system must understand unambiguously the coded form of the characters, represented as bit combinations, produced by the other. Before we begin our discussion of some widely used character encodings, it is necessary to understand some commonly used terms.
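The misinterpretation problem described above can be reproduced in a few lines of Java. The following is a minimal sketch, not part of the original text: the class name EncodingMismatchDemo and its helper methods are hypothetical, and the choice of UTF-8 and ISO-8859-1 as the two "computer systems" is only an assumption made for illustration. It stores one character as bytes under one encoding and reads the same bytes back under another.

```java
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {
    // Stores the text as UTF-8 bytes, then reads the same bytes back as ISO-8859-1.
    static String misread(String text) {
        byte[] stored = text.getBytes(StandardCharsets.UTF_8);
        return new String(stored, StandardCharsets.ISO_8859_1);
    }

    // Stores and reads the text with the same encoding, so it survives unchanged.
    static String roundTrip(String text) {
        byte[] stored = text.getBytes(StandardCharsets.UTF_8);
        return new String(stored, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00E9, Latin small letter e with acute; UTF-8 stores it as two bytes.
        String original = "\u00E9";

        System.out.println(original.equals(roundTrip(original))); // true
        System.out.println(original.equals(misread(original)));   // false
        System.out.println(misread(original).length());           // 2
    }
}
```

The bytes in memory are identical in both cases. Only the interpretation differs, which is why both systems must agree on the encoding.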

• An abstract character is a unit of textual information, for example, Latin capital letter A ('A').

• A character repertoire is the set of characters to be encoded. A character repertoire can be fixed or open. In a fixed character repertoire, once the set of characters to be encoded is decided, it is never changed. The ASCII and POSIX portable character repertoires are examples of fixed character repertoires. In an open character repertoire, a new character may be added at any time. The Unicode and Windows Western European repertoires are examples of open character repertoires. The euro currency sign and the Indian rupee sign could be added to Unicode because it is an open repertoire.

• A coded character set is a mapping from a set of non-negative integers (known individually as code positions, code points, code values, or character numbers, and collectively as the code space) to a set of abstract characters. The integer that maps to a character is called the code point for that character, and the character is called an encoded character. A coded character set is also called a character encoding, coded character repertoire, character set definition, or code page. Figure A-1 depicts two different coded character sets; both have the same character repertoire, which is the set of three characters (A, B, and C), and the same set of code points, which is the set of three non-negative integers (1, 2, and 3).
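The idea of two coded character sets that share a repertoire and a set of code points can be sketched in Java as two maps. This is an illustrative sketch only: Figure A-1 is not reproduced here, so the two mappings below are assumptions, and the class name CodedCharacterSetDemo and its methods are hypothetical. The last lines of main use the real Unicode coded character set that Java itself relies on.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CodedCharacterSetDemo {
    // Hypothetical coded character set 1: 1 -> A, 2 -> B, 3 -> C.
    static Map<Integer, Character> setOne() {
        Map<Integer, Character> m = new LinkedHashMap<>();
        m.put(1, 'A');
        m.put(2, 'B');
        m.put(3, 'C');
        return m;
    }

    // Hypothetical coded character set 2: same repertoire and same
    // code points, but a different mapping: 1 -> C, 2 -> A, 3 -> B.
    static Map<Integer, Character> setTwo() {
        Map<Integer, Character> m = new LinkedHashMap<>();
        m.put(1, 'C');
        m.put(2, 'A');
        m.put(3, 'B');
        return m;
    }

    // Translates a sequence of code points into text using the given coded character set.
    static String decode(int[] codePoints, Map<Integer, Character> codedSet) {
        StringBuilder sb = new StringBuilder();
        for (int cp : codePoints) {
            sb.append(codedSet.get(cp));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] stored = {1, 2, 3};
        System.out.println(decode(stored, setOne())); // ABC
        System.out.println(decode(stored, setTwo())); // CAB

        // In Unicode, the code point for Latin capital letter A is 65 (U+0041).
        System.out.println("A".codePointAt(0)); // 65

        // Going the other way: code point U+20AC maps to the euro sign.
        String euro = new String(Character.toChars(0x20AC));
        System.out.println(euro.codePointAt(0) == 0x20AC); // true
    }
}
```

The same stored integers yield different text under the two mappings, which is why a repertoire and a set of code points alone do not define a coded character set; the mapping between them does.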
