C: The Unicode Character Set
The Java programming language uses the Unicode character set for managing
text. A character set is simply an ordered list of characters, each corresponding to
a particular numeric value. Unicode is an international character set that contains
letters, symbols, and ideograms for languages all over the world. Java's char type
stores each character as a 16-bit unsigned numeric value, so a single char can take
65,536 distinct values. When Java was designed, that range covered all of Unicode,
and only about half of the values had characters assigned to them. Unicode has since
grown beyond 16 bits, and Java represents the additional characters as pairs of
char values known as surrogate pairs. The Unicode character set continues to be
refined as characters from various languages are included.
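As a rough illustration of a character set as an ordered list of numeric values, the following sketch converts between char and int values in Java. The class and method names (CharValueDemo, codeOf, charFor) are made up for this example and are not part of any library.

```java
class CharValueDemo {
    // Returns the numeric value of a character (char widens to int implicitly).
    static int codeOf(char c) {
        return c;
    }

    // Returns the character stored at a given numeric value.
    // Assumes the value fits in 16 bits (0 through 65535).
    static char charFor(int value) {
        return (char) value;
    }

    public static void main(String[] args) {
        System.out.println(codeOf('A'));                 // numeric value of 'A'
        System.out.println(charFor(66));                 // character with value 66
        System.out.println(codeOf(Character.MAX_VALUE)); // largest 16-bit char value
        System.out.println('a' < 'b');                   // characters are ordered by value
    }
}
```

Because a char is an unsigned 16-bit number, ordinary comparison operators such as < order characters by their numeric values.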
Many programming languages still use the ASCII character set. ASCII stands
for the American Standard Code for Information Interchange. ASCII itself is a
7-bit code with only 128 characters, and even the 8-bit extended ASCII sets are
limited to 256, so the developers of Java opted to use Unicode in order to
support international users. However, ASCII is a subset of Unicode: the first
128 Unicode characters are the ASCII characters, with the same numeric values,
so programmers used to ASCII should have no problems with Unicode.
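The overlap between ASCII and Unicode can be checked directly. The sketch below encodes a string as ASCII bytes and compares each byte with the numeric value of the corresponding Java char. AsciiSubsetDemo and matchesAscii are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;

class AsciiSubsetDemo {
    // Returns true if every char in the text has the same numeric value
    // as its ASCII encoding. Characters outside ASCII are replaced during
    // encoding, so they fail the comparison.
    static boolean matchesAscii(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        if (ascii.length != text.length()) {
            return false;
        }
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) != ascii[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matchesAscii("Hello, Java!")); // all ASCII characters
        System.out.println(matchesAscii("caf\u00e9"));    // contains a non-ASCII letter
    }
}
```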
Figure C.1 shows a list of commonly used characters and their Unicode
numeric values. These characters also happen to be ASCII characters. All of the
characters in Figure C.1 are called printable characters because they have a
symbolic representation that can be displayed on a monitor or printed by a printer.
Other characters are called nonprintable characters because they have no such
symbolic representation. Note that the space character (numeric value 32) is
considered a printable character, even though no symbol is printed when it is
displayed. Nonprintable characters are sometimes called control characters because
many of them can be generated by holding down the control key on a keyboard
and pressing another key.
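One way to test for control characters in Java is the standard Character.isISOControl method, sketched below. This is an illustration rather than a definition of "nonprintable": isISOControl also reports the values 128 through 159 as control characters, which goes beyond the ASCII range discussed here. ControlCharDemo and isControl are names invented for the example.

```java
class ControlCharDemo {
    // Wraps the standard library check for ISO control characters.
    static boolean isControl(char c) {
        return Character.isISOControl(c);
    }

    public static void main(String[] args) {
        System.out.println(isControl('\t'));       // tab, numeric value 9
        System.out.println(isControl((char) 127)); // delete
        System.out.println(isControl(' '));        // space, numeric value 32
        System.out.println(isControl('A'));        // an ordinary letter
    }
}
```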
The Unicode characters with numeric values 0 through 31 are nonprintable
characters. Also, the delete character, with numeric value 127, is a nonprintable
character.