Java Reference
In-Depth Information
8-bit Character Sets
The ASCII character set worked fine for the English language. Representing the alphabets from other languages,
for example, French and German, led to the development of an 8-bit character set. An 8-bit character set defines 2^8
(or 256) character positions whose numeric values range from 0 to 255. The bit combination for an 8-bit character
set ranges from 00000000 to 11111111. The 8-bit character set is divided into two parts. The first part contains
the same 128 characters as the ASCII character set. The second part introduces 128 new characters. The first
32 positions in the second part are reserved for control characters. Therefore, there are two control character areas
in an 8-bit character set: 0-31 and 128-159. Since the SPACE and DELETE characters are already defined in the first part,
an 8-bit character set can accommodate 191 printing characters (95 + 96), including SPACE. ISO Latin-1 is one
example of an 8-bit character set.
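The layout described above can be checked with a short Java sketch. It is only an illustration: the class and method names are hypothetical, and it assumes that Character.isISOControl, which reports the ranges 0-31 and 127-159 as control codes, matches the control areas and the DELETE position described here.

```java
class EightBitLayout {
    // Counts code positions in [from, to] that are not control characters.
    // Character.isISOControl treats 0x00-0x1F and 0x7F-0x9F as control codes,
    // so DELETE (127) is excluded along with both control areas.
    static int printingPositions(int from, int to) {
        int count = 0;
        for (int code = from; code <= to; code++) {
            if (!Character.isISOControl(code)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int lower = printingPositions(0, 127);   // first part (same as ASCII)
        int upper = printingPositions(128, 255); // second part (128 new positions)
        System.out.println("First part printing positions:  " + lower);
        System.out.println("Second part printing positions: " + upper);
        System.out.println("Total printing positions:       " + (lower + upper));
    }
}
```

Running the main method prints the number of printing positions in each part and their total, with SPACE counted in the first part.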
Even an 8-bit character set is not large enough to accommodate the alphabets of all the languages in the
world. This led to the development of a bigger (maybe the biggest) character set, which is known as the Universal
Character Set (UCS).
Universal Multiple-Octet Coded Character Set (UCS)
The Universal Multiple-Octet Coded Character Set, simply known as UCS, is intended to provide a single coded
character set for the encoding of written forms of all the languages of the world and of a wide range of additional
symbols that may be used in conjunction with such languages. It is intended not only to cover languages in current
use, but also languages of the past and such additions as may be required in the future. The UCS uses a 4-octet
(1 octet is 8 bits) structure to represent a character. However, the most significant bit of the most significant octet is
constrained to be 0, which permits its use for private internal purposes in a data processing system. The remaining 31
bits allow us to represent more than two thousand million characters. The four octets are named as follows:
The Group-Octet, or G
The Plane-Octet, or P
The Row-Octet, or R
The Cell-Octet, or C
G is the most significant octet and C is the least significant octet. So, the whole code range for UCS is viewed as a
four-dimensional structure composed of
128 groups
256 planes in each group
256 rows in each plane
256 cells in each row
Two hexadecimal digits (0-9, A-F) specify the values of any octet. The values of G are restricted to the range 00-7F.
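The four-octet structure can be illustrated with a minimal Java sketch that splits a 31-bit code value into its G, P, R, and C octets. The class and method names are hypothetical and chosen only for this example. The sketch assumes the code value is held in an int and ignores the most significant bit, as described above.

```java
class UcsOctets {
    // G: most significant octet; the top bit is masked off, so G stays in 00-7F.
    static int group(int code) { return (code >>> 24) & 0x7F; }

    // P, R, and C: the remaining three octets, each in the range 00-FF.
    static int plane(int code) { return (code >>> 16) & 0xFF; }
    static int row(int code)   { return (code >>> 8) & 0xFF; }
    static int cell(int code)  { return code & 0xFF; }

    // Formats each octet as two hexadecimal digits.
    static String describe(int code) {
        return String.format("G=%02X P=%02X R=%02X C=%02X",
                group(code), plane(code), row(code), cell(code));
    }

    // 128 groups x 256 planes x 256 rows x 256 cells.
    static long totalPositions() {
        return 128L * 256 * 256 * 256;
    }

    public static void main(String[] args) {
        System.out.println(describe(0x00000041)); // G=00 P=00 R=00 C=41
        System.out.println(describe(0x00004E00)); // G=00 P=00 R=4E C=00
        System.out.println(totalPositions());     // 2147483648
    }
}
```

The total of 2147483648 positions equals 2^31, which is the "more than two thousand million" figure that the 31 usable bits allow.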
The plane with G=00 and P=00 is known as the Basic Multilingual Plane (BMP). The row of the BMP with R=00 represents
the same set of characters as 8-bit ISO Latin-1. Therefore, the first 128 characters of ASCII, ISO Latin-1, and the BMP row with
R=00 match, and the 129th to 256th characters of ISO Latin-1 match those of the BMP row with R=00. This makes UCS compatible
with the existing 7-bit ASCII and 8-bit ISO Latin-1. Further, the BMP has been divided into five zones.
A-zone: It is used for alphabetic and symbolic scripts together with various symbols. The code
positions available for the A-zone range from 0000 to 4DFF. The code positions 0000-001F and
0080-009F are reserved for control characters. The code position 007F is reserved for the
DELETE character. Thus, it has 19903 code positions available for graphic characters.
I-zone: It is used for Chinese/Japanese/Korean (CJK) unified ideographs. Its range is
4E00-9FFF, so 20992 code positions are available in this zone.
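The code position counts quoted for the A-zone and I-zone follow from simple range arithmetic, sketched below in Java. The class and method names are hypothetical. The ranges are taken directly from the zone descriptions above.

```java
class BmpZoneSizes {
    // Number of code positions in the inclusive range [first, last].
    static int positions(int first, int last) {
        return last - first + 1;
    }

    // A-zone (0000-4DFF) minus the two control areas and DELETE.
    static int aZoneGraphicPositions() {
        int total = positions(0x0000, 0x4DFF);         // 19968
        int firstControls = positions(0x0000, 0x001F); // 32
        int secondControls = positions(0x0080, 0x009F); // 32
        int delete = 1;                                // 007F
        return total - firstControls - secondControls - delete;
    }

    // I-zone (4E00-9FFF) has no reserved positions in this description.
    static int iZonePositions() {
        return positions(0x4E00, 0x9FFF);
    }

    public static void main(String[] args) {
        System.out.println(aZoneGraphicPositions()); // 19903
        System.out.println(iZonePositions());        // 20992
    }
}
```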
 