Character Encodings - Beginning Java 8 Fundamentals

8-bit Character Sets

The ASCII character set worked fine for the English language. Representing the alphabets of other languages, for example, French and German, led to the development of 8-bit character sets. An 8-bit character set defines 2^8 (or 256) character positions whose numeric values range from 0 to 255. The bit combinations for an 8-bit character set range from 00000000 to 11111111. An 8-bit character set is divided into two parts. The first part (positions 0-127) contains the same characters as the ASCII character set. The second part (positions 128-255) introduces 128 new positions. The first 32 positions in the second part are reserved for control characters. Therefore, there are two control character areas in an 8-bit character set: 0-31 and 128-159. Since the SPACE and DELETE characters are already defined in the first part, all 96 remaining positions of the second part (160-255) are available for printing characters, so an 8-bit character set can accommodate 191 printing characters (95 + 96), including SPACE. ISO Latin-1 is one example of an 8-bit character set.
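The layout described above can be checked from Java. The following is a minimal sketch, assuming ISO Latin-1 corresponds to Java's `StandardCharsets.ISO_8859_1`; the class name `Latin1Layout` and its helper methods are hypothetical names chosen for this illustration. It decodes single byte values and uses `Character.isISOControl` to confirm the two control character areas.

```java
import java.nio.charset.StandardCharsets;

class Latin1Layout {
    // Decodes one byte value (0-255) as ISO Latin-1 and returns the resulting character value.
    static int decodeLatin1(int value) {
        byte[] one = { (byte) value };
        return new String(one, StandardCharsets.ISO_8859_1).charAt(0);
    }

    // Counts the control characters among the byte values in the inclusive range [from, to].
    static int countControls(int from, int to) {
        int count = 0;
        for (int v = from; v <= to; v++) {
            if (Character.isISOControl(decodeLatin1(v))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // First control area: positions 0-31
        System.out.println("Controls in 0-31:    " + countControls(0, 31));     // 32
        // Second control area: positions 128-159
        System.out.println("Controls in 128-159: " + countControls(128, 159)); // 32
        // DELETE sits at position 127 in the first part
        System.out.println("127 is a control:    "
                + Character.isISOControl(decodeLatin1(127)));                  // true
        // A printing character from the second part (e with acute accent)
        System.out.println(String.format("Byte E9 decodes to U+%04X", decodeLatin1(0xE9)));
    }
}
```

Each byte decodes to a character with the same numeric value, which is why the sketch can compare positions and character values directly.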

Even an 8-bit character set is not large enough to accommodate the alphabets of all the languages in the world. This led to the development of a bigger (maybe the biggest) character set, which is known as the Universal Character Set (UCS).

Universal Multiple-Octet Coded Character Set (UCS)

The Universal Multiple-Octet Coded Character Set, simply known as UCS, is intended to provide a single coded character set for the encoding of written forms of all the languages of the world and of a wide range of additional symbols that may be used in conjunction with such languages. It is intended to cover not only languages in current use, but also languages of the past and such additions as may be required in the future. The UCS uses a 4-octet (1 octet is 8 bits) structure to represent a character. However, the most significant bit of the most significant octet is constrained to be 0, which permits its use for private internal purposes in a data processing system. The remaining 31 bits allow us to represent more than two thousand million (2^31 = 2,147,483,648) characters. The four octets are named as follows:

• The Group-Octet, or G
• The Plane-Octet, or P
• The Row-Octet, or R
• The Cell-Octet, or C

G is the most significant octet and C is the least significant octet. So, the whole code range for UCS is viewed as a four-dimensional structure composed of:

• 128 groups
• 256 planes in each group
• 256 rows in each plane
• 256 cells in each row
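The four-octet structure can be illustrated with a few bit operations. The following is a minimal sketch, assuming a UCS code value is held in a Java `int` with the most significant bit set to 0; the class `UcsOctets` and its methods are hypothetical names used only for this illustration.

```java
class UcsOctets {
    // Splits a 31-bit code value into its four octets, returned as {G, P, R, C}.
    static int[] split(int code) {
        if (code < 0) {
            // A negative int has its most significant bit set, which the structure does not allow.
            throw new IllegalArgumentException("most significant bit must be 0");
        }
        int g = (code >>> 24) & 0x7F; // Group-Octet: only 7 bits are usable (00-7F)
        int p = (code >>> 16) & 0xFF; // Plane-Octet
        int r = (code >>> 8) & 0xFF;  // Row-Octet
        int c = code & 0xFF;          // Cell-Octet
        return new int[] { g, p, r, c };
    }

    // Formats the octets as two hexadecimal digits each.
    static String format(int code) {
        int[] o = split(code);
        return String.format("G=%02X P=%02X R=%02X C=%02X", o[0], o[1], o[2], o[3]);
    }

    // Total number of code positions: 128 groups x 256 planes x 256 rows x 256 cells.
    static long totalPositions() {
        return 128L * 256 * 256 * 256;
    }

    public static void main(String[] args) {
        System.out.println(format(0x00000041)); // G=00 P=00 R=00 C=41
        System.out.println(format(0x00004E00)); // G=00 P=00 R=4E C=00
        System.out.println(totalPositions());   // 2147483648
    }
}
```

The product of the four dimensions equals 2^31, which matches the 31 usable bits of the 4-octet structure.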

Two hexadecimal digits (0-9, A-F) specify the value of any octet. The values of G are restricted to the range 00-7F. The plane with G=00 and P=00 is known as the Basic Multilingual Plane (BMP). The row of the BMP with R=00 represents the same set of characters as 8-bit ISO Latin-1. Therefore, the first 128 characters of ASCII, ISO Latin-1, and the BMP row with R=00 match, and the 129th to 256th characters of ISO Latin-1 match those of the BMP row with R=00. This makes UCS compatible with the existing 7-bit ASCII and 8-bit ISO Latin-1. Further, the BMP has been divided into five zones.

• A-zone: It is used for alphabetic and symbolic scripts together with various symbols. The code positions available for the A-zone range from 0000 to 4DFF. The code positions 0000-001F and 0080-009F are reserved for control characters. The code position 007F is reserved for the DELETE character. Thus, it has 19903 code positions available for graphic characters.
• I-zone: It is used for Chinese/Japanese/Korean (CJK) unified ideographs. Its range is 4E00-9FFF, so 20992 code positions are available in this zone.
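The position counts quoted for the A-zone and I-zone follow from simple range arithmetic. The following is a minimal sketch that reproduces them and classifies a code position into one of these two zones; the class `BmpZones`, its constants, and its methods are hypothetical names used only for this illustration, and any position outside the two ranges is simply reported as "other".

```java
class BmpZones {
    static final int A_ZONE_START = 0x0000;
    static final int A_ZONE_END = 0x4DFF;
    static final int I_ZONE_START = 0x4E00;
    static final int I_ZONE_END = 0x9FFF;

    // Returns the zone name for a BMP code position, or "other" if it is in neither zone.
    static String zoneOf(int code) {
        if (code >= A_ZONE_START && code <= A_ZONE_END) {
            return "A-zone";
        }
        if (code >= I_ZONE_START && code <= I_ZONE_END) {
            return "I-zone";
        }
        return "other";
    }

    // A-zone positions left for graphic characters after removing the reserved ones.
    static int aZoneGraphicPositions() {
        int total = A_ZONE_END - A_ZONE_START + 1; // 19968 positions in 0000-4DFF
        int controls = 32 + 32;                    // 0000-001F and 0080-009F
        int delete = 1;                            // 007F
        return total - controls - delete;
    }

    // All positions in 4E00-9FFF.
    static int iZonePositions() {
        return I_ZONE_END - I_ZONE_START + 1;
    }

    public static void main(String[] args) {
        System.out.println(aZoneGraphicPositions()); // 19903
        System.out.println(iZonePositions());        // 20992
        System.out.println(zoneOf(0x0041));          // A-zone
        System.out.println(zoneOf(0x4E2D));          // I-zone
    }
}
```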
