COMPUTER SYSTEMS ORGANIZATION - Structured Computer Organization


assigning 16-bit numbers to characters was not highly political?). To make matters worse, a full Japanese dictionary has 50,000 kanji (excluding names), so with only 20,992 code points available for the Han ideographs, choices had to be made. Not all Japanese people think that a consortium of computer companies, even if a few of them are Japanese, is the ideal forum to make these choices.

Guess what? 65,536 code points was not enough to satisfy everyone, so in 1996 an additional 16 planes of 65,536 (2^16) code points each were added, expanding the total number of code points to 1,114,112.
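The total follows directly from the plane count. A minimal sketch of the arithmetic in Python; the variable names are illustrative and do not come from the text.

```python
# Original 16-bit Unicode: a single plane of 2**16 code points.
POINTS_PER_PLANE = 2 ** 16           # 65,536

# The 1996 extension added 16 more planes of the same size.
total_planes = 1 + 16
total_code_points = total_planes * POINTS_PER_PLANE

print(total_code_points)             # 1114112
print(hex(total_code_points - 1))    # 0x10ffff, the highest code point
```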

UTF-8

Although better than ASCII, Unicode eventually ran out of code points, and it also requires 16 bits per character to represent pure ASCII text, which is wasteful. Consequently, another coding scheme was developed to address these concerns. It is called UTF-8 (UCS Transformation Format), where UCS stands for Universal Character Set, which is essentially Unicode. UTF-8 codes are variable length. As originally designed, they run from 1 to 6 bytes and can code about two billion characters (2^31); the current standard, RFC 3629, limits them to 1 to 4 bytes, which is enough for all 1,114,112 Unicode code points. It is the dominant character encoding used on the World Wide Web.

One of the nice properties of UTF-8 is that codes 0 to 127 are the ASCII characters, allowing them to be expressed in 1 byte (versus 2 bytes in 16-bit Unicode). For characters not in ASCII, the high-order bit of the first byte is set to 1, indicating that 1 or more additional bytes follow. In all, six different formats are used, as illustrated in Fig. 2-45. The bits marked ''d'' are data bits.
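The ASCII property can be observed with Python's built-in codecs. This is a small sketch, not part of the original text; the sample string and variable names are illustrative, and `utf-16-be` stands in for "16-bit Unicode" here.

```python
text = "Hello"                        # pure ASCII sample (illustrative)

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")      # fixed 16-bit units, no byte-order mark

print(len(utf8))                      # 5: one byte per character
print(len(utf16))                     # 10: two bytes per character
print(utf8 == text.encode("ascii"))   # True: the bytes are identical to ASCII

# A non-ASCII character: the first byte has its high-order bit set,
# and a continuation byte follows.
e_acute = "\u00e9".encode("utf-8")    # U+00E9
print([format(b, "08b") for b in e_acute])   # ['11000011', '10101001']
```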

Bits  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
  7   0ddddddd
 11   110ddddd  10dddddd
 16   1110dddd  10dddddd  10dddddd
 21   11110ddd  10dddddd  10dddddd  10dddddd
 26   111110dd  10dddddd  10dddddd  10dddddd  10dddddd
 31   1111110d  10dddddd  10dddddd  10dddddd  10dddddd  10dddddd

Figure 2-45. The UTF-8 encoding scheme.
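The formats in Fig. 2-45 can be turned into a short encoder. The following is a sketch under stated assumptions, not a reference implementation: `utf8_encode` is a hypothetical helper name, it accepts any value up to 31 bits as in the figure, and it does not reject the surrogate range or values above 0x10FFFF the way a standards-conforming encoder would.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point using the six formats of Fig. 2-45 (sketch)."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:                       # 7 data bits: 0ddddddd
        return bytes([code_point])

    # (maximum data bits, prefix of first byte, number of continuation bytes)
    formats = [
        (11, 0b11000000, 1),
        (16, 0b11100000, 2),
        (21, 0b11110000, 3),
        (26, 0b11111000, 4),
        (31, 0b11111100, 5),
    ]
    for bits, prefix, n_cont in formats:
        if code_point < (1 << bits):
            out = []
            # Peel off 6 data bits at a time, lowest bits first.
            for _ in range(n_cont):
                out.append(0b10000000 | (code_point & 0b00111111))
                code_point >>= 6
            out.append(prefix | code_point)     # remaining high bits
            return bytes(reversed(out))
    raise ValueError("code point needs more than 31 bits")


print(utf8_encode(0x41))       # b'A'
print(utf8_encode(0x20AC))     # b'\xe2\x82\xac' (the euro sign)
print(utf8_encode(0x1F600))    # b'\xf0\x9f\x98\x80'
```

For code points that Python itself accepts, the result should agree with `chr(cp).encode("utf-8")`; the 5-byte and 6-byte formats have no built-in counterpart to compare against.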

UTF-8 has a number of advantages over 16-bit Unicode and other schemes. First, if a program or document uses only characters that are in the ASCII character set, each can be represented in 8 bits. Second, the first byte of every UTF-8 character uniquely determines the number of bytes in the character. Third, the continuation bytes in a UTF-8 character always start with 10, whereas the initial byte never does, making the code self-synchronizing. In particular, in the event of a communication or memory error, it is always possible to go forward and find the start of the next character (assuming it has not been damaged).
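The second and third properties can be sketched in a few lines. Both function names below are hypothetical helpers introduced for illustration; the length rule follows the formats of Fig. 2-45, so lengths up to 6 are accepted.

```python
def sequence_length(first_byte: int) -> int:
    """Number of bytes in a character, judged from its first byte alone."""
    if (first_byte & 0b10000000) == 0:
        return 1                                # 0ddddddd
    if (first_byte & 0b11000000) == 0b10000000:
        raise ValueError("continuation byte, not the start of a character")
    length = 0
    mask = 0b10000000
    while first_byte & mask:                    # count the leading 1 bits
        length += 1
        mask >>= 1
    if length > 6:
        raise ValueError("not a valid first byte")
    return length


def next_character_start(data: bytes, pos: int) -> int:
    """Index of the first byte at or after pos that is not a continuation byte."""
    while pos < len(data) and (data[pos] & 0b11000000) == 0b10000000:
        pos += 1
    return pos


data = "a\u20acb".encode("utf-8")       # b'a\xe2\x82\xacb'
print(sequence_length(data[1]))         # 3: 0xE2 starts a 3-byte character
print(next_character_start(data, 2))    # 4: skips 0x82 and 0xAC, lands on 'b'
```

Resynchronizing never requires looking backward: a reader that lands in the middle of a character simply skips bytes of the form 10dddddd until it reaches one that is not.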
