Hardware Reference
In-Depth Information
assigning 16-bit numbers to characters was not highly political?). To make matters
worse, a full Japanese dictionary has 50,000 kanji (excluding names), so with only
20,992 code points available for the Han ideographs, choices had to be made. Not
all Japanese people think that a consortium of computer companies, even if a few
of them are Japanese, is the ideal forum to make these choices.
Guess what? 65,536 code points was not enough to satisfy everyone, so in
1996 an additional 16 planes of 65,536 code points each were added, expanding
the total number of code points to 1,114,112.
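The total quoted above can be checked with a line of arithmetic. The sketch below assumes the usual reading of that sentence: the original 16-bit space counts as one plane, and each of the 16 added planes holds another 2^16 code points. The constant names are illustrative only.

```python
# Illustrative names; they do not come from any library.
CODE_POINTS_PER_PLANE = 2 ** 16   # 65,536 code points addressed by 16 bits
ORIGINAL_PLANES = 1               # the original 16-bit Unicode space
ADDED_PLANES = 16                 # planes added in 1996

total_code_points = (ORIGINAL_PLANES + ADDED_PLANES) * CODE_POINTS_PER_PLANE
print(f"{total_code_points:,}")   # prints 1,114,112
```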
UTF-8
Although better than ASCII, Unicode eventually ran out of code points and it
also requires 16 bits per character to represent pure ASCII text, which is wasteful.
Consequently, another coding scheme was developed to address these concerns. It
is called UTF-8 (UCS Transformation Format), where UCS stands for Universal
Character Set, which is essentially Unicode. UTF-8 codes are variable length,
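from 1 to 4 bytes in the current standard, which covers every Unicode code point;
the original design, with up to 6 bytes, could code about two billion characters.
It is the dominant character encoding used on the World Wide Web.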
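The storage cost for pure ASCII text can be seen with Python's built-in codecs. This is a small sketch, assuming that `utf-16-be` (16 bits per character for this sample, no byte-order mark) is a fair stand-in for the 16-bit Unicode encoding the text describes; the sample string is arbitrary.

```python
text = "Hello, world"                    # pure ASCII sample, 12 characters

utf16_bytes = text.encode("utf-16-be")   # 2 bytes per character for this text
utf8_bytes = text.encode("utf-8")        # 1 byte per ASCII character

print(len(text), len(utf16_bytes), len(utf8_bytes))   # prints 12 24 12

# For ASCII-only text, the UTF-8 bytes are identical to the ASCII bytes.
print(utf8_bytes == text.encode("ascii"))             # prints True
```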
One of the nice properties of UTF-8 is that codes 0 to 127 are the ASCII
characters, allowing them to be expressed in 1 byte (versus 2 bytes in Unicode). For
characters not in ASCII, the high-order bit of the first byte is set to 1, indicating
that 1 or more additional bytes follow. In all, six different formats are used, as
illustrated in Fig. 2-45. The bits marked "d" are data bits.
Bits   Byte 1     Byte 2     Byte 3     Byte 4     Byte 5     Byte 6
 7     0ddddddd
11     110ddddd   10dddddd
16     1110dddd   10dddddd   10dddddd
21     11110ddd   10dddddd   10dddddd   10dddddd
26     111110dd   10dddddd   10dddddd   10dddddd   10dddddd
31     1111110d   10dddddd   10dddddd   10dddddd   10dddddd   10dddddd
Figure 2-45. The UTF-8 encoding scheme.
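The formats in Fig. 2-45 can be turned into a short encoder. The following is a sketch, not a reference implementation: the function name `encode_fig_2_45` is hypothetical, and it follows the six-format scheme of the figure, so it will produce 5- and 6-byte codes that Python's built-in UTF-8 codec (which stops at 4 bytes) never emits. It always picks the shortest format that fits and does no validation beyond the 31-bit range check.

```python
def encode_fig_2_45(code: int) -> bytes:
    """Encode an integer code using the six formats of Fig. 2-45 (illustrative)."""
    if code < 0 or code >= 1 << 31:
        raise ValueError("code must fit in 31 bits")
    if code < 1 << 7:
        return bytes([code])                      # 0ddddddd
    # (data bits available, prefix of first byte, number of continuation bytes)
    formats = [
        (11, 0b11000000, 1),
        (16, 0b11100000, 2),
        (21, 0b11110000, 3),
        (26, 0b11111000, 4),
        (31, 0b11111100, 5),
    ]
    for bits, prefix, n_cont in formats:
        if code < 1 << bits:
            # The first byte carries the prefix plus the highest-order data bits.
            first = prefix | (code >> (6 * n_cont))
            # Each continuation byte is 10dddddd, highest-order group first.
            rest = [0b10000000 | ((code >> (6 * i)) & 0b111111)
                    for i in range(n_cont - 1, -1, -1)]
            return bytes([first] + rest)
    raise AssertionError("unreachable: range was checked above")
```

For example, `encode_fig_2_45(0x20AC).hex()` gives `e282ac`, the same three bytes that `"€".encode("utf-8")` produces, while `encode_fig_2_45(0x7FFFFFFF)` yields a 6-byte code that exists only in the original six-format design.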
UTF-8 has a number of advantages over Unicode and other schemes. First, if a
program or document uses only characters that are in the ASCII character set, each
can be represented in 8 bits. Second, the first byte of every UTF-8 character
uniquely determines the number of bytes in the character. Third, the continuation
bytes in a UTF-8 character always start with 10, whereas the initial byte never
does, making the code self-synchronizing. In particular, in the event of a
communication or memory error, it is always possible to go forward and find the start of
the next character (assuming it has not been damaged).
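The second and third properties can be sketched in a few lines of Python. The function names `sequence_length` and `next_start` are hypothetical, and the length rule assumes the six formats of Fig. 2-45: the number of leading 1 bits in the first byte gives the total byte count, and any byte of the form 10dddddd is a continuation byte.

```python
def sequence_length(first_byte: int) -> int:
    """Return the number of bytes in a character, given its first byte."""
    if first_byte & 0b10000000 == 0:
        return 1                                  # 0ddddddd: plain ASCII
    if first_byte & 0b11000000 == 0b10000000:
        raise ValueError("continuation byte, not the start of a character")
    # Count the leading 1 bits: 110... is 2 bytes, 1110... is 3, and so on.
    length = 0
    mask = 0b10000000
    while first_byte & mask:
        length += 1
        mask >>= 1
    if length > 6:
        raise ValueError("not a valid first byte in the six-format scheme")
    return length


def next_start(data: bytes, pos: int) -> int:
    """Return the index of the first character start at or after pos."""
    # Skip forward over continuation bytes (10dddddd) until a start byte appears.
    while pos < len(data) and data[pos] & 0b11000000 == 0b10000000:
        pos += 1
    return pos
```

With `data = "a€b".encode("utf-8")`, which is the five bytes `61 e2 82 ac 62`, `sequence_length(data[1])` returns 3, and `next_start(data, 2)` returns 4: a reader dropped into the middle of the euro sign skips its two continuation bytes and resumes at `b`.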
 
 