Internationalization - Web Standards: Mastering HTML5, CSS3, and XML


For web publishing, UTF-8 is the recommended encoding. It is interoperable and backward compatible with US-ASCII³ and has further advantageous characteristics [3]. UTF-8 supports internationalized resource identifiers (IRIs, multilingual web addresses) [4, 5]. UTF-8 uses at least one byte per character, while UTF-16 uses at least two, so a UTF-8 encoded file tends to be smaller than a UTF-16 encoded file when its content is mostly ASCII, as markup usually is. UTF-8 is byte oriented, while UTF-16 and UTF-32 are not; in other words, the byte order of UTF-16 and UTF-32 files should be declared by the byte-order mark (see the section “The Byte-Order Mark (BOM)”). UTF-8 also recovers from errors better than the other Unicode encoding forms.
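The ASCII compatibility and the size difference can be checked with a short Python sketch. The sample strings and the helper name `encoded_sizes` are illustrative assumptions, not part of the original text.

```python
# Sketch: ASCII compatibility and encoded size of UTF-8 versus UTF-16.
# The sample strings and the helper name are illustrative assumptions.
ascii_text = "<p>Hello, world!</p>"
mixed_text = "<p>Zo\u00eb: \u65e5\u672c\u8a9e</p>"  # contains e-diaeresis and three CJK characters

# ASCII-only text encodes to identical bytes in US-ASCII and UTF-8.
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, encoded without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


print(encoded_sizes(ascii_text))  # (20, 40)
print(encoded_sizes(mixed_text))  # (22, 30)
```

The advantage narrows as the share of non-ASCII characters grows: each of the three CJK characters above takes three bytes in UTF-8 but only two in UTF-16.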

UTF-16 and UTF-32 have further variants that depend on endianness, the order in which the bytes of a multibyte code unit are stored. If the most significant byte comes first (lowest address) and the least significant byte comes last (highest address), the file is big-endian (UTF-16BE, UTF-32BE). If the bytes are reversed, the file is little-endian (UTF-16LE, UTF-32LE). Table 2-2 summarizes the differences between UTF-8 and the variants of UTF-16 and UTF-32.
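As a minimal sketch of what endianness means in practice, the following Python snippet encodes one character in both UTF-16 byte orders and guesses the order of a byte string from its BOM. The helper name `detect_utf16_order` is hypothetical, and the check covers UTF-16 only.

```python
import codecs

ch = "\u20ac"  # EURO SIGN, code point U+20AC

big = ch.encode("utf-16-be")     # most significant byte first
little = ch.encode("utf-16-le")  # least significant byte first
print(big.hex(), little.hex())   # 20ac ac20


def detect_utf16_order(data):
    """Guess the UTF-16 byte order from a leading BOM; None if there is no BOM.

    Hypothetical helper for illustration. It does not handle UTF-32,
    whose little-endian BOM begins with the same two bytes as UTF-16LE.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little-endian"
    return None


print(detect_utf16_order(codecs.BOM_UTF16_LE + little))  # little-endian
```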

Table 2-2. Comparison of Unicode Encoding Schemes

Encoding                    UTF-8         UTF-16   UTF-16BE    UTF-16LE       UTF-32   UTF-32BE    UTF-32LE
Smallest code point         0000          0000     0000        0000           0000     0000        0000
Largest code point          10FFFF        10FFFF   10FFFF      10FFFF         10FFFF   10FFFF      10FFFF
Code unit size              8 bits        16 bits  16 bits     16 bits        32 bits  32 bits     32 bits
Byte order                  Not provided  BOM      Big-endian  Little-endian  BOM      Big-endian  Little-endian
Fewest bytes per character  1             2        2           2              4        4           4
Most bytes per character    4             4        4           4              4        4           4
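The byte counts in Table 2-2 can be spot-checked with a few sample code points. This is a sketch only; the sample set and the helper name `bytes_per_character` are assumptions chosen for illustration.

```python
# Sketch: bytes needed per character in each encoding form (no BOM).
# The sample code points and the helper name are illustrative assumptions.
SAMPLES = {
    "U+0041": "\u0041",        # LATIN CAPITAL LETTER A
    "U+00E9": "\u00e9",        # LATIN SMALL LETTER E WITH ACUTE
    "U+20AC": "\u20ac",        # EURO SIGN
    "U+10FFFF": "\U0010ffff",  # largest code point
}
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")


def bytes_per_character(ch):
    """Map each encoding name to the number of bytes it uses for ch."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


for label, ch in SAMPLES.items():
    print(label, bytes_per_character(ch))
# U+0041 {'utf-8': 1, 'utf-16-be': 2, 'utf-32-be': 4}
# U+00E9 {'utf-8': 2, 'utf-16-be': 2, 'utf-32-be': 4}
# U+20AC {'utf-8': 3, 'utf-16-be': 2, 'utf-32-be': 4}
# U+10FFFF {'utf-8': 4, 'utf-16-be': 4, 'utf-32-be': 4}
```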

According to the HTML5 specification, “authors are encouraged to use UTF-8. Conformance checkers may

advise authors against using legacy encodings [6]. Authoring tools should default to using UTF-8 for newly created

documents [7].”

Characters That Should Be Avoided in the Markup

Some Unicode characters should not be used in HTML markup and XML documents (see Table 2-3) for one or more of the following reasons:

• They are deprecated in the Unicode standard.
• They cannot be supported without additional data.
• They are difficult to handle because they are stateful.⁴
• They can be handled more efficiently with markup.
• They could conflict with equivalent markup.
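A document can be scanned for such characters before publishing. The sketch below flags only the bidirectional embedding controls (U+202A to U+202E), one of the stateful groups mentioned in the footnote on stateful characters. This set is an assumed, illustrative subset and does not reproduce Table 2-3; the function name `find_discouraged` is hypothetical.

```python
import unicodedata

# Assumption: illustrative subset only (bidirectional embedding controls,
# U+202A..U+202E). This is not a reproduction of Table 2-3.
BIDI_EMBEDDING_CONTROLS = {chr(cp) for cp in range(0x202A, 0x202F)}


def find_discouraged(text):
    """Return (index, 'U+XXXX', Unicode name) for each flagged character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ch in BIDI_EMBEDDING_CONTROLS
    ]


sample = "<p>abc \u202bxyz\u202c</p>"
for hit in find_discouraged(sample):
    print(hit)
# (7, 'U+202B', 'RIGHT-TO-LEFT EMBEDDING')
# (11, 'U+202C', 'POP DIRECTIONAL FORMATTING')
```

In HTML, the `dir` attribute is the usual markup alternative to these control characters.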

³ All US-ASCII characters use exactly the same bytes in UTF-8 as in US-ASCII; i.e., a UTF-8 file that contains only ASCII characters is identical to an ASCII file.

⁴ The character represented by a particular value in the text depends on values provided earlier in the text stream, e.g., escape sequences or bidirectional embedding controls.
