HTML and CSS Reference
In-Depth Information
For web publishing, UTF-8 is recommended because it provides interoperability and backward compatibility with
US-ASCII (see footnote 3) and has further advantageous characteristics [3]. UTF-8 supports internationalized resource
identifiers (IRIs, multilingual web addresses) [4, 5]. UTF-8 uses at least one byte per character while UTF-16 uses at
least two, so a UTF-8 encoded file tends to be smaller than a UTF-16 encoded file when the text is mostly ASCII, as
markup usually is. UTF-8 is byte oriented, while UTF-16 and UTF-32 are not; in other words, UTF-8 has no byte-order
ambiguity, whereas the byte order of UTF-16 and UTF-32 files should be declared by the byte-order mark (see the
section “The Byte-Order Mark (BOM)”). UTF-8 also recovers from errors better than the other Unicode encodings.
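The size and compatibility claims above can be checked with a minimal Python sketch. The sample strings and the helper name `encoded_sizes` are illustrative assumptions, not part of the original text; the BOM-less little-endian codecs are used so that no byte-order mark inflates the counts.

```python
# Sketch: compare encoded sizes and confirm ASCII compatibility of UTF-8.
# The sample strings and the helper name are hypothetical, for illustration only.

def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16, and UTF-32 (no BOM)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

ascii_text = "Hello, HTML5!"   # 13 ASCII characters
cjk_text = "日本語"              # 3 characters outside the ASCII range

# An ASCII-only string has exactly the same bytes in UTF-8 as in US-ASCII.
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

ascii_sizes = encoded_sizes(ascii_text)
cjk_sizes = encoded_sizes(cjk_text)

print(same_bytes)
print(ascii_sizes)
print(cjk_sizes)
```

For the ASCII sample, UTF-8 needs half the bytes of UTF-16. For the CJK sample the relation reverses (three bytes per character in UTF-8 versus two in UTF-16), which is why the size advantage is a tendency rather than a guarantee.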
There are further variants of UTF-16 and UTF-32, depending on the endianness, which is the order of the bytes
within each multibyte code unit. If the most significant byte is the first byte (lowest address) and
the least significant byte is the last byte (highest address), the file is called big-endian (UTF-16BE, UTF-32BE). If these
bytes are reversed, the file is referred to as little-endian (UTF-16LE, UTF-32LE). Table 2-2 summarizes the differences
between UTF-8 and the variants of UTF-16 and UTF-32.
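As a rough illustration of the two byte orders, the following Python sketch encodes one character both ways and inspects a leading byte-order mark. The function name `detect_utf16_byte_order` is a hypothetical helper, and the sketch assumes the input is known to be UTF-16 (the UTF-32 little-endian BOM begins with the same two bytes, so a general detector would have to check for it first).

```python
import codecs

# Sketch: the same character in big-endian and little-endian UTF-16.
euro = "\u20ac"                        # EURO SIGN, code point U+20AC

big_endian = euro.encode("utf-16-be")     # most significant byte first
little_endian = euro.encode("utf-16-le")  # least significant byte first

def detect_utf16_byte_order(data):
    """Hypothetical helper: report the byte order declared by a UTF-16 BOM.

    Assumes data is UTF-16; returns None when no BOM is present.
    """
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little-endian"
    return None

print(big_endian.hex(), little_endian.hex())
print(detect_utf16_byte_order(codecs.BOM_UTF16_BE + big_endian))
print(detect_utf16_byte_order(codecs.BOM_UTF16_LE + little_endian))
print(detect_utf16_byte_order(big_endian))
```

The explicit `utf-16-be` and `utf-16-le` codecs write no BOM, which matches the UTF-16BE and UTF-16LE variants; the plain `utf-16` codec in Python prepends a BOM in the platform's native byte order.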
Table 2-2. Comparison of Unicode Encoding Schemes

Encoding                    UTF-8         UTF-16  UTF-16BE    UTF-16LE       UTF-32  UTF-32BE    UTF-32LE
Smallest code point         0000          0000    0000        0000           0000    0000        0000
Largest code point          10FFFF        10FFFF  10FFFF      10FFFF         10FFFF  10FFFF      10FFFF
Code unit size              8 bits        16 bits 16 bits     16 bits        32 bits 32 bits     32 bits
Byte order                  Not provided  BOM     Big-endian  Little-endian  BOM     Big-endian  Little-endian
Fewest bytes per character  1             2       2           2              4       4           4
Most bytes per character    4             4       4           4              4       4           4
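The last two rows of Table 2-2 can be reproduced with a short Python sketch. It assumes that the extremes occur at the smallest and largest code points (U+0000 and U+10FFFF), which holds because the encoded length never decreases as the code point grows; the helper name `bytes_per_character` is illustrative.

```python
# Sketch: fewest and most bytes per character for each encoding scheme.
# BOM-less codecs are used so that only the character itself is counted.

ENCODINGS = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")

def bytes_per_character(encoding):
    """Return (fewest, most) bytes needed to encode a single code point."""
    fewest = len(chr(0x0000).encode(encoding))    # smallest code point
    most = len(chr(0x10FFFF).encode(encoding))    # largest code point
    return fewest, most

table_rows = {enc: bytes_per_character(enc) for enc in ENCODINGS}

for enc, (fewest, most) in table_rows.items():
    print(f"{enc:10} fewest={fewest} most={most}")
```

UTF-16 reaches four bytes for U+10FFFF because code points above U+FFFF are stored as a pair of 16-bit code units (a surrogate pair).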
According to the HTML5 specification, “authors are encouraged to use UTF-8. Conformance checkers may
advise authors against using legacy encodings [6]. Authoring tools should default to using UTF-8 for newly created
documents [7].”
Characters That Should Be Avoided In the Markup
Some Unicode characters should not be used in HTML markup and XML documents (see Table 2-3) for
one or more of the following reasons:
They are deprecated in the Unicode standard.
They cannot be supported without additional data.
They are difficult to handle because they are stateful (see footnote 4).
They can be handled more efficiently with markup.
They should be avoided because of the potential conflict they could cause with equivalent markup.
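A check for such characters could be sketched as follows. Table 2-3 is not reproduced in this section, so the set of code points below is an assumption: it covers only the bidirectional embedding and override controls (U+202A to U+202E), chosen because footnote 4 names bidirectional embedding controls as an example of stateful characters. The function name `find_discouraged` is hypothetical.

```python
import unicodedata

# Assumed set for illustration: bidirectional embedding/override controls.
# The full list of characters to avoid is given in Table 2-3, not here.
DISCOURAGED_CODE_POINTS = set(range(0x202A, 0x202F))  # U+202A .. U+202E

def find_discouraged(text):
    """Return (index, 'U+XXXX', Unicode name) for each flagged character."""
    hits = []
    for index, char in enumerate(text):
        if ord(char) in DISCOURAGED_CODE_POINTS:
            hits.append((index, f"U+{ord(char):04X}", unicodedata.name(char)))
    return hits

sample = "abc\u202bdef\u202c"   # contains an embedding control and its terminator
report = find_discouraged(sample)
for index, code_point, name in report:
    print(index, code_point, name)
```

In markup, the same effect is usually expressed with the `dir` attribute or the `bdo` element instead of these control characters.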
3 All US-ASCII characters use exactly the same bytes in UTF-8 as in US-ASCII; i.e., a UTF-8 file that contains only ASCII
characters is identical to an ASCII file.
4 Stateful means that the character represented by a particular value in the text depends on values provided earlier
in the text stream, e.g., escape sequences or bidirectional embedding controls.
 
 