Internationalization - Web Standards: Mastering HTML5, CSS3, and XML


Table 2-4. The Most Important Formatting Characters That Can Also Be Used for Markup [9]

| Codepoint(s) | Name or Function | Comment |
|---|---|---|
| U+00A0 | No-break space | Line break control. |
| U+00AD | Soft hyphen | Line break control. |
| U+200B | Zero-width space | Line break control. |
| U+200C .. U+200D | Zero-width join controls (ZWNJ and ZWJ) | Required for Persian and many Indic scripts. |
| U+200E .. U+200F | Implicit directional marks (LRM and RLM) | LRM and RLM are allowed. |
| U+2011 | Non-breaking hyphen | Line break control. |
| U+2044 | Fraction slash | Alternatively, MathML markup can be used. |
| U+2060 | Word joiner | Use this as the word joiner instead of U+FEFF (ZWNBSP). |
| U+2061 .. U+2064 | Invisible mathematical operators | Mathematical use. |
| U+2FF0 .. U+2FFB | Ideographic description characters | Graphic characters (not controls). |
| U+303E | Ideographic variation indicator | Graphic character (not a control). |
| U+FE00 .. U+FE0F | Variation selectors | Modify graphic characters. |
| U+E0100 .. U+E01EF | Variation selectors | Modify graphic characters. |
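Because most of the characters in Table 2-4 are invisible in a text editor, it can help to write them as numeric character references so that they remain visible in the markup source. The following is a minimal Python sketch of that idea, not a method prescribed by the book. The names FORMATTING_CODEPOINTS and to_char_refs are hypothetical, and the set covers only the single-codepoint line break, joining, and directional entries from the table plus the fraction slash.

```python
# Hypothetical helper: make invisible formatting characters visible in
# HTML source by writing them as hexadecimal character references.
# The set below is an illustrative selection from Table 2-4.
FORMATTING_CODEPOINTS = {
    0x00A0,  # no-break space
    0x00AD,  # soft hyphen
    0x200B,  # zero-width space
    0x200C,  # zero-width non-joiner (ZWNJ)
    0x200D,  # zero-width joiner (ZWJ)
    0x200E,  # left-to-right mark (LRM)
    0x200F,  # right-to-left mark (RLM)
    0x2011,  # non-breaking hyphen
    0x2044,  # fraction slash
    0x2060,  # word joiner
}


def to_char_refs(text: str) -> str:
    """Replace each listed formatting character with &#xNNNN;."""
    return "".join(
        f"&#x{ord(ch):04X};" if ord(ch) in FORMATTING_CODEPOINTS else ch
        for ch in text
    )


print(to_char_refs("10\u00A0km"))  # 10&#x00A0;km
```

All other characters pass through unchanged, so the function can be applied to text that has already been escaped for HTML.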

Special Characters

Certain Unicode characters deserve extended attention because they should be used with caution.

The Byte-Order Mark (BOM)

Unicode files can begin with a special byte sequence known as the byte-order mark (BOM). It is the encoded form of the codepoint U+FEFF (zero-width non-breaking space, ZWNBSP). As mentioned earlier, the byte order of UTF-16 and UTF-32 encoded files should be declared, and the BOM provides this information.

In UTF-16, each character occupies 2 or 4 bytes, and the two bytes of every 16-bit code unit can be stored in either of two orders, little-endian or big-endian, which determines the order in which the bytes are read. To indicate which of the two is used, documents encoded in UTF-16 should always start with the BOM. In UTF-8, the BOM is optional because there is only one possible byte sequence for each character, but if it is still provided, it is called the UTF-8 signature. According to the I18N Activity Group at W3C, the byte-order mark should be omitted in UTF-8 [10], mainly because it could cause display problems in some browsers. Typically it produces an extra line or unwanted characters at the top of the page [11]. An advanced text editor or Richard Ishida's UTF-8 BOM tester [12] can be used to check for the presence of a UTF-8 signature.
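The check described above can also be scripted. The following is a minimal Python sketch that inspects the first bytes of a file's content using the BOM constants from the standard codecs module. The function names detect_bom and strip_utf8_signature are hypothetical, and the sketch only recognizes a BOM; it does not guess the encoding of files that lack one.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def detect_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if absent."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def strip_utf8_signature(data: bytes) -> bytes:
    """Remove a leading UTF-8 signature (EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


print(detect_bom(b"\xef\xbb\xbf<!DOCTYPE html>"))  # UTF-8
print(detect_bom("\ufeffhi".encode("utf-16-be")))  # UTF-16BE
```

For a file on disk, the first four bytes are enough, for example detect_bom(open(path, "rb").read(4)), where path is whatever file is being checked.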

Whitespace Characters

Some Unicode characters are (invisible) whitespace characters that have different line-breaking properties, different ligating properties, and different widths. These characters are used to separate different parts of the document with line breaks, tabulators, and spaces. They represent horizontal or vertical spaces on web pages and contribute to the appearance and layout of content blocks or the entire page. Whitespace characters are typically
