HTML and CSS Reference
In-Depth Information
Table 2-4. The Most Important Formatting Characters That Can Also Be Used for Markup [9]
Code Point(s)         Name or Function                           Comment
--------------------  -----------------------------------------  --------------------------------------------
U+00A0                Nonbreakable space                         Line break control.
U+00AD                Soft hyphen                                Line break control.
U+200B                Zero-width space                           Line break control.
U+200C .. U+200D      Zero-width join controls (ZWNJ and ZWJ)    Required for Persian and many Indic scripts.
U+200E .. U+200F      Implicit directional marks (LRM and RLM)   LRM and RLM are allowed.
U+2011                Nonbreaking hyphen                         Line break control.
U+2044                Fraction slash                             Alternatively, MathML markup can be used.
U+2060                Word joiner                                Use this as the word joiner instead of U+FEFF.
U+2061 .. U+2064      Invisible mathematical operators           Mathematical use.
U+2FF0 .. U+2FFB      Ideographic character description          Graphic characters (not controls).
U+303E                Ideographic variation indicator            Graphic character (not a control).
U+FE00 .. U+FE0F      Variation selectors                        Modify graphic characters.
U+E0100 .. U+E01EF    Variation selectors                        Modify graphic characters.
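Because most of these characters are invisible, they are easy to lose track of in source text. The following Python sketch shows one way to make them visible by rewriting them as hexadecimal numeric character references. The function name to_char_refs and the selection of code points (U+00A0, U+00AD, U+200B to U+200F, U+2011, U+2044, U+2060) are assumptions made for illustration only; they are not taken from the table's source.

```python
import unicodedata

# Illustrative selection of formatting characters (an assumption, not an
# exhaustive list): NBSP, soft hyphen, ZWSP, ZWNJ, ZWJ, LRM, RLM,
# nonbreaking hyphen, fraction slash, word joiner.
FORMATTING_CHARS = frozenset(
    [0x00A0, 0x00AD, 0x200B, 0x200C, 0x200D,
     0x200E, 0x200F, 0x2011, 0x2044, 0x2060]
)


def to_char_refs(text, code_points=FORMATTING_CHARS):
    """Replace the selected characters with references such as &#xA0;."""
    return "".join(
        f"&#x{ord(ch):X};" if ord(ch) in code_points else ch
        for ch in text
    )


if __name__ == "__main__":
    # The no-break space between "10" and "km" becomes visible.
    print(to_char_refs("10\u00a0km"))  # prints 10&#xA0;km
    # List the official Unicode name of each selected character.
    for cp in sorted(FORMATTING_CHARS):
        print(f"U+{cp:04X}  {unicodedata.name(chr(cp))}")
```

Writing the characters as references keeps the markup readable in editors that do not display invisible characters, at the cost of slightly longer source.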
Special Characters
Certain Unicode characters deserve extended attention because they should be used with caution.
The Byte-Order Mark (BOM)
Unicode files can begin with a special byte sequence known as the byte-order mark (BOM). It is the encoded form of
the code point U+FEFF (zero-width no-break space, ZWNBSP). As mentioned earlier, the byte order of UTF-16 and UTF-32
encoded files should be declared, and the BOM provides this information.
In UTF-16, each character occupies 2 or 4 bytes, and the bytes of each 16-bit code unit can be stored in either of two
orders, little-endian or big-endian, which determines how they are read. To distinguish between the two, documents
encoded in UTF-16 should always start with the BOM. In UTF-8, the BOM is optional because there is only one possible
byte order, but if it is provided, it is called the UTF-8 signature. According to the I18N Activity Group at W3C, the
byte-order mark should be omitted in UTF-8 [10], mainly because it can cause display problems in some browsers,
typically an extra line or unwanted characters at the top of the page [11]. An advanced text editor or Richard
Ishida's UTF-8 BOM tester [12] can be used to check for the presence of a UTF-8 signature.
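The check for a BOM can also be scripted. The following Python sketch compares the first bytes of a file's content against the BOM constants of the standard codecs module and removes a UTF-8 signature if one is present. The function names detect_bom and strip_utf8_signature are hypothetical helpers introduced here for illustration.

```python
import codecs

# BOM byte sequences paired with the encoding they indicate. The UTF-32
# entries come first because the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE). Caveat: UTF-16 LE text whose first character
# is U+0000 would be reported as UTF-32 LE by this simple check.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


def strip_utf8_signature(data: bytes) -> bytes:
    """Remove a leading UTF-8 signature (EF BB BF) if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    page = codecs.BOM_UTF8 + b"<!DOCTYPE html>"
    print(detect_bom(page))            # prints ('utf-8', 3)
    print(strip_utf8_signature(page))  # prints b'<!DOCTYPE html>'
```

When reading text rather than raw bytes, Python's "utf-8-sig" codec achieves the same result by skipping the signature during decoding.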
Whitespace Characters
Some Unicode characters are invisible whitespace characters that differ in their line-breaking behavior,
ligating behavior, and width. These characters are used to separate different parts of the
document with line breaks, tabs, and spaces. They represent horizontal or vertical spaces on web pages and
contribute to the appearance and layout of content blocks or the entire page. Whitespace characters are typically