Internationalization - Web Standards: Mastering HTML5, CSS3, and XML

not visible but reserve some space when rendered. The list of whitespace characters varies from context to context. For example, the form feed control character is considered whitespace in HTML but not in XML. Each markup language defines the few whitespace characters that can be used as part of its markup syntax. The XML specification defines whitespace as a combination of one or more of the following characters: space (U+0020), carriage return (U+000D), line feed (U+000A), or tab (U+0009). HTML 4.01 also supports the form feed character (U+000C), which cannot be used in XHTML.
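The difference between the two whitespace sets can be sketched in a few lines of Python. This is only an illustration: the names `XML_WHITESPACE`, `HTML401_WHITESPACE`, and `is_whitespace_run` are hypothetical, and the sets contain only the characters named in the text above.

```python
# Whitespace characters named in the text (an illustrative subset, not a parser).
XML_WHITESPACE = frozenset({"\u0020", "\u000D", "\u000A", "\u0009"})  # space, CR, LF, tab
# HTML 4.01 additionally accepts form feed (U+000C), as described above.
HTML401_WHITESPACE = XML_WHITESPACE | {"\u000C"}


def is_whitespace_run(text, allowed):
    """Return True if text is one or more characters, all drawn from allowed."""
    return len(text) > 0 and all(ch in allowed for ch in text)


sample = " \t\f\n"  # space, tab, form feed, line feed
print(is_whitespace_run(sample, HTML401_WHITESPACE))  # True
print(is_whitespace_run(sample, XML_WHITESPACE))      # False: form feed is not XML whitespace
```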

Not all whitespace characters can be typed from the keyboard, although the most common ones, such as a blank space (the basic word divider in Western languages) or a single tab, can be typed using the spacebar and the Tab key, respectively. Advanced text editors usually provide options for inserting other whitespace characters (see the later section “Development Tools”).

A very bad practice from the 1990s is to create spacing for typography or layout by embedding blank images, such as 1×1 pixel spacer.gif files, instead of using whitespace characters, margins, or padding. The biggest disadvantage of this technique is that such images carry no structure or semantic meaning in the markup. They also have a negative effect on searchability and accessibility (text browsers would display, and screen readers would read aloud, “spacer.gif” repeatedly). Another serious problem with spacer images is that even the slightest change in the markup can completely break the site layout.

NFC Normalization Is Recommended

In Unicode, the same text can be represented by different character sequences. The accented a (in other words, á), for example, can be represented either as the precomposed U+00E1 (Latin small letter a with acute) or as the decomposed sequence of U+0061 (Latin small letter a) and U+0301 (Combining acute accent).
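A short Python sketch makes the two representations concrete. It uses only the standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

precomposed = "\u00E1"       # á as a single code point
decomposed = "\u0061\u0301"  # a followed by a combining acute accent

# The two strings render identically but are different code point sequences.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 1 2

for ch in decomposed:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```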

The Unicode standard supports four normalization forms: NFC, NFD, NFKC, and NFKD, where C stands for composed (precomposed), D for decomposed, and K for compatibility.
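The four forms can be compared with Python's standard `unicodedata.normalize`. In this sketch, the ligature ﬁ (U+FB01) is an added assumption, not part of the original text. It is included only because the compatibility (K) forms change it while the canonical forms leave it alone. The helper `code_points` is hypothetical.

```python
import unicodedata

# á (U+00E1) followed by the ligature ﬁ (U+FB01); the ligature is an
# illustrative addition that shows what the compatibility forms do.
text = "\u00E1\uFB01"


def code_points(s):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, code_points(unicodedata.normalize(form, text)))
```

NFC and NFD differ only in whether á is composed. NFKC and NFKD additionally replace the ligature with the plain letters f and i.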

The normalization form is especially important when accents or other diacritics are used in (X)HTML identifiers, class names, or CSS selectors. If such a word is used in precomposed form in the HTML (for example, <div id="hangsúlyos">) but in decomposed form in the CSS (for example, #hangsúlyos { color: red; }), then the selector won't match the id, because the two strings are compared code point by code point even though they look identical. The simplest way to avoid this problem, and the best practice, is to eliminate accented characters from identifiers, class names, and CSS selectors and to use unaccented English letters only.
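The mismatch can be reproduced outside the browser. The following is a minimal sketch of a check that a build script might run. The function `same_identifier` is hypothetical and is not something browsers do, since they match selectors without normalizing.

```python
import unicodedata


def same_identifier(a, b):
    """Compare two identifiers after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


html_id = "hangs\u00FAlyos"        # ú precomposed, as typed in the HTML
css_id = "hangsu\u0301lyos"        # u + combining acute, as typed in the CSS

print(html_id == css_id)                 # False: a raw comparison fails
print(same_identifier(html_id, css_id))  # True: equal once both are NFC
```

Normalizing source files to NFC before deployment, or avoiding accented identifiers altogether, removes the need for such a check.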

W3C recommends NFC normalization, which advanced text editors support by default, on the Web to improve interoperability [13].

Unicode Should Be Preferred

Web pages should use one character encoding at a time. Different parts of the same document should not be encoded with different encoding schemes.

UTF-8 character encoding can simplify multilingual sites. Unicode allows more languages to be used on a single page than any other encoding system, which makes it ideal for content, forms, scripts, and databases. For this reason, Unicode should be used wherever possible [14]. Thanks to the increasing popularity of HTML5 templates and best practices, web designers tend to use UTF-8 for all their projects. The widespread adoption of UTF-8 reduces the risk of browsers misdetecting the encoding of documents that contain special characters.
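What incorrect encoding detection looks like can be sketched in Python by decoding UTF-8 bytes with the wrong codec. The choice of Latin-1 as the wrongly assumed encoding is an assumption made for illustration.

```python
text = "hangs\u00FAlyos"  # "hangsúlyos", with ú as U+00FA

utf8_bytes = text.encode("utf-8")
print(len(text), len(utf8_bytes))  # 10 11: ú takes two bytes in UTF-8

# Decoding with the declared encoding round-trips the text.
print(utf8_bytes.decode("utf-8"))

# Decoding the same bytes as Latin-1 (a wrong guess) garbles the accented letter.
print(utf8_bytes.decode("latin-1"))
```

Declaring UTF-8 explicitly and using it consistently means the decoder never has to guess.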

Using Unicode does not guarantee that text will be displayed correctly in browsers. Several scripts, such as Arabic, require additional techniques to ensure that the appropriate sequence of glyphs is rendered.
