Internationalization - Web Standards: Mastering HTML5, CSS3, and XML

Chapter 2

Internationalization

Web documents are published in all the languages of the world, using a variety of character repertoires and features such as text direction. Several technologies support multilingual websites. To display characters correctly on websites, the markup files should be encoded with a character encoding that supports the required characters. The character encoding should be properly declared in the document header, and the documents should be served with proper server settings. Capable of representing the characters and ideographs of all natural languages, both ancient and modern, Unicode can be considered the ultimate character encoding. To use Unicode, you need to understand byte-order marks, which provide information about the ordering of the individually addressable subcomponents (bytes) of this multibyte character encoding. Special characters and symbols can be written in various ways, from entity sets and escape codes to hexadecimal notation.
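
The role of the byte-order mark can be illustrated with a minimal sketch. Python is used here purely for illustration; the helper name `detect_bom` and the sample character are assumptions, not part of the chapter. The sketch checks only the UTF-8 and UTF-16 marks and deliberately leaves out UTF-32.

```python
import codecs

# Byte-order marks as defined in Python's standard codecs module.
BOMS = {
    "utf-8": codecs.BOM_UTF8,          # EF BB BF
    "utf-16-le": codecs.BOM_UTF16_LE,  # FF FE
    "utf-16-be": codecs.BOM_UTF16_BE,  # FE FF
}


def detect_bom(data: bytes):
    """Hypothetical helper: return the encoding suggested by a leading BOM, or None."""
    for name, bom in BOMS.items():
        if data.startswith(bom):
            return name
    return None


sample = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# The same character, stored with the two possible byte orders of UTF-16.
little = codecs.BOM_UTF16_LE + sample.encode("utf-16-le")
big = codecs.BOM_UTF16_BE + sample.encode("utf-16-be")

print(little.hex(" "), "->", detect_bom(little))
print(big.hex(" "), "->", detect_bom(big))
```

The two byte sequences carry the same character; only the order of the bytes within each 16-bit unit differs, and the leading mark tells the reader which order was used.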

In this chapter, you will learn how to ensure correct character rendering on websites and how to use the same markup structures for different language versions of multilingual sites. While many characters are supported by more than one character encoding system, Unicode should always be used unless you have a very good reason not to do so. Most characters can be typed directly into the markup, but there are some exceptions. You will also learn the proper application of character entities and whitespace characters to add special characters to websites, such as invisible, unprintable control characters.
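
As a rough illustration of the notations mentioned above, the following sketch uses Python's standard `html` module to show that a named entity, a decimal character reference, and a hexadecimal character reference can all denote the same character. The variable names and sample strings are illustrative assumptions.

```python
import html

# Three ways of writing the same character (U+00E9) in markup.
named = html.unescape("&eacute;")      # named character entity
decimal = html.unescape("&#233;")      # decimal character reference
hexadecimal = html.unescape("&#xE9;")  # hexadecimal character reference

# An invisible whitespace character: the no-break space.
nbsp = html.unescape("&nbsp;")

# Characters with a special meaning in markup are among the exceptions
# that should not be typed directly into text content.
escaped = html.escape("a < b & c")

print(named, decimal, hexadecimal, repr(nbsp), escaped)
```

Whichever notation is used, the parser resolves it to the same codepoint, so the choice is mostly a matter of readability and of what the file's own encoding can represent directly.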

The Importance of Character Encoding

Until the mid-1990s, computers mainly supported only the characters of the English alphabet (partly because of American dominance in the computer market), and the need for international characters was met through code pages, such as CP852 or CP1252, supported by the operating systems of the time (for example, DOS, Windows 3.1, and Windows 95). The proper display of Central European characters, for example, depended on the hardware configuration, the operating system, and the operating system settings. As the Web spread, such limitations were no longer acceptable. In 1997, HTML 4.0 added advanced support for international characters.

The American Standard Code for Information Interchange (ASCII), which encodes 128 characters in 7 bits, has long been the most widely supported character encoding scheme. Additional characters were provided by 8-bit character sets, such as the ISO/IEC 8859 series of ASCII-based standard character encodings. The first part of the series, ISO/IEC 8859-1 (informally referred to as Latin-1), was first published in 1987; it supports most Western European languages and partly supports some other languages. Most modern character encoding systems are based on ASCII; however, they support many more characters.
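
The relationship between 7-bit ASCII and its 8-bit extensions can be sketched as follows. This is a small illustration only; the sample strings are assumptions, and Python's `latin-1` codec is used as a stand-in for ISO/IEC 8859-1.

```python
text_ascii = "Web"
text_latin = "caf\u00e9"  # contains U+00E9, which is outside ASCII

# Every ASCII byte fits in 7 bits, so its value is below 128.
ascii_bytes = text_ascii.encode("ascii")
all_seven_bit = all(b < 128 for b in ascii_bytes)

# Latin-1 uses the eighth bit for additional characters.
latin_bytes = text_latin.encode("latin-1")

# The same text cannot be represented in plain ASCII.
try:
    text_latin.encode("ascii")
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False

# ASCII-based encodings agree with ASCII on the first 128 characters.
same_prefix = (
    text_ascii.encode("latin-1") == ascii_bytes
    and text_ascii.encode("utf-8") == ascii_bytes
)

print(all_seven_bit, latin_bytes, ascii_can_encode, same_prefix)
```

The last check is what "based on ASCII" means in practice: text limited to the ASCII range produces identical bytes in all three encodings.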

If anything other than the most basic Latin characters is needed, many characters on your website will be displayed incorrectly unless an appropriate character encoding is specified. Character encoding standards define not only the identification of each character and the associated numeric value (codepoint¹), but also the way this value is represented in the bits of the file to be encoded.

¹ Codepoints are code positions that can be any of the numerical values that form the codespace of a character encoding.
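
The distinction between a codepoint and its byte representation, and what goes wrong when the wrong encoding is assumed, can be shown with a short sketch. The sample character (a Central European letter) and the choice of encodings are illustrative assumptions.

```python
ch = "\u0151"  # LATIN SMALL LETTER O WITH DOUBLE ACUTE

# The codepoint is a number that identifies the character.
codepoint = ord(ch)

# The same codepoint is stored as different bytes in different encodings.
as_utf8 = ch.encode("utf-8")
as_latin2 = ch.encode("iso-8859-2")

# If the bytes are read with an encoding other than the one used to
# write them, a different character appears.
misread = as_latin2.decode("iso-8859-1")

# Reading them with the declared, correct encoding restores the original.
correct = as_latin2.decode("iso-8859-2")

print(f"U+{codepoint:04X}", as_utf8, as_latin2, misread, correct)
```

Here a single byte written as ISO/IEC 8859-2 is shown as a different letter when interpreted as ISO/IEC 8859-1, which is the kind of incorrect rendering that an explicit encoding declaration is meant to prevent.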
