HTML and CSS Reference
In-Depth Information
Chapter 2
Internationalization
Web documents are published in all languages of the world, using a variety of character repertoires and features such
as text direction. Several technologies support multilingual websites. To display characters correctly on websites, the
markup files should be encoded with a character encoding that supports the required characters. The character
encoding should be properly declared in the document header, and the documents should be served with proper server settings.
Because it can represent the characters and ideographs of virtually all natural languages, both ancient and modern,
Unicode can be considered the ultimate character encoding standard. To use Unicode, you need to understand byte-order
marks, which indicate the order of the bytes within the multibyte units of this
character encoding. Special characters and symbols can be written in various ways, from entity sets and escape codes
to hexadecimal notation.
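As a minimal sketch of what a byte-order mark does, the following Python snippet serializes one character in both byte orders and lets a BOM-aware decoder pick the right one. The sample character and the variable names are illustrative assumptions, not part of the original text.

```python
import codecs

text = "A"  # U+0041; any character would do for this illustration

# The same 16-bit code unit written in two different byte orders.
big_endian = text.encode("utf-16-be")     # high byte first
little_endian = text.encode("utf-16-le")  # low byte first

# A byte-order mark (U+FEFF) placed in front records which order was used.
with_bom_be = codecs.BOM_UTF16_BE + big_endian
with_bom_le = codecs.BOM_UTF16_LE + little_endian

# The generic "utf-16" decoder reads the BOM, selects the byte order,
# and strips the mark from the decoded text.
decoded_be = with_bom_be.decode("utf-16")
decoded_le = with_bom_le.decode("utf-16")

# UTF-8 has no byte-order ambiguity; its optional 3-byte signature
# only identifies the encoding.
utf8_signature = codecs.BOM_UTF8

print(big_endian, little_endian, decoded_be, decoded_le, utf8_signature)
```

Both byte sequences decode to the same text because the mark, not the reader's guess, determines the byte order.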
In this chapter, you will learn how to ensure correct character rendering on web sites and how to use the same markup
structures for different language versions of multilingual sites. While many characters are supported by more than
one character encoding system, Unicode should always be used unless you have a very good reason not to do so. Most
characters can be typed directly into the markup, but there are some exceptions. You will also learn the proper
application of character entities and whitespace characters to add special characters to web sites, including invisible,
unprintable control characters.
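The different notations for a special character can be compared with a short Python sketch. It uses the standard `html` module to resolve a named entity, a decimal reference, and a hexadecimal reference; the copyright sign and the sample string are arbitrary choices for illustration.

```python
import html

# Three ways of writing the copyright sign (U+00A9) in markup.
named = html.unescape("&copy;")         # named character entity
decimal = html.unescape("&#169;")       # decimal character reference
hexadecimal = html.unescape("&#xA9;")   # hexadecimal character reference

# An invisible whitespace character: the no-break space, U+00A0.
nbsp = html.unescape("&nbsp;")

# Characters that have a meaning in markup cannot be typed directly
# as text content and are escaped instead.
escaped = html.escape("a < b & c")

print(named, decimal, hexadecimal, repr(nbsp), escaped)
```

All three notations resolve to the same character, so the choice between them is largely a matter of readability and of what the document's encoding can represent directly.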
The Importance of Character Encoding
Until the mid-1990s, computers mainly supported only the characters of the English alphabet (partly because of the
American dominance of the computer market), and the need for international characters was met through
code pages, such as CP852 or CP1252, supported by the operating systems of the time (for example, DOS,
Windows 3.1, and Windows 95). The proper display of Central European characters, for example, depended on the
hardware configuration, the operating system, and the settings of the operating system. With the
spread of the Web, such limitations were no longer acceptable. In 1997, HTML 4.0 added advanced support for
international characters.
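The dependence on the active code page can be illustrated with a small Python sketch. It assumes a Hungarian letter stored under CP852 and then read back under CP1252; the sample character and variable names are hypothetical.

```python
text = "ő"  # a Central European letter (U+0151), chosen for illustration

# Stored as a single byte under the DOS Central European code page.
stored = text.encode("cp852")

# Read back under the same code page: the letter survives.
reread = stored.decode("cp852")

# Read back under the Windows Western European code page: the same byte
# maps to a different character (errors="replace" guards against bytes
# that CP1252 leaves undefined).
misread = stored.decode("cp1252", errors="replace")

print(stored, reread, misread)
```

The byte on disk never changes; only the code page used to interpret it does, which is why display depended on the system configuration.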
The American Standard Code for Information Interchange (ASCII) has been the most widely supported character
encoding scheme; it encodes 128 characters in 7 bits. Additional characters have been provided by 8-bit character sets,
such as the ISO/IEC 8859 series of ASCII-based standard character encodings. The first part of the series,
ISO/IEC 8859-1 (informally referred to as Latin-1), was first published in 1987; it supports most Western European
languages and partly supports some other languages.
Most modern character encoding systems are based on ASCII; however, they support many more characters.
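The relationship between 7-bit ASCII and an 8-bit, ASCII-based set such as ISO/IEC 8859-1 can be checked with a brief Python sketch. The sample words are arbitrary, and the variable names are assumptions made for this illustration.

```python
# Every ASCII character fits in 7 bits, so each byte value is below 128.
ascii_bytes = "Web".encode("ascii")
fits_in_7_bits = all(byte < 128 for byte in ascii_bytes)

# An accented letter is outside ASCII but inside ISO/IEC 8859-1,
# where it occupies a single byte with the eighth bit set.
latin1_bytes = "café".encode("iso-8859-1")
try:
    "café".encode("ascii")
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False

# "ASCII-based" means the first 128 positions are identical in both sets.
first_128 = bytes(range(128))
same_lower_half = first_128.decode("ascii") == first_128.decode("iso-8859-1")

print(fits_in_7_bits, latin1_bytes, ascii_can_encode, same_lower_half)
```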
If anything other than the most basic Latin characters is needed, many characters on your web site will be displayed
incorrectly unless an appropriate character encoding is specified. Character encoding standards define not only the identity of
each character and its associated numeric value (codepoint¹), but also the way this value is represented in the bits of
the file to be encoded.
¹ Codepoints are code positions that can be any of the numerical values that form the codespace of a character encoding.
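The distinction between a character's codepoint and its representation in bytes can be made concrete with a short Python sketch. It also shows what happens when the bytes are interpreted with the wrong encoding; the sample character is an arbitrary choice.

```python
ch = "é"

# The codepoint is the character's numeric identity.
code_point = ord(ch)
label = f"U+{code_point:04X}"

# The same codepoint is represented by different bytes in different encodings.
as_latin1 = ch.encode("iso-8859-1")
as_utf8 = ch.encode("utf-8")
as_utf16 = ch.encode("utf-16-be")

# If UTF-8 bytes are interpreted as Windows-1252, each byte is shown
# as a separate, wrong character.
garbled = as_utf8.decode("cp1252")

print(label, as_latin1, as_utf8, as_utf16, garbled)
```

One codepoint, three byte sequences: this is why the encoding must be declared, since the bytes alone do not say how they should be read.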
 