Internationalization - Web Standards: Mastering HTML5, CSS3, and XML

The set of supported characters depends on the character encoding, which is usually one of the following:

• UTF: UTF-8/UTF-16/UTF-32 (Unicode, worldwide)

• ISO standards: ISO-8859-1 (Western Europe), ISO-8859-2 (Central Europe), ISO-8859-3 (Southern Europe), ISO-8859-4 (Northern Europe), ISO-8859-5 (Cyrillic), ISO-8859-6-I (Arabic), ISO-8859-7 (Greek), ISO-8859-8 (Hebrew, visual), ISO-8859-8-I (Hebrew, logical), ISO-8859-9 (Turkish), ISO-8859-10 (Latin 6), ISO-8859-11 (Latin/Thai), ISO-8859-13 (Latin 7, Baltic Rim), ISO-8859-14 (Latin 8, Celtic), ISO-8859-15 (Latin 9), ISO-8859-16 (Latin 10), ISO-2022-JP (Japanese, e-mail), ISO-IR-111 (Cyrillic KOI-8)

• US-ASCII (basic English)

• Windows: Windows-1250 (Central Europe), Windows-1251 (Cyrillic), Windows-1252 (Western Europe), Windows-1253 (Greek), Windows-1254 (Turkish), Windows-1255 (Hebrew), Windows-1256 (Arabic), Windows-1257 (Baltic Rim)

• Encodings for eastern languages: EUC-JP (Japanese, Unix), Shift_JIS (Japanese, Win/Mac), EUC-KR (Korean), GB2312 (Chinese, simplified), GB18030 (Chinese, simplified), Big5 (Chinese, traditional), Big5-HKSCS (Chinese, Hong Kong), TIS-620 (Thai)

• Other: KOI8-R (Russian), KOI8-U (Ukrainian), Macintosh (MacRoman), and so on.
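
To illustrate how the set of supported characters depends on the encoding, the following Python sketch checks whether a string can be represented in a given encoding. The helper name can_encode and the sample strings are illustrative assumptions, not part of any standard; the codec names are those registered in Python's standard library for a few of the encodings listed above.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical sample strings, one per language.
samples = {
    "English": "Hello",
    "French": "déjà vu",
    "Greek": "Ελληνικά",
    "Russian": "Привет",
}
encodings = ["us-ascii", "iso-8859-1", "iso-8859-7", "koi8-r", "utf-8"]

for language, text in samples.items():
    supported = [enc for enc in encodings if can_encode(text, enc)]
    print(f"{language}: {', '.join(supported)}")
```

Each legacy encoding accepts only the samples written in the scripts it was designed for, whereas UTF-8 accepts all of them.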

Despite this wide variety, only the Unicode encodings (UTF-8, UTF-16, or UTF-32) should be used unless there is a very good reason not to do so.
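
The Unicode encodings differ only in how they serialize the same code points into bytes. The following Python sketch compares the byte length of a few characters in each of them; the function name encoded_sizes and the chosen characters are illustrative assumptions.

```python
def encoded_sizes(char: str) -> dict:
    """Return the byte length of one character in each Unicode encoding."""
    # The explicit little-endian codecs are used so that no byte order mark
    # (BOM) is prepended and the lengths reflect the character alone.
    return {
        "utf-8": len(char.encode("utf-8")),
        "utf-16": len(char.encode("utf-16-le")),
        "utf-32": len(char.encode("utf-32-le")),
    }


# Latin letter, accented letter, euro sign, and an Egyptian hieroglyph.
for char in ["A", "\u00E9", "\u20AC", "\U00013000"]:
    print(f"U+{ord(char):04X}", encoded_sizes(char))
    # Every Unicode encoding round-trips the character without loss.
    for codec in ("utf-8", "utf-16-le", "utf-32-le"):
        assert char.encode(codec).decode(codec) == char
```

UTF-8 uses one to four bytes per character, UTF-16 uses two or four, and UTF-32 always uses four, so the choice among them affects size and processing rather than which characters can be represented.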

Unicode

Unicode is a standard for universal character encoding, which is capable of representing the characters of the written languages of the world [1]. Beyond the characters of natural languages and widely used notations, many historic scripts are also covered. Unicode provides codes for approximately 137,000 characters covering 122 scripts (even historic ones such as Egyptian hieroglyphs), including alphabets, ideograph sets, and symbols. Moreover, the Unicode codespace supports more than a million codepoints. The Unicode Character Code Charts provide quick access to any character and its codepoint [2]. The classification used in the charts also gives an insight into the wonderful richness of languages and fields supported by Unicode:

• Scripts

    • European scripts: Armenian (including ligatures), Coptic (including Coptic in Greek block), Cypriot syllabary, Cyrillic, Georgian, Glagolitic, Gothic, Greek, Latin (extended, including ligatures and fullwidth Latin letters), Linear B (with syllabary and ideograms), Ogham, Old Italic, Phaistos Disc, Runic, and Shavian

    • Phonetic symbols: IPA extensions, phonetic extensions, modifier tone letters, spacing modifier letters, superscripts and subscripts

    • Combining diacritics: Combining diacritical marks and combining half marks

    • African scripts: Bamum, Egyptian hieroglyphs, Ethiopic, N'Ko, Osmanya, Tifinagh, and Vai

    • Middle Eastern scripts: Arabic, Imperial Aramaic, Avestan, Carian, Cuneiform (including numbers and punctuation, Old Persian, and Ugaritic), Hebrew, Lycian, Lydian, Mandaic, Old South Arabian, inscriptional Pahlavi, inscriptional Parthian, Phoenician, Samaritan, and Syriac
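
The size of the codespace and the codepoints of individual characters can be inspected programmatically. The sketch below uses Python's standard unicodedata module; the constant names and the describe helper are illustrative assumptions, and the character names available depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# The codespace runs from U+0000 to U+10FFFF, organized as 17 planes
# of 65,536 codepoints each.
MAX_CODE_POINT = 0x10FFFF
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000

codespace_size = MAX_CODE_POINT + 1
assert codespace_size == PLANES * CODE_POINTS_PER_PLANE
print(f"Codespace size: {codespace_size:,}")  # 1,114,112


def describe(char: str) -> str:
    """Return the codepoint and official name of a single character."""
    return f"U+{ord(char):04X} {unicodedata.name(char, '<unnamed>')}"


# One character each from Armenian, Greek, Cyrillic, Hebrew,
# and Egyptian hieroglyphs.
for char in ["\u0531", "\u03A9", "\u042F", "\u05D0", "\U00013000"]:
    print(describe(char))
```

Only a fraction of the codespace is assigned to characters; the rest is reserved for future use, private use, or internal purposes such as surrogates.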
