The set of supported characters depends on the character encoding, which is usually one of the following:
UTF : UTF-8/UTF-16/UTF-32 (Unicode, worldwide)
ISO standards : ISO-8859-1 (Western Europe), ISO-8859-2 (Central Europe), ISO-8859-3
(Southern Europe), ISO-8859-4 (Northern Europe), ISO-8859-5 (Cyrillic), ISO-8859-6-i
(Arabic), ISO-8859-7 (Greek), ISO-8859-8 (Hebrew, visual), ISO-8859-8-i (Hebrew, logical),
ISO-8859-9 (Turkish), ISO-8859-10 (Latin 6), ISO-8859-11 (Latin/Thai), ISO-8859-13 (Latin 7,
Baltic Rim), ISO-8859-14 (Latin 8, Celtic), ISO-8859-15 (Latin 9), ISO-8859-16 (Latin 10),
ISO-2022-JP (Japanese, e-mail), ISO-IR-111 (Cyrillic KOI-8)
US-ASCII (basic English)
Windows : Windows-1250 (Central Europe), Windows-1251 (Cyrillic), Windows-1252 (Western
Europe), Windows-1253 (Greek), Windows-1254 (Turkish), Windows-1255 (Hebrew),
Windows-1256 (Arabic), Windows-1257 (Baltic Rim)
Encodings for Asian languages : EUC-JP (Japanese, Unix), Shift_JIS (Japanese, Win/Mac),
EUC-KR (Korean), GB2312 (Chinese, simplified), GB18030 (Chinese, simplified), Big5 (Chinese,
traditional), Big5-HKSCS (Chinese, Hong Kong), TIS-620 (Thai)
Other : KOI8-R (Russian), KOI8-U (Ukrainian), Macintosh (MacRoman), and so on.
In spite of this wide variety, only the encodings of a single standard, Unicode (UTF-8, UTF-16, or UTF-32), should
be used unless there is a very good reason not to do so.
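The dependence of the supported character set on the encoding can be checked directly. The following is a minimal Python sketch, not a definitive tool: the sample characters, the selection of encodings, and the helper names can_encode and coverage are assumptions chosen for illustration. It reports which of the sample scripts each encoding is able to represent.

```python
# Sample characters from different scripts (chosen for illustration only).
SAMPLES = {"Latin": "\u00e9", "Cyrillic": "\u0416", "Greek": "\u03bb", "Japanese": "\u65e5"}

# Encoding names as accepted by Python's codec registry.
ENCODINGS = ["ascii", "iso-8859-1", "windows-1251", "iso-8859-7", "shift_jis", "utf-8"]


def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def coverage(samples, encodings):
    """Map each encoding name to the list of sample scripts it can represent."""
    return {
        enc: [script for script, ch in samples.items() if can_encode(ch, enc)]
        for enc in encodings
    }


if __name__ == "__main__":
    for enc, scripts in coverage(SAMPLES, ENCODINGS).items():
        print(f"{enc:13} {', '.join(scripts) or '(none)'}")
```

In this sketch the legacy encodings each cover only part of the samples, while UTF-8 covers all of them, which is the practical argument for preferring a Unicode encoding.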
Unicode is a standard for universal character encoding, designed to represent the characters of all written
languages of the world [1]. Beyond the characters of natural languages and widely used notations, many historic
scripts are also covered. Unicode provides codes for approximately 137,000 characters covering more than 130 scripts
(including historic ones such as Egyptian hieroglyphs), spanning alphabets, ideograph sets, and symbols. Moreover, the
Unicode codespace contains 1,114,112 code points (U+0000 to U+10FFFF). The Unicode Character Code Charts provide quick
access to any character and its code point [2]. The chart classifications also give an insight into the wonderful
richness of languages and fields supported by Unicode:
European scripts : Armenian (including ligatures), Coptic (including Coptic in Greek
block), Cypriot syllabary, Cyrillic, Georgian, Glagolitic, Gothic, Greek, Latin (extended,
including ligatures and fullwidth Latin letters), Linear B (with syllabary and ideograms),
Ogham, Old Italic, Phaistos Disc, Runic, and Shavian
Phonetic symbols : IPA extensions, phonetic extensions, modifier tone letters, spacing
modifier letters, superscripts and subscripts
Combining diacritics : Combining diacritical marks and combining half marks
African scripts : Bamum, Egyptian hieroglyphs, Ethiopic, N'Ko, Osmanya, Tifinagh, and Vai
Middle Eastern scripts : Arabic, Imperial Aramaic, Avestan, Carian, Cuneiform (including
numbers and punctuation, Old Persian, and Ugaritic), Hebrew, Lycian, Lydian, Mandaic,
Old South Arabian, inscriptional Pahlavi, inscriptional Parthian, Phoenician, Samaritan,
and Syriac
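
Each character in the script groups listed above has a fixed code point in the Unicode codespace. The following Python sketch uses the standard unicodedata module to look up a few of them. The choice of sample characters and the helper name describe are assumptions made for illustration, and the character names returned depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

# Size of the Unicode codespace: code points U+0000 through U+10FFFF.
CODESPACE_SIZE = 0x10FFFF + 1


def describe(ch):
    """Return (code point label, character name, UTF-8 length in bytes)."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), len(ch.encode("utf-8")))


# One sample character from several of the listed scripts (illustrative choice):
# Armenian, Runic, Ethiopic, Hebrew, and Egyptian hieroglyphs.
SAMPLES = ["\u0531", "\u16A0", "\u1200", "\u05D0", "\U00013000"]

if __name__ == "__main__":
    print(f"Codespace size: {CODESPACE_SIZE:,} code points")
    for ch in SAMPLES:
        code_point, name, utf8_len = describe(ch)
        print(f"{code_point:8} {name} ({utf8_len} bytes in UTF-8)")
```

Characters outside the Basic Multilingual Plane, such as the Egyptian hieroglyph in the sample, have code points above U+FFFF and take four bytes in UTF-8.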
