Internationalization - Web Standards: Mastering HTML5, CSS3, and XML

• Central Asian scripts : Mongolian, Old Turkic, Phags-Pa, and Tibetan

• South Asian scripts : Bengali, Brahmi, Devanagari, Gujarati, Gurmukhi, Kaithi, Kannada, Kharoshthi, Lepcha, Limbu, Malayalam, Meetei Mayek, Ol Chiki, Oriya, Saurashtra, Sinhala, Syloti Nagri, Tamil, Telugu, Thaana, and Vedic extensions

• Southeast Asian scripts : Batak, Balinese, Buginese, Cham, Javanese, Kayah Li, Khmer (with symbols), Lao, Myanmar (extended), New Tai Lue, Rejang, Sundanese, Tai Le, Tai Tham, Tai Viet, and Thai

• Philippine scripts : Buhid, Hanunoo, Tagalog, and Tagbanwa

• East Asian scripts : Bopomofo (extended), CJK unified ideographs (Han, extended), CJK compatibility ideographs (with supplement), CJK / KangXi radicals, Hangul Jamo (extended) and syllables, Hiragana, Katakana (with phonetic extensions, Kana supplement, and half-width Katakana), Kanbun, Lisu, and Yi (with syllables and radicals)

• American scripts : Cherokee, Deseret, and Unified Canadian Aboriginal Syllabics

• Other scripts : Alphabetic presentation forms, half-width and full-width forms, and ASCII characters

• Symbols and punctuation

• Punctuation : General punctuation (ASCII punctuation, Latin-1 punctuation, small form variants), supplemental punctuation (CJK symbols and punctuation, CJK compatibility forms, full-width ASCII punctuation, and vertical forms)

• Alphanumeric symbols : Letterlike symbols (including Roman symbols), mathematical alphanumeric symbols, enclosed alphanumerics, enclosed CJK letters and months, CJK compatibility symbols (including additional squared symbols)

• Numbers and digits : Aegean numbers, Ancient Greek numbers, ASCII digits (including full-width ASCII digits), common Indic number forms, counting rod numerals, Cuneiform numbers and punctuation, number forms, Rumi numeral symbols, superscripts, and subscripts

• Mathematical symbols : Arrows, mathematical alphanumeric symbols, mathematical operators, and geometric shapes

• Other symbols : Alchemical symbols, ancient symbols, Braille patterns, currency symbols, dingbats, emoticons, game symbols, miscellaneous symbols, musical symbols (including Ancient Greek musical notation and Byzantine musical symbols), transport and map symbols, and Yijing symbols

• Special characters : Layout controls, invisible operators, tags, and variation selectors

The standard supports three encoding forms (UTF-8, UTF-16, and UTF-32) that share a common repertoire of characters. They can represent the same data, but with 8-, 16-, or 32-bit code units, respectively (byte, word, or double word), and text can be converted from one form to another without loss. No encoding form needs more than 4 bytes (32 bits) for a character: depending on the form chosen, each character is represented as a sequence of one to four 8-bit bytes, one or two 16-bit code units, or a single 32-bit code unit. UTF-8 and UTF-16 are variable-width encodings. Because UTF-8 encodes ASCII characters in a single byte, it results in smaller files for English text. However, UTF-8 requires 3 bytes for most East Asian characters (for example, CJK ideographs in the Basic Multilingual Plane), for which UTF-16 requires only 2 bytes. UTF-32 code point calculations can be performed quickly, but every code point requires 4 bytes (fixed-width encoding).
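
The size trade-offs between the three encoding forms can be checked with a few lines of Python. The following is only a minimal sketch: the helper name encoded_sizes and the three sample characters are illustrative choices, not part of the standard. It relies on Python's built-in codecs, using the little-endian variants so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes needed for text in each encoding form."""
    # The "-le" codecs are used because plain "utf-16" and "utf-32"
    # prepend a byte order mark, which would inflate the counts.
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# Illustrative samples: an ASCII letter, a CJK ideograph in the Basic
# Multilingual Plane, and a character outside it.
samples = {"Latin letter": "A", "CJK ideograph": "中", "Emoticon": "😀"}

for label, ch in samples.items():
    print(f"{label} U+{ord(ch):04X}: {encoded_sizes(ch)}")

# Converting between encoding forms is lossless: decode, then re-encode.
utf8_bytes = "中".encode("utf-8")
utf16_bytes = utf8_bytes.decode("utf-8").encode("utf-16-le")
assert utf16_bytes.decode("utf-16-le") == "中"
```

For these samples, the ASCII letter takes 1, 2, and 4 bytes in UTF-8, UTF-16, and UTF-32; the ideograph takes 3, 2, and 4 bytes; and the character outside the Basic Multilingual Plane takes 4 bytes in all three forms.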
