Central Asian scripts : Mongolian, Old Turkic, Phags-Pa, and Tibetan
South Asian scripts : Bengali, Brahmi, Devanagari, Gujarati, Gurmukhi, Kaithi, Kannada,
Kharoshthi, Lepcha, Limbu, Malayalam, Meetei Mayek, Ol Chiki, Oriya, Saurashtra,
Sinhala, Syloti Nagri, Tamil, Telugu, Thaana, and Vedic extensions
Southeast Asian scripts : Batak, Balinese, Buginese, Cham, Javanese, Kayah Li, Khmer (with
symbols), Lao, Myanmar (extended), New Tai Lue, Rejang, Sundanese, Tai Le, Tai Tham,
Tai Viet, and Thai
Philippine scripts : Buhid, Hanunoo, Tagalog, and Tagbanwa
East Asian scripts : Bopomofo (extended), CJK unified ideographs (Han, extended),
CJK compatibility ideographs (with supplement), CJK / KangXi radicals, Hangul
Jamo (extended) and syllables, Hiragana, Katakana (with phonetic extensions, Kana
supplement, and half-width Katakana), Kanbun, Lisu, and Yi (with syllables and radicals)
American scripts : Cherokee, Deseret, and Unified Canadian Aboriginal Syllabics
Other scripts : Alphabetic presentation forms, half-width and full-width forms, and ASCII
Symbols and punctuation
Punctuation : General punctuation (ASCII punctuation, Latin-1 punctuation, small form
variants), supplemental punctuation (CJK symbols and punctuation, CJK compatibility
forms, full-width ASCII punctuation, and vertical forms)
Alphanumeric symbols : Letterlike symbols (including Roman symbols), mathematical
alphanumeric symbols, enclosed alphanumerics, enclosed CJK letters and months, CJK
compatibility symbols (including additional squared symbols)
Numbers and digits : Aegean numbers, Ancient Greek numbers, ASCII digits (including
full-width ASCII digits), common Indic number forms, counting rod numerals, Cuneiform
numbers and punctuation, number forms, Rumi numeral symbols, superscripts, and subscripts
Mathematical symbols : Arrows, mathematical alphanumeric symbols, mathematical
operators, and geometric shapes
Other symbols : Alchemical symbols, ancient symbols, Braille patterns, currency
symbols, dingbats, emoticons, game symbols, miscellaneous symbols, musical symbols
(including Ancient Greek musical notation and Byzantine musical symbols), transport
and map symbols, and Yijing symbols
Special characters : Layout controls, invisible operators, tags, and variation selectors
The standard defines three encoding forms (UTF-8, UTF-16, and UTF-32) that share a common repertoire of
characters. They represent the same data using 8-bit, 16-bit, or 32-bit code units, respectively (byte, word, or
double word), and text can be converted from one form to another without loss. No encoding form needs more than
4 bytes (32 bits) for a character: depending on the form chosen, each character is represented as a sequence of
one to four 8-bit bytes, one or two 16-bit code units, or a single 32-bit code unit. UTF-8 and UTF-16 are
variable-width encodings. Because UTF-8 stores each ASCII character in a single byte, it results in smaller files for
English text. However, UTF-8 requires 3 bytes for most East Asian characters (such as the common CJK ideographs),
for which UTF-16 requires only 2 bytes. UTF-32 is a fixed-width encoding: code point calculations can be performed
quickly, but every code point requires 4 bytes.
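
The size trade-offs described above can be checked with a short script. The following is a minimal sketch in Python, which is not part of the original text; the helper name encoded_sizes and the three sample characters are chosen here purely for illustration. The little-endian codec variants are used so that no byte order mark is added to the counts.

```python
def encoded_sizes(text):
    """Return the number of bytes `text` occupies in each encoding form."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The "-le" codecs write no byte order mark, so only the
        # character data itself is counted.
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# Illustrative samples: an ASCII letter, a common CJK ideograph,
# and a character outside the Basic Multilingual Plane.
samples = {
    "Latin capital A (U+0041)": "A",
    "CJK ideograph (U+4E2D)": "\u4e2d",
    "Emoticon (U+1F600)": "\U0001F600",
}

for label, char in samples.items():
    print(label, encoded_sizes(char))

# Converting between encoding forms is lossless: decoding UTF-8 and
# re-encoding as UTF-16 and back yields the same bytes.
original = "A\u4e2d\U0001F600".encode("utf-8")
as_utf16 = original.decode("utf-8").encode("utf-16-le")
round_trip = as_utf16.decode("utf-16-le").encode("utf-8")
```

In this sketch the ASCII letter takes 1, 2, and 4 bytes in UTF-8, UTF-16, and UTF-32; the CJK ideograph takes 3, 2, and 4 bytes; and the emoticon takes 4 bytes in all three forms, because UTF-16 needs two 16-bit code units for it.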
