O-zone: It is used for Korean Hangul syllabic scripts and for other scripts, like ?. Its range is
A000-D7FF, so 14336 code positions are available in this zone.
S-zone: It is reserved for use with transformation format UTF-16. The transformation format
UTF-16 will be described shortly. Its range is D800-DFFF, so 2048 code positions are available
in this zone.
R-zone: It is known as the restricted zone. It can be used only in special circumstances. One of
the uses of this zone is for specific user-defined characters. However, in this case an agreement
is necessary between the sender and the recipient to communicate successfully. Its range is
E000-FFFD, so 8190 code positions are available in this zone.
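The zone sizes quoted above follow directly from the inclusive ranges (last - first + 1). The following is a minimal sketch that checks the arithmetic and classifies a code position into one of the three zones listed here; the class name BmpZones and its methods are hypothetical names chosen for illustration, not part of any Java API.

```java
/** Illustrative sketch: the O-, S-, and R-zones of the BMP and their sizes. */
final class BmpZones {
    // Number of code positions in an inclusive range.
    static int size(int first, int last) {
        return last - first + 1;
    }

    // Returns the zone name for a code position, or "other" for
    // positions that fall outside the three zones handled here.
    static String zoneOf(int codePosition) {
        if (codePosition >= 0xA000 && codePosition <= 0xD7FF) return "O-zone";
        if (codePosition >= 0xD800 && codePosition <= 0xDFFF) return "S-zone";
        if (codePosition >= 0xE000 && codePosition <= 0xFFFD) return "R-zone";
        return "other";
    }

    public static void main(String[] args) {
        System.out.println("O-zone: " + size(0xA000, 0xD7FF)); // 14336
        System.out.println("S-zone: " + size(0xD800, 0xDFFF)); // 2048
        System.out.println("R-zone: " + size(0xE000, 0xFFFD)); // 8190
        System.out.println(zoneOf(0xAC00)); // O-zone (a Hangul syllable)
    }
}
```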
UCS is closely related to another popular character set called Unicode, which was developed by the Unicode
Consortium. Unicode uses a 2-octet (16-bit) coding structure, so it can accommodate 2^16 (= 65,536) distinct
characters. Unicode can be considered the 16-bit coding of the BMP of UCS. The two character sets, Unicode
and UCS, were developed and are maintained by two different organizations, which cooperate to keep
them compatible. If a computer system uses the Unicode character set to store text, each character
in the text is allocated 16 bits even if every character comes from the ASCII character set. Note that the first
128 characters of Unicode match those of ASCII, and an ASCII character fits in just 8 bits. So,
using 16 bits to represent every character in Unicode wastes computer memory. An alternative would be to
use 8 bits for characters from ASCII and 16 bits for characters outside the ASCII range. However, any method
that uses a different number of bits for different Unicode characters has to be consistent and uniform, so that
no ambiguity arises when data is stored or interchanged between computer systems. This requirement led to the
development of character-encoding methods. Currently, four character-encoding methods are specified in
ISO/IEC 10646-1:
UCS-2
UCS-4
UTF-16
UTF-8
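The storage cost discussed above can be observed by encoding the same text with different methods. The sketch below is only an illustration: it uses the JDK's standard UTF-16BE and UTF-8 charsets (UTF-16BE is chosen so that no byte-order mark is added to the count), and the class name EncodedSize and the helper byteCount are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustrative sketch: how many octets a text needs under two encodings. */
final class EncodedSize {
    // Number of octets produced when the text is encoded with the given charset.
    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String ascii = "Hello"; // five characters, all from ASCII
        // 16 bits per character: 5 * 2 = 10 octets
        System.out.println(byteCount(ascii, StandardCharsets.UTF_16BE)); // 10
        // 8 bits per ASCII character: 5 octets
        System.out.println(byteCount(ascii, StandardCharsets.UTF_8));    // 5

        String nonAscii = "\u00E9"; // one character outside the ASCII range
        System.out.println(byteCount(nonAscii, StandardCharsets.UTF_16BE)); // 2
        System.out.println(byteCount(nonAscii, StandardCharsets.UTF_8));    // 2
    }
}
```

For text that is entirely ASCII, the 16-bit form takes twice as many octets as the 8-bit form, which is the waste the paragraph describes.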
UCS-2
This is the 2-octet BMP form of encoding, which uses two octets to represent a character from the BMP. It
is a fixed-length encoding method. That is, each character from the BMP is represented by exactly two octets.
UCS-4
This encoding method is also called the 4-octet canonical form of encoding, which uses four octets for every character
in UCS. This is also a fixed-length encoding method.
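The two fixed-length forms can be sketched by hand to make the octet counts concrete. This is an illustration only: the class FixedLength and its methods are hypothetical, and big-endian octet order is an assumption made here for readability.

```java
/** Illustrative sketch: fixed-length 2-octet and 4-octet encodings. */
final class FixedLength {
    // UCS-2: exactly two octets for a BMP character (big-endian order assumed).
    static byte[] ucs2(char c) {
        return new byte[] { (byte) (c >>> 8), (byte) c };
    }

    // UCS-4: exactly four octets for any character (big-endian order assumed).
    static byte[] ucs4(int codePosition) {
        return new byte[] {
            (byte) (codePosition >>> 24), (byte) (codePosition >>> 16),
            (byte) (codePosition >>> 8),  (byte) codePosition
        };
    }

    // Formats octets as uppercase hexadecimal digits.
    static String hex(byte[] octets) {
        StringBuilder sb = new StringBuilder();
        for (byte b : octets) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(ucs2('A')));     // 0041
        System.out.println(hex(ucs4('A')));     // 00000041
        System.out.println(hex(ucs4(0x10400))); // 00010400 (outside the BMP)
    }
}
```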
UTF-16 (UCS Transformation Format 16)
Once characters outside the BMP are used, the UCS-2 encoding method cannot represent them. In
that case, the encoding would have to switch to UCS-4, which doubles the use of resources such as memory and
network bandwidth. The UTF-16 transformation format was designed to avoid the waste of memory and
other resources that using UCS-4 would cause. UTF-16 is a variable-length
encoding method. It uses two octets, exactly as UCS-2 does, for every character within the BMP, and four octets,
in the form of a pair of 16-bit values taken from the S-zone (D800-DFFF), for every character outside the BMP.
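Java's char type is a 16-bit UTF-16 code unit, so the variable-length behavior can be observed with the standard Character.charCount and Character.toChars methods. The sketch below is an illustration; the class Utf16Demo and the helper units are hypothetical names, and U+10400 is simply an arbitrary example of a character outside the BMP.

```java
/** Illustrative sketch: 16-bit units that UTF-16 uses for one character. */
final class Utf16Demo {
    // Returns the UTF-16 code units of a code point as space-separated hex.
    static String units(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Inside the BMP: one 16-bit unit (two octets).
        System.out.println(Character.charCount(0x0041));  // 1
        System.out.println(units(0x0041));                // 0041

        // Outside the BMP: two 16-bit units (four octets), both in D800-DFFF.
        System.out.println(Character.charCount(0x10400)); // 2
        System.out.println(units(0x10400));               // D801 DC00
    }
}
```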
 