UTF-8 (UCS Transformation Format 8)
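UTF-8 is a variable-length encoding that uses 1 to 6 octets to represent a character from UCS. All ASCII
characters are encoded using a single octet, so plain ASCII text is also valid UTF-8. The 5- and 6-octet forms come
from the original UCS-based definition; the current UTF-8 standard (RFC 3629) limits a sequence to 4 octets, which is
enough for every Unicode code point up to U+10FFFF. The octet sequences used for each range of UCS codes are shown in
Table A-2.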
Table A-2. List of Legal UTF-8 Sequences
Number of Octets   Bit Patterns Used (Octet 1, Octet 2, ...)                UCS Code
1                  0xxxxxxx                                                 00000000-0000007F
2                  110xxxxx 10xxxxxx                                        00000080-000007FF
3                  1110xxxx 10xxxxxx 10xxxxxx                               00000800-0000FFFF
4                  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx                      00010000-001FFFFF
5                  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx             00200000-03FFFFFF
6                  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx    04000000-7FFFFFFF
An “x” in the table stands for a bit that may be either 0 or 1. In UTF-8, an octet that starts with a 0 bit represents
an ASCII character. An octet that starts with the bits 110 is the first octet of a 2-octet sequence, an octet that
starts with 1110 is the first octet of a 3-octet sequence, and so on. Every octet other than the first in a multi-octet
sequence starts with the bits 10. Because the role of each octet can be determined from its leading bits, validity
checks are straightforward to implement for UTF-8 encoded data: any octet sequence that does not conform to the
patterns shown in the table is considered invalid.
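The bit patterns in Table A-2 can be turned into code fairly directly. The following is a minimal sketch, not a production codec: the class name Utf8Sketch and its methods encode, sequenceLength, and matchesTable are hypothetical names chosen for illustration. It checks only the octet patterns from the table; it does not reject overlong encodings or surrogate code points, which a strict validator would also need to do.

```java
public class Utf8Sketch {

    // Largest UCS code that fits in a sequence of (index + 1) octets, per Table A-2.
    private static final int[] MAX_CODE = {0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF};

    // Marker bits of the first octet for a sequence of (index + 1) octets.
    private static final int[] LEAD_MARK = {0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC};

    /** Encodes one UCS code (0 to 0x7FFFFFFF) using the bit patterns of Table A-2. */
    public static byte[] encode(int code) {
        if (code < 0) {
            throw new IllegalArgumentException("UCS code must not be negative");
        }
        // Pick the shortest sequence whose range contains the code.
        int n = 1;
        while (code > MAX_CODE[n - 1]) {
            n++;
        }
        byte[] out = new byte[n];
        // Fill continuation octets (10xxxxxx) from last to second, 6 payload bits each.
        for (int i = n - 1; i > 0; i--) {
            out[i] = (byte) (0x80 | (code & 0x3F));
            code >>>= 6;
        }
        // The remaining high bits go into the first octet after its marker bits.
        out[0] = (byte) (LEAD_MARK[n - 1] | code);
        return out;
    }

    /** Returns the sequence length announced by a first octet, or -1 if it cannot start a sequence. */
    public static int sequenceLength(byte first) {
        int b = first & 0xFF;
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        if ((b & 0xFC) == 0xF8) return 5; // 111110xx
        if ((b & 0xFE) == 0xFC) return 6; // 1111110x
        return -1;                        // 10xxxxxx (continuation) or 1111111x
    }

    /** Returns true if every sequence in data matches one of the patterns in Table A-2. */
    public static boolean matchesTable(byte[] data) {
        int i = 0;
        while (i < data.length) {
            int n = sequenceLength(data[i]);
            if (n < 0 || i + n > data.length) {
                return false; // bad first octet or truncated sequence
            }
            for (int k = 1; k < n; k++) {
                if ((data[i + k] & 0xC0) != 0x80) {
                    return false; // continuation octet must be 10xxxxxx
                }
            }
            i += n;
        }
        return true;
    }
}
```

For example, Utf8Sketch.encode(0x20AC) yields the three octets E2 82 AC, and Utf8Sketch.matchesTable rejects the truncated sequence E2 82 because the first octet announces three octets.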
Java and Character Encodings
Java stores and manipulates all characters and strings as Unicode characters. In serialization and in class files
(byte code), Java uses a slightly modified form of the UTF-8 encoding of the Unicode character set. All implementations
of the Java virtual machine are required to support the character encodings shown in Table A-3.
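As a small illustration of how a Java program converts between strings and UTF-8 octets, the sketch below uses String.getBytes and the String constructor with StandardCharsets.UTF_8 (available since Java 7). The class name Utf8RoundTrip and the helper toHex are hypothetical names used only for this example.

```java
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {

    /** Formats octets as space-separated, two-digit uppercase hexadecimal values. */
    public static String toHex(byte[] octets) {
        StringBuilder sb = new StringBuilder();
        for (byte b : octets) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'A' needs 1 octet, U+00E9 needs 2 octets, U+20AC needs 3 octets.
        String text = "A\u00E9\u20AC";

        // Encode the string into UTF-8 octets.
        byte[] octets = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(octets)); // prints 41 C3 A9 E2 82 AC

        // Decode the octets back into a string.
        String decoded = new String(octets, StandardCharsets.UTF_8);
        System.out.println(decoded.equals(text)); // prints true
    }
}
```

Passing the charset explicitly avoids depending on the platform default encoding, which can differ between machines.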
 