Java Reference
In-Depth Information
Table A-3. List of the Character Encodings Supported by a JVM

Character Encoding   Description
------------------   -----------
US-ASCII             7-bit ASCII (also known as ISO646-US, the Basic Latin block of the Unicode character set)
ISO-8859-1           ISO Latin Alphabet No. 1 (also known as ISO-LATIN-1)
UTF-8                8-bit Unicode Transformation Format
UTF-16BE             16-bit Unicode Transformation Format, big-endian byte order. Big-endian was discussed in Chapter 3
UTF-16LE             16-bit Unicode Transformation Format, little-endian byte order. Little-endian was discussed in Chapter 3
UTF-16               16-bit Unicode Transformation Format, byte order identified by an optional initial byte-order mark
                     (on input, either order is accepted and big-endian is assumed if no mark is present; on output,
                     big-endian is used and a byte-order mark is written)
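The difference between the three UTF-16 entries in the table is easiest to see in the bytes they produce. The following is a minimal sketch, not part of the original listing; the class name Utf16ByteOrderDemo and the hex helper are made up for illustration, and only the standard java.nio.charset.StandardCharsets constants are assumed.

```java
import java.nio.charset.StandardCharsets;

public class Utf16ByteOrderDemo {
    // Illustrative helper: renders bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "A"; // the single character U+0041
        // Big-endian: most significant byte first, no byte-order mark.
        System.out.println("UTF-16BE: " + hex(s.getBytes(StandardCharsets.UTF_16BE))); // 00 41
        // Little-endian: least significant byte first, no byte-order mark.
        System.out.println("UTF-16LE: " + hex(s.getBytes(StandardCharsets.UTF_16LE))); // 41 00
        // Plain UTF-16: a big-endian byte-order mark (FE FF) followed by big-endian data.
        System.out.println("UTF-16  : " + hex(s.getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
    }
}
```

Using the StandardCharsets constants rather than charset names as strings avoids both typos and the checked UnsupportedEncodingException.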
Java also supports a modified UTF-8 format (used, for example, in class files and by the DataInput and
DataOutput interfaces) that differs from standard UTF-8 in the following two significant ways:

• Java uses 16 bits (the two-byte sequence 0xC0 0x80) to represent a NUL character in a class file, whereas
  standard UTF-8 uses only 8 bits (the single byte 0x00). As a result, an encoded string never contains an
  embedded zero byte, which makes it easier for other languages, in which a NUL byte is not allowed within a
  string, to parse a Java class file. However, in some cases Java uses standard UTF-8 format to represent the
  NUL character.
• Java uses only the 1-octet, 2-octet, and 3-octet UTF-8 formats, whereas standard UTF-8 also uses 4-octet
  sequences (its original definition allowed 5-octet and 6-octet sequences as well). Every character in the
  range U+0000 to U+FFFF fits in the 1-, 2-, or 3-octet format. A supplementary character (above U+FFFF) is
  represented as a surrogate pair, and each of the two surrogates is encoded separately in the 3-octet format.
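The two differences can be observed by comparing the bytes written by DataOutputStream.writeUTF, which uses the modified format, with those produced by String.getBytes for standard UTF-8. This is only a sketch under those assumptions; the class name ModifiedUtf8Demo and the modifiedUtf8 helper are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {
    // Illustrative helper: encodes s with writeUTF and drops the
    // 2-byte length prefix that writeUTF puts in front of the data.
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) throws IOException {
        String nul = "\0";                     // the NUL character, U+0000
        String supplementary = "\uD83D\uDE00"; // U+1F600 written as a surrogate pair

        // NUL: one byte in standard UTF-8, two bytes in the modified format.
        System.out.println(nul.getBytes(StandardCharsets.UTF_8).length);           // 1
        System.out.println(modifiedUtf8(nul).length);                              // 2

        // Supplementary character: one 4-octet sequence in standard UTF-8,
        // two 3-octet sequences (one per surrogate) in the modified format.
        System.out.println(supplementary.getBytes(StandardCharsets.UTF_8).length); // 4
        System.out.println(modifiedUtf8(supplementary).length);                    // 6
    }
}
```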
When you compile Java source code, the Java compiler assumes by default that the source code file has been
written using the platform's default encoding (also known as the local code page or native encoding). Historically,
the platform's default character encoding was Latin-1 on Windows and Solaris and MacRoman on Mac; on JDK 18 and
later, the default is UTF-8 on every platform unless configured otherwise. Note that Windows does not use true
Latin-1 character encoding. It uses a variation of Latin-1 (windows-1252) that includes fewer control characters
and more printing characters. You can specify a file-encoding name (or code page name) to control how the compiler
interprets characters beyond the ASCII character set: at compile time, pass the name of the character encoding used
in your source code file to the Java compiler with the -encoding option. The following command tells the Java
compiler (javac) that the Java source code Test.java has been written using a traditional Chinese encoding named
Big5. The Java compiler then converts all characters encoded in Big5 to Unicode.
javac -encoding Big5 Test.java
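Why the compiler needs to be told the encoding can be illustrated with an analogy: the same bytes yield different characters depending on the charset used to decode them. The sketch below is not how javac is implemented; the class name SourceEncodingDemo and the decodeAs helper are hypothetical, and it uses UTF-8 and ISO-8859-1 rather than Big5 because those two charsets are guaranteed to be available on every JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SourceEncodingDemo {
    // Illustrative helper: interprets raw file bytes using the given charset.
    static String decodeAs(byte[] fileBytes, Charset charset) {
        return new String(fileBytes, charset);
    }

    public static void main(String[] args) {
        // Pretend these are the bytes of a source file saved as UTF-8.
        // U+00E9 (e with acute accent) is stored as the two bytes C3 A9.
        byte[] fileBytes = "\u00E9".getBytes(StandardCharsets.UTF_8);

        // Decoding with the right charset recovers the single original character.
        String right = decodeAs(fileBytes, StandardCharsets.UTF_8);
        System.out.println(right.length()); // 1

        // Decoding with the wrong charset turns each byte into its own character.
        String wrong = decodeAs(fileBytes, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length()); // 2
    }
}
```

Passing the correct name to -encoding plays the role of choosing the right charset in decodeAs.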
JDK 8 and earlier include the native2ascii tool (it was removed in JDK 9), which can be used to convert files that
are written in other character encodings into files containing Latin-1 and/or Unicode-encoded characters. The
general syntax of the native2ascii tool is
native2ascii [options] [inputfile] [outputfile]
For example, the following command converts every character in the Source.java file that is outside the ASCII
character set into a Unicode escape (\uXXXX) and places the output in the Destination.java file, assuming that
the Source.java file has been written using the platform's default encoding:
native2ascii Source.java Destination.java
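The kind of conversion the tool performs can be approximated in a few lines of Java. This is a rough sketch of the idea rather than the tool's actual implementation: it assumes that every character above 0x7F is replaced by a Unicode escape with four lowercase hexadecimal digits, and the class name AsciiEscapeDemo and the method toAsciiEscapes are hypothetical.

```java
public class AsciiEscapeDemo {
    // Illustrative helper: copies ASCII characters unchanged and replaces every
    // other character with a backslash, the letter u, and four hex digits.
    static String toAsciiEscapes(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 0x7F) {
                // "\\u" is an escaped backslash followed by the letter u.
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The last character of the input is U+00E9; the output is pure ASCII.
        System.out.println(toAsciiEscapes("caf\u00E9"));
    }
}
```

Because the loop works on char values, a supplementary character comes out as two escapes, one for each surrogate.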
 