Character Encodings - Beginning Java 8 Fundamentals


Table A-3. List of the Supported Character Encodings by a JVM

Character Encoding    Description

ASCII                 7-bit ASCII (also known as ISO646-US, the Basic Latin block of the Unicode character set)

ISO-8859-1            ISO Latin Alphabet No. 1 (also known as ISO-LATIN-1)

UTF-8                 8-bit Unicode Transformation Format

UTF-16BE              16-bit Unicode Transformation Format, big-endian byte order. Big-endian was discussed in Chapter 3

UTF-16LE              16-bit Unicode Transformation Format, little-endian byte order. Little-endian was discussed in Chapter 3

UTF-16                16-bit Unicode Transformation Format, byte order specified by a mandatory initial byte-order mark (either order accepted on input, big-endian used on output)
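The byte-order differences among the encodings in the table can be seen by encoding a single character with each of them. The following is a minimal sketch, assuming the java.nio.charset.StandardCharsets constants (available since Java 7); the class name StandardCharsetsDemo and the helper hex() are hypothetical names used only for illustration. Note that the constant for ASCII is named US_ASCII.

```java
import java.nio.charset.StandardCharsets;

final class StandardCharsetsDemo {

    // Formats bytes as space-separated uppercase hex, for example "FE FF 00 41".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "A"; // U+0041, part of the Basic Latin block

        System.out.println("ASCII      : " + hex(text.getBytes(StandardCharsets.US_ASCII)));   // 41
        System.out.println("ISO-8859-1 : " + hex(text.getBytes(StandardCharsets.ISO_8859_1))); // 41
        System.out.println("UTF-8      : " + hex(text.getBytes(StandardCharsets.UTF_8)));      // 41
        System.out.println("UTF-16BE   : " + hex(text.getBytes(StandardCharsets.UTF_16BE)));   // 00 41
        System.out.println("UTF-16LE   : " + hex(text.getBytes(StandardCharsets.UTF_16LE)));   // 41 00
        // UTF-16 output starts with the big-endian byte-order mark FE FF.
        System.out.println("UTF-16     : " + hex(text.getBytes(StandardCharsets.UTF_16)));     // FE FF 00 41
    }
}
```

The last line shows the "big-endian used on output" rule from the table: the encoder writes the byte-order mark first and then the character in big-endian order.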

Java also uses a modified UTF-8 format (for example, for string constants in a class file and in the writeUTF() and readUTF() methods of DataOutput and DataInput). It differs from standard UTF-8 in two significant ways:

• Java uses 16 bits (the two bytes 0xC0 0x80) to represent a NUL character in a class file, whereas standard UTF-8 uses only 8 bits. This compromise makes it easier for other languages to parse a Java class file, because a NUL byte never appears within an encoded string. However, in some cases Java uses the standard UTF-8 format to represent a NUL character, for example, when you encode a string using the UTF-8 charset.

• Java uses only the 1-octet, 2-octet, and 3-octet UTF-8 formats, whereas standard UTF-8 also uses 4-octet sequences (the original UTF-8 definition allowed 5-octet and 6-octet sequences as well). Java represents a character outside the Basic Multilingual Plane as a surrogate pair of two 16-bit values and encodes each value of the pair separately in the 3-octet format, so such a character takes six octets instead of four.
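The two differences can be observed by comparing the bytes written by DataOutputStream.writeUTF(), which uses the modified format, with the bytes produced by the standard UTF-8 charset. This is only a sketch under stated assumptions: the class name ModifiedUtf8Demo and its helper methods are hypothetical, and the code point U+1F600 is chosen arbitrarily as a character outside the Basic Multilingual Plane.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ModifiedUtf8Demo {

    // Encodes s with writeUTF() and drops the 2-byte length prefix that
    // writeUTF() puts in front of the modified UTF-8 bytes.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Encodes s with the standard UTF-8 charset.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String nul = "\0"; // a single NUL character
        String supplementary = new String(Character.toChars(0x1F600)); // one surrogate pair

        System.out.println("NUL, modified UTF-8 : " + modifiedUtf8(nul).length + " bytes");           // 2 bytes
        System.out.println("NUL, standard UTF-8 : " + standardUtf8(nul).length + " byte");            // 1 byte
        System.out.println("U+1F600, modified   : " + modifiedUtf8(supplementary).length + " bytes"); // 6 bytes
        System.out.println("U+1F600, standard   : " + standardUtf8(supplementary).length + " bytes"); // 4 bytes
    }
}
```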

When you compile Java source code, the Java compiler assumes by default that the source file has been written using the platform's default encoding (also known as the local codepage or native encoding). The default encoding depends on the platform; for example, it has traditionally been Latin-1 on Windows and Solaris and MacRoman on Mac. Note that Windows does not use true Latin-1 character encoding. It uses a variation of Latin-1 (Windows-1252) that includes fewer control characters and more printing characters.

You can specify a file-encoding name (or codepage name) to control how the compiler interprets characters beyond the ASCII character set. When you compile your Java source code, you can pass the name of the character encoding used in your source file to the Java compiler with the -encoding option. The following command tells the Java compiler (javac) that the Java source file Test.java has been written using a traditional Chinese encoding named Big5. The compiler then converts all characters encoded in Big5 to Unicode.

javac -encoding Big5 Test.java
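To see why the encoding name matters, consider what happens when the same bytes are decoded with two different encodings. The following sketch is not what javac does internally; it only illustrates the idea with two encodings that every JVM supports. The class name SourceEncodingDemo and its helpers are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class SourceEncodingDemo {

    // Decodes raw file bytes into a String using the given encoding.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Lists the code points of s, for example "U+00C3 U+00A9".
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two bytes that a UTF-8 editor stores for the letter e with an acute accent.
        byte[] fileBytes = {(byte) 0xC3, (byte) 0xA9};

        // The encoding that is assumed when no encoding name is given.
        System.out.println("Platform default: " + Charset.defaultCharset());

        // Decoded with the right encoding: one character.
        System.out.println(codePoints(decode(fileBytes, StandardCharsets.UTF_8)));      // U+00E9

        // Decoded with the wrong encoding: two unrelated characters.
        System.out.println(codePoints(decode(fileBytes, StandardCharsets.ISO_8859_1))); // U+00C3 U+00A9
    }
}
```

A source file saved in one encoding and compiled under another is misread in the same way, which is the situation the -encoding option avoids.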

The JDK includes the native2ascii tool, which converts a file written in another character encoding into a file that contains only ASCII characters and Unicode escapes (\uXXXX). The general syntax of the native2ascii tool is:

native2ascii [options] [inputfile] [outputfile]

For example, the following command converts all non-ASCII characters in the Source.java file into Unicode escapes and places the output in the Destination.java file, assuming that the Source.java file has been written using the platform's default encoding:

native2ascii Source.java Destination.java
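The kind of conversion described here can be sketched in a few lines of Java. This is an illustration of the idea only, not the tool's actual implementation; the class name UnicodeEscapeDemo and the method toAsciiWithEscapes() are hypothetical, and the choice of lowercase hex digits is an assumption.

```java
final class UnicodeEscapeDemo {

    // Replaces every character above U+007F with a \\uXXXX escape and
    // copies ASCII characters unchanged. A character outside the Basic
    // Multilingual Plane becomes two escapes, one per surrogate.
    static String toAsciiWithEscapes(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 0x7F) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The literal below contains the single character U+00E9.
        String line = "String s = \"caf\u00e9\";";
        // Prints the line with U+00E9 written as a six-character escape.
        System.out.println(toAsciiWithEscapes(line));
    }
}
```

Because the result contains only ASCII characters, the compiler reads it the same way regardless of the platform's default encoding.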
