Unicode, Internationalization, and Currency Codes - Java 8 Recipes

if (legacySJIS.length == toSJIS.length) {
    // Assume the arrays match, then clear the flag at the first differing byte.
    same = true;
    for (int x = 0; x < legacySJIS.length; x++) {
        if (legacySJIS[x] != toSJIS[x]) {
            same = false;
            break;
        }
    }
}
System.out.printf("Same: %s\n", same);

As expected, the output indicates that the round-trip conversion back to the legacy
encoding was successful. The original byte array and the converted byte array contain
the same bytes:

Same: true
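
The byte-by-byte comparison above can also be expressed with java.util.Arrays.equals. The following is a minimal sketch, not the recipe's own listing: the class name RoundTripCheck, the helper roundTrips, and the sample text are assumptions made for illustration, and it presumes that the Shift_JIS charset is installed on your platform.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class RoundTripCheck {

    // Hypothetical helper: decode legacy bytes to a Unicode String,
    // encode the String back to the same charset, and compare the arrays.
    static boolean roundTrips(byte[] legacyBytes, Charset charset) {
        String unicode = new String(legacyBytes, charset);
        byte[] reEncoded = unicode.getBytes(charset);
        return Arrays.equals(legacyBytes, reEncoded);
    }

    public static void main(String[] args) {
        // Assumes Shift_JIS is available; forName throws otherwise.
        Charset sjis = Charset.forName("Shift_JIS");
        // Sample Japanese text ("konnichiwa sekai desu"), written with
        // Unicode escapes so the source file stays pure ASCII.
        String text = "\u3053\u3093\u306B\u3061\u306F\u4E16\u754C\u3067\u3059";
        byte[] legacySJIS = text.getBytes(sjis);
        System.out.printf("Same: %s%n", roundTrips(legacySJIS, sjis));
    }
}
```

Arrays.equals returns false when the lengths differ, so no separate length check is needed.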

How It Works

The Java platform provides conversion support for many legacy character set encodings.
When you create a String instance from a byte array, provide a charset argument to the
String constructor so that the platform knows how to map the legacy encoding to Unicode.
If you omit the argument, the platform's default charset is used, which may not match
your data. All Java strings use Unicode (UTF-16 code units) as their native encoding.
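
As a small illustration of why the charset argument matters, the sketch below decodes the same two bytes with two different charsets. The class name DecodeWithCharset and the helper decode are hypothetical names chosen for this example; the String constructor and the StandardCharsets constants are standard Java APIs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeWithCharset {

    // Hypothetical helper: thin wrapper around the String(byte[], Charset) constructor.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The UTF-8 encoding of U+00E9 (Latin small letter e with acute).
        byte[] utf8Bytes = { (byte) 0xC3, (byte) 0xA9 };

        String correct = decode(utf8Bytes, StandardCharsets.UTF_8);
        String wrong = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        // Decoding with the matching charset yields one character;
        // decoding with the wrong charset yields two unrelated characters.
        System.out.println(correct.length()); // 1
        System.out.println(wrong.length());   // 2
    }
}
```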

The number of bytes in the original array does not usually equal the number of
characters in the result string. In this recipe's example, the original array contains 18
bytes. The 18 bytes are needed by the Shift-JIS encoding to represent the Japanese text.
However, after conversion, the result string contains nine characters. There is not a 1:1
relationship between bytes and characters. In this example, each character requires two
bytes in the original Shift-JIS encoding.
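
The byte and character counts can be checked directly. This sketch uses a nine-character sample string that is an assumption for illustration, not necessarily the recipe's own text. The class name ByteVersusCharCount and the helper byteCount are likewise hypothetical, and Shift_JIS is assumed to be installed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteVersusCharCount {

    // Hypothetical helper: number of bytes needed to encode text in the given charset.
    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Nine Japanese characters, written as Unicode escapes to keep the source ASCII.
        String text = "\u3053\u3093\u306B\u3061\u306F\u4E16\u754C\u3067\u3059";
        Charset sjis = Charset.forName("Shift_JIS"); // assumes the charset is available

        System.out.println(text.length());                              // 9 characters
        System.out.println(byteCount(text, sjis));                      // 18 bytes
        System.out.println(byteCount(text, StandardCharsets.UTF_8));    // 27 bytes
        System.out.println(byteCount(text, StandardCharsets.UTF_16BE)); // 18 bytes
    }
}
```

The same text needs a different number of bytes in each encoding, while the character count of the String stays the same.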

Hundreds of different charset encodings exist. The exact set available to you depends
on your Java platform implementation. However, every implementation is guaranteed to
support several of the most common encodings, and your platform most likely contains
many more than this minimal set:

• US-ASCII
• ISO-8859-1
• UTF-8
• UTF-16BE
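
To see which encodings a particular platform actually provides, you can query java.nio.charset.Charset at run time. The following is a rough sketch: the class name ListCharsets and the helper isAvailable are hypothetical, and the output varies by Java implementation, so none is shown.

```java
import java.nio.charset.Charset;
import java.util.SortedMap;

class ListCharsets {

    // Hypothetical helper: true if the named charset is supported on this platform.
    static boolean isAvailable(String charsetName) {
        return Charset.isSupported(charsetName);
    }

    public static void main(String[] args) {
        // Canonical charset names mapped to Charset objects, sorted by name.
        SortedMap<String, Charset> all = Charset.availableCharsets();
        System.out.println("Installed charsets: " + all.size());

        // Shift_JIS is not in the guaranteed set, so its availability is checked here.
        String[] names = { "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "Shift_JIS" };
        for (String name : names) {
            System.out.printf("%-12s supported: %s%n", name, isAvailable(name));
        }
    }
}
```

For the guaranteed encodings, the constants in java.nio.charset.StandardCharsets (Java 7 and later) avoid the lookup by name altogether.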
