New Input/Output - Beginning Java 8 Language Features

Character Set

A character is not always stored in one byte. The number of bytes used to store a character depends on the coded character set and the character-encoding scheme. A coded character set is a mapping between a set of abstract characters and a set of integers. A character-encoding scheme is a mapping between a coded character set and a set of octet sequences. Please refer to Appendix A in Beginning Java Fundamentals (ISBN: 978-1-4302-6652-5) for more details on character sets and character encodings.
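To make the point about byte counts concrete, here is a minimal sketch that measures how many bytes a few sample characters occupy under different encodings. The class name BytesPerCharacter and the helper encodedLength() are hypothetical names chosen for this illustration; only String.getBytes(Charset) and the StandardCharsets constants come from the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BytesPerCharacter {
    // Hypothetical helper: number of bytes the text occupies in the given charset
    static int encodedLength(String text, Charset cs) {
        return text.getBytes(cs).length;
    }

    public static void main(String[] args) {
        // Sample characters: Latin capital A, e with acute accent, euro sign
        String[] samples = {"A", "\u00E9", "\u20AC"};
        // UTF-16BE is used instead of UTF-16 so that no byte-order mark is added
        Charset[] charsets = {StandardCharsets.UTF_8, StandardCharsets.UTF_16BE};

        for (String s : samples) {
            for (Charset cs : charsets) {
                System.out.println(s + " in " + cs.name() + ": "
                        + encodedLength(s, cs) + " byte(s)");
            }
        }
    }
}
```

Under UTF-8 the three samples take one, two, and three bytes respectively, while under UTF-16BE each of them takes two bytes.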

An instance of the java.nio.charset.Charset class represents a character set and a character-encoding scheme in a Java program. Examples of some character set names are US-ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE, and UTF-16.

The process of converting a character into a sequence of bytes based on an encoding scheme is called character encoding. The process of converting a sequence of bytes into a character based on an encoding scheme is called decoding.

In NIO, you have the ability to convert a Unicode character to a sequence of bytes and vice versa using an encoding scheme. The java.nio.charset package provides classes to encode/decode a CharBuffer to a ByteBuffer and vice versa. An object of the Charset class represents the encoding scheme. The CharsetEncoder class performs the encoding. The CharsetDecoder class performs the decoding. You can get an object of the Charset class using its forName() method by passing the name of the character set as its argument.
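The lookup described above might be sketched as follows. Charset.forName() throws an unchecked exception when the name is illegal or not supported, so this sketch wraps it in a hypothetical helper, lookupOrUtf8(), that falls back to UTF-8; the fallback policy is an assumption made for the example, not something the text prescribes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetLookup {
    // Hypothetical helper: returns the named charset, or UTF-8 if the name
    // is illegal or the charset is not available in this JVM
    static Charset lookupOrUtf8(String name) {
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // IllegalCharsetNameException and UnsupportedCharsetException
            // are both subclasses of IllegalArgumentException
            return StandardCharsets.UTF_8;
        }
    }

    public static void main(String[] args) {
        Charset cs = lookupOrUtf8("UTF-8");
        System.out.println("Canonical name: " + cs.name());
        System.out.println("Aliases: " + cs.aliases());
        System.out.println("Can encode: " + cs.canEncode());

        // The name below is made up, so the helper falls back to UTF-8
        System.out.println(lookupOrUtf8("no-such-charset-xyz").name());
    }
}
```

For the charsets that every Java platform must support, the constants in java.nio.charset.StandardCharsets avoid the name lookup altogether.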

The String and InputStreamReader classes support character encoding and decoding. When you use str.getBytes("UTF-8"), you are encoding the Unicode characters stored in the string object str to a sequence of bytes using the UTF-8 encoding scheme. When you use the constructor of the String class String(byte[] bytes, Charset charset) to create a String object, you are decoding the sequence of bytes in the bytes array from the specified character set into Unicode characters. You are also decoding a sequence of bytes from an input stream into Unicode characters when you create an object of the InputStreamReader class using a character set.
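The three operations in the preceding paragraph can be tried out with a short sketch. The class name StringCodecDemo and the helper decodeWithReader() are hypothetical; the sketch uses the getBytes(Charset) overload rather than getBytes(String) only to avoid the checked UnsupportedEncodingException.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StringCodecDemo {
    // Hypothetical helper: decodes all bytes through an InputStreamReader
    static String decodeWithReader(byte[] bytes, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), cs)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String str = "caf\u00E9"; // four characters, the last one is non-ASCII

        // Encoding: Unicode characters to UTF-8 bytes
        byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
        System.out.println("Encoded length: " + bytes.length);

        // Decoding with the String constructor
        String viaConstructor = new String(bytes, StandardCharsets.UTF_8);
        System.out.println("Constructor round trip ok: " + str.equals(viaConstructor));

        // Decoding with an InputStreamReader
        String viaReader = decodeWithReader(bytes, StandardCharsets.UTF_8);
        System.out.println("Reader round trip ok: " + str.equals(viaReader));
    }
}
```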

For simple encoding and decoding tasks, you can use the encode() and decode() methods of the Charset class. Let's encode the sequence of characters in the string Hello, stored in a character buffer, and then decode it using the UTF-8 encoding scheme. The snippet of code to achieve this is as follows:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

// Get a Charset object for UTF-8 encoding
Charset cs = Charset.forName("UTF-8");

// Character buffer to be encoded
CharBuffer cb = CharBuffer.wrap("Hello");

// Encode character buffer into a byte buffer
ByteBuffer encodedData = cs.encode(cb);

// Decode the byte buffer back to a character buffer
CharBuffer decodedData = cs.decode(encodedData);

The encode() and decode() methods of the Charset class are easy to use. However, they cannot be used in all situations. They require you to know the complete input in advance. Sometimes you do not know the data to be encoded/decoded in advance, for example, when it arrives in pieces.
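One way this limitation can show up is when bytes arrive in pieces and a multi-byte character is split across two pieces. The following sketch, which uses the hypothetical helper decodeChunksSeparately(), decodes each piece on its own with Charset.decode(). Because each call treats its input as complete, the split character is not recovered and the Unicode replacement character U+FFFD appears instead.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ChunkedDecodeProblem {
    // Hypothetical helper: decodes every chunk independently and joins the results
    static String decodeChunksSeparately(byte[]... chunks) {
        StringBuilder sb = new StringBuilder();
        for (byte[] chunk : chunks) {
            sb.append(StandardCharsets.UTF_8.decode(ByteBuffer.wrap(chunk)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "\u00E9"; // e with acute accent, two bytes in UTF-8
        byte[] all = original.getBytes(StandardCharsets.UTF_8);

        // Split the two-byte sequence into two one-byte chunks
        byte[] first = Arrays.copyOfRange(all, 0, 1);
        byte[] second = Arrays.copyOfRange(all, 1, 2);

        System.out.println("Whole input decodes correctly: "
                + decodeChunksSeparately(all).equals(original));
        System.out.println("Split input decodes correctly: "
                + decodeChunksSeparately(first, second).equals(original));
    }
}
```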

The CharsetEncoder and CharsetDecoder classes provide much more power during the encoding and decoding process. They accept a chunk of input to be encoded or decoded, so the input does not have to be available all at once. The encode() and decode() methods of the Charset class return the encoded and decoded buffers to you. However, CharsetEncoder and CharsetDecoder will let you use your own buffers for input and output data. The power comes with a little complexity! If you want more powerful encoding/decoding, you will need to use the following five classes instead of just the Charset class:

• Charset

• CharsetEncoder
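As a rough illustration of chunk-wise processing with caller-supplied buffers, here is a minimal sketch built on CharsetDecoder. The class name ChunkedDecoder, the helpers decodeChunks() and drain(), and the 64-element buffer sizes are assumptions made for this example. The sketch also assumes each chunk fits into the free space of the input buffer, and it treats malformed input as an error.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class ChunkedDecoder {
    // Hypothetical helper: decodes UTF-8 input that arrives as separate chunks
    static String decodeChunks(byte[]... chunks) {
        if (chunks.length == 0) {
            return "";
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.allocate(64);  // caller-owned input buffer
        CharBuffer out = CharBuffer.allocate(64); // caller-owned output buffer
        StringBuilder sb = new StringBuilder();

        for (int i = 0; i < chunks.length; i++) {
            in.put(chunks[i]); // assumes the chunk fits into the remaining space
            in.flip();
            boolean endOfInput = (i == chunks.length - 1);
            CoderResult result;
            do {
                result = decoder.decode(in, out, endOfInput);
                if (result.isError()) {
                    throw new IllegalArgumentException("Cannot decode input: " + result);
                }
                drain(out, sb);
            } while (result.isOverflow());
            // Keep bytes of an incomplete character for the next chunk
            in.compact();
        }

        // Let the decoder write any internally buffered output
        CoderResult flushResult;
        do {
            flushResult = decoder.flush(out);
            drain(out, sb);
        } while (flushResult.isOverflow());
        return sb.toString();
    }

    // Moves the decoded characters from the output buffer to the builder
    private static void drain(CharBuffer out, StringBuilder sb) {
        out.flip();
        sb.append(out);
        out.clear();
    }

    public static void main(String[] args) {
        // A two-byte UTF-8 character split across two chunks
        byte[] first = {(byte) 0xC3};
        byte[] second = {(byte) 0xA9};
        System.out.println(decodeChunks(first, second).equals("\u00E9"));
    }
}
```

Because the decoder keeps its state between calls and the leftover bytes stay in the input buffer, a character split across two chunks is decoded once its remaining bytes arrive.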
