Java Reference
In-Depth Information
Character Set
A character is not always stored in one byte. The number of bytes used to store a character depends on the coded
character set and the character-encoding scheme. A coded character set is a mapping between a set of abstract
characters and a set of integers. A character-encoding scheme is a mapping between a coded character set and a set
of octet sequences. Please refer to Appendix A in Beginning Java Fundamentals (ISBN: 978-1-4302-6652-5) for more
details on character sets and character encoding.
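The following is a minimal sketch of how the byte count for one character varies with the encoding scheme. The class name ByteCountDemo and the helper byteCount() are hypothetical names chosen for illustration; only standard JDK calls are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteCountDemo {
    // Returns the number of bytes the text occupies when encoded with the given charset
    static int byteCount(String text, Charset cs) {
        return text.getBytes(cs).length;
    }

    public static void main(String[] args) {
        // Latin capital A, e with acute accent (U+00E9), and the euro sign (U+20AC)
        String[] samples = {"A", "\u00E9", "\u20AC"};
        Charset[] charsets = {StandardCharsets.UTF_8, StandardCharsets.UTF_16BE};

        for (String sample : samples) {
            for (Charset cs : charsets) {
                System.out.printf("U+%04X in %s: %d byte(s)%n",
                        (int) sample.charAt(0), cs.name(), byteCount(sample, cs));
            }
        }
    }
}
```

UTF-16BE is used here rather than UTF-16 because encoding with the plain UTF-16 charset also writes a byte-order mark, which would inflate the counts.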
An instance of the java.nio.charset.Charset class represents a character set and a character-encoding scheme
in a Java program. Examples of some character set names are US-ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE,
and UTF-16.
The process of converting a character into a sequence of bytes based on an encoding scheme is called character
encoding. The process of converting a sequence of bytes into a character based on an encoding scheme is called
decoding.
In NIO, you have the ability to convert a Unicode character to a sequence of bytes and vice versa using an
encoding scheme. The java.nio.charset package provides classes to encode/decode a CharBuffer to a ByteBuffer
and vice versa. An object of the Charset class represents the encoding scheme. The CharsetEncoder class performs
the encoding. The CharsetDecoder class performs the decoding. You can get an object of the Charset class using its
forName() method by passing the name of the character set as its argument.
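As a rough sketch of the lookup described above, the code below obtains a Charset by name and asks it for an encoder and a decoder. The class name CharsetLookupDemo and the lookup() helper with its fallback parameter are hypothetical; forName() itself throws an unchecked exception when the name is illegal or the character set is not supported.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

class CharsetLookupDemo {
    // Returns the named charset, or the fallback if the name is illegal or unsupported
    static Charset lookup(String name, Charset fallback) {
        try {
            return Charset.forName(name);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Charset names are not case-sensitive; name() returns the canonical name
        Charset cs = lookup("utf-8", StandardCharsets.US_ASCII);
        System.out.println("Canonical name: " + cs.name());

        // Each Charset can create its own encoder and decoder
        CharsetEncoder encoder = cs.newEncoder();
        CharsetDecoder decoder = cs.newDecoder();
        System.out.println("Encoder charset: " + encoder.charset().name());
        System.out.println("Decoder charset: " + decoder.charset().name());
    }
}
```

For the character sets that every Java platform must support, the constants in java.nio.charset.StandardCharsets avoid the name lookup altogether.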
The String and InputStreamReader classes support character encoding and decoding. When you use
str.getBytes("UTF-8"), you are encoding the Unicode characters stored in the string object str to a sequence of
bytes using the UTF-8 encoding scheme. When you use the constructor of the String class String(byte[] bytes,
Charset charset) to create a String object, you are decoding the sequence of bytes in the bytes array from the
specified character set to the Unicode character set. You are also decoding a sequence of bytes from an input stream
into Unicode characters when you create an object of the InputStreamReader class using a character set.
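The three conversions mentioned in this paragraph can be sketched as follows. The class StringCodecDemo and its method names are hypothetical. The sketch passes a Charset object to getBytes() instead of a name, which avoids the checked UnsupportedEncodingException that getBytes(String) declares.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class StringCodecDemo {
    // Encodes the characters of a string into UTF-8 bytes
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decodes UTF-8 bytes back into a string using the String constructor
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decodes UTF-8 bytes arriving from an input stream using InputStreamReader
    static String decodeFromStream(byte[] bytes) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            char[] buffer = new char[64];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, count);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = encode("Hello");
        System.out.println("Encoded length: " + bytes.length);
        System.out.println("Decoded by String: " + decode(bytes));
        System.out.println("Decoded by reader: " + decodeFromStream(bytes));
    }
}
```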
For simple encoding and decoding tasks, you can use the encode() and decode() methods of the Charset class.
Let's encode a sequence of characters in the string Hello stored in a character buffer and decode it using the UTF-8
encoding scheme. The snippet of code to achieve this is as follows:
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

// Get a Charset object for UTF-8 encoding
Charset cs = Charset.forName("UTF-8");
// Character buffer to be encoded
CharBuffer cb = CharBuffer.wrap("Hello");
// Encode character buffer into a byte buffer
ByteBuffer encodedData = cs.encode(cb);
// Decode the byte buffer back to a character buffer
CharBuffer decodedData = cs.decode(encodedData);
The encode() and decode() methods of the Charset class are easy to use. However, they cannot be used in all
situations. Each call processes one complete buffer, so the entire input must be available in advance. Sometimes
you do not know the data to be encoded/decoded in advance, for example when it arrives in pieces.
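To illustrate the limitation, here is a minimal sketch that decodes UTF-8 bytes arriving in two chunks, where a two-byte character is split across the chunk boundary. Decoding each chunk separately with Charset.decode() loses the split character, whereas a CharsetDecoder, the class described in the text that follows, keeps the incomplete bytes until the rest arrives. The class ChunkedDecodeDemo, its method names, and the fixed 16-element buffers are assumptions made for this illustration; real code would size the buffers to its data and check the CoderResult returned by each decode() call.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

class ChunkedDecodeDemo {
    // Decodes every chunk on its own; a character split across chunks is seen as malformed
    static String decodeEachChunk(byte[][] chunks) {
        StringBuilder sb = new StringBuilder();
        for (byte[] chunk : chunks) {
            sb.append(StandardCharsets.UTF_8.decode(ByteBuffer.wrap(chunk)));
        }
        return sb.toString();
    }

    // Feeds the chunks to one decoder, carrying incomplete byte sequences over to the next chunk
    static String decodeAsStream(byte[][] chunks) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.allocate(16);   // assumes small chunks
        CharBuffer out = CharBuffer.allocate(16);  // assumes short output

        for (byte[] chunk : chunks) {
            in.put(chunk);
            in.flip();
            decoder.decode(in, out, false); // false: more input may follow
            in.compact();                   // keep any bytes not yet consumed
        }
        in.flip();
        decoder.decode(in, out, true);      // true: no more input
        decoder.flush(out);
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        // "h" followed by U+00E9, whose UTF-8 form 0xC3 0xA9 is split across the two chunks
        byte[][] chunks = {
            {(byte) 'h', (byte) 0xC3},
            {(byte) 0xA9}
        };
        System.out.println("Chunk by chunk: " + decodeEachChunk(chunks));
        System.out.println("As one stream:  " + decodeAsStream(chunks));
    }
}
```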
The CharsetEncoder and CharsetDecoder classes provide much more control over the encoding and decoding
process. They accept the input to be encoded or decoded one chunk at a time. The encode() and decode() methods of
the Charset class allocate and return the encoded and decoded buffers for you, whereas CharsetEncoder and
CharsetDecoder let you supply your own buffers for the input and output data. This power comes with a little
complexity. If you want more powerful encoding/decoding, you will need to use the following five classes instead
of just the Charset class:
Charset
CharsetEncoder
 