Character Set
A character is not always stored in one byte. The number of bytes used to store a character depends on the coded
character set and the character-encoding scheme. A coded character set is a mapping between a set of abstract
characters and a set of integers. A character-encoding scheme is a mapping between a coded character set and a set
of octet sequences. Please refer to Appendix A in Beginning Java Fundamentals (ISBN: 978-1-4302-6652-5) for more
details on character sets and character encodings.
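To make the first point concrete, the following sketch counts the bytes that a few characters occupy under different encoding schemes. The class name EncodedLengthDemo and the helper encodedLength() are hypothetical names chosen for this illustration, and the java.nio.charset.StandardCharsets constants (available since Java 7) are used here only for convenience.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original text
class EncodedLengthDemo {
    // Returns the number of bytes needed to store the text in the given charset
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String a = "A";          // U+0041, Latin capital letter A
        String eAcute = "\u00E9"; // U+00E9, Latin small letter e with acute
        String euro = "\u20AC";   // U+20AC, euro sign

        System.out.println(encodedLength(a, StandardCharsets.US_ASCII));        // 1
        System.out.println(encodedLength(eAcute, StandardCharsets.ISO_8859_1)); // 1
        System.out.println(encodedLength(eAcute, StandardCharsets.UTF_8));      // 2
        System.out.println(encodedLength(euro, StandardCharsets.UTF_8));        // 3
        System.out.println(encodedLength(euro, StandardCharsets.UTF_16BE));     // 2
    }
}
```

The same abstract character can therefore take one, two, or three bytes, depending only on the encoding scheme that is applied.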
An instance of the java.nio.charset.Charset class represents a character set and a character-encoding scheme
in a Java program. Examples of some character set names are US-ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE,
and UTF-16.
The process of converting a character into a sequence of bytes based on an encoding scheme is called character
encoding. The process of converting a sequence of bytes into a character based on an encoding scheme is called
decoding.
In NIO, you have the ability to convert a Unicode character to a sequence of bytes and vice versa using an
encoding scheme. The java.nio.charset package provides classes to encode/decode a CharBuffer to a ByteBuffer
and vice versa. An object of the Charset class represents the encoding scheme. The CharsetEncoder class performs
the encoding. The CharsetDecoder class performs the decoding. You can get an object of the Charset class using its
forName() method by passing the name of the character set as its argument.
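The relationship among the three classes can be sketched as follows. This is a minimal illustration, not the only way to use them: the class name EncoderDecoderDemo and the method roundTrip() are hypothetical, and the encoder and decoder are obtained with the newEncoder() and newDecoder() methods of the Charset class, which the text above does not mention by name. Checked coding errors are wrapped in an unchecked exception only to keep the sketch short.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;

// Hypothetical demo class, not part of the original text
class EncoderDecoderDemo {
    // Encodes the text with the named character set and decodes it back
    static String roundTrip(String text, String charsetName) {
        // Throws UnsupportedCharsetException if the name is not supported
        Charset cs = Charset.forName(charsetName);
        CharsetEncoder encoder = cs.newEncoder();
        CharsetDecoder decoder = cs.newDecoder();
        try {
            // CharBuffer -> ByteBuffer
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(text));
            // ByteBuffer -> CharBuffer
            CharBuffer chars = decoder.decode(bytes);
            return chars.toString();
        } catch (CharacterCodingException e) {
            // Raised for malformed input or characters the charset cannot represent
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Charset.isSupported("UTF-8")); // check a name before using it
        System.out.println(roundTrip("Hello", "UTF-8"));
        System.out.println(roundTrip("Hello", "UTF-16LE"));
    }
}
```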
The String and InputStreamReader classes support character encoding and decoding. When you use
str.getBytes("UTF-8"), you are encoding the Unicode characters stored in the string object str to a sequence of
bytes using the UTF-8 encoding scheme. When you use the constructor of the String class String(byte[] bytes,
Charset charset) to create a String object, you are decoding the sequence of bytes in the bytes array from the
specified character set to the Unicode character set. You are also decoding a sequence of bytes from an input stream
into Unicode characters when you create an object of the InputStreamReader class using a character set.
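The three cases described above can be sketched in one small program. The class and method names (StringCodecDemo, decodeWithString(), decodeWithReader()) are hypothetical. Note one deliberate difference from the text: the sketch calls the getBytes(Charset) overload rather than getBytes("UTF-8"), because the string-name overload declares a checked UnsupportedEncodingException; an in-memory ByteArrayInputStream stands in for a real input stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original text
class StringCodecDemo {
    // Decodes bytes using the String(byte[] bytes, Charset charset) constructor
    static String decodeWithString(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Decodes bytes by reading them through an InputStreamReader
    static String decodeWithReader(byte[] bytes, Charset charset) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String str = "caf\u00E9"; // four characters; the last one is outside US-ASCII
        // Encoding: Unicode characters -> UTF-8 bytes
        byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length); // 5

        // Decoding: UTF-8 bytes -> Unicode characters
        System.out.println(decodeWithString(bytes, StandardCharsets.UTF_8).equals(str)); // true
        System.out.println(decodeWithReader(bytes, StandardCharsets.UTF_8).equals(str)); // true
    }
}
```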
For simple encoding and decoding tasks, you can use the encode() and decode() methods of the Charset class.
Let's encode the characters of the string Hello, stored in a character buffer, into bytes and then decode them
back using the UTF-8 encoding scheme. The snippet of code to achieve this is as follows:
// Imports needed by this snippet
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
// Get a Charset object for UTF-8 encoding
Charset cs = Charset.forName("UTF-8");
// Character buffer to be encoded
CharBuffer cb = CharBuffer.wrap("Hello");
// Encode character buffer into a byte buffer
ByteBuffer encodedData = cs.encode(cb);
// Decode the byte buffer back to a character buffer
CharBuffer decodedData = cs.decode(encodedData);
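To inspect what the snippet above produces, it can be wrapped in a small program such as the following sketch. The class name CharsetRoundTrip and the helper toHex() are hypothetical additions for illustration; the helper reads from a duplicate of the buffer so that the position of the original buffer is not consumed before decoding.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the original text
class CharsetRoundTrip {
    // Formats the remaining bytes of a buffer as hex without moving its position
    static String toHex(ByteBuffer buffer) {
        StringBuilder sb = new StringBuilder();
        ByteBuffer view = buffer.duplicate(); // independent position and limit
        while (view.hasRemaining()) {
            sb.append(String.format("%02X ", view.get()));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        Charset cs = Charset.forName("UTF-8");
        CharBuffer cb = CharBuffer.wrap("Hello");

        // Encode character buffer into a byte buffer
        ByteBuffer encodedData = cs.encode(cb);
        System.out.println(toHex(encodedData)); // 48 65 6C 6C 6F

        // Decode the byte buffer back to a character buffer
        CharBuffer decodedData = cs.decode(encodedData);
        System.out.println(decodedData); // Hello
    }
}
```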
The encode() and decode() methods of the Charset class are easy to use. However, they cannot be used in all
situations. They require the complete input to be available in advance, and sometimes you do not know the data to
be encoded/decoded in advance. The CharsetEncoder and CharsetDecoder classes provide much more power during the
encoding and decoding process. They accept a chunk of input to be encoded or decoded. The encode() and decode()
methods of the Charset class return the encoded and decoded buffers to you. However, CharsetEncoder and
CharsetDecoder let you use your own buffers for input and output data. The power comes with a little complexity!
If you want more powerful encoding/decoding, you will need to use the following five classes instead of just the
Charset class:
• Charset
• CharsetEncoder
•