3.1 Encoding Information
What if a client program needs to obtain quote information from a vendor program? The two
programs must agree on how the information contained in the ItemQuote will be represented
as a sequence of bytes “on the wire”—sent over a TCP connection or carried in a UDP datagram.
(Note that everything in this chapter also applies if the “wire” is a file that is written by one
program and read by another.) In our example, the information to be represented consists of
primitive types (integers, booleans) and a character string.
Transmitting information via the network in Java requires that it be written to an
OutputStream (of a Socket) or encapsulated in a DatagramPacket (which is then sent via a
DatagramSocket). However, the only data types to which these operations can be applied are bytes
and arrays of bytes. As a strongly typed language, Java requires that other types (String, int,
and so on) be explicitly converted to these transmittable types. Fortunately, the language has
a number of built-in facilities that make such conversions more convenient. Before dealing
with the specifics of our example, however, we focus on some basic concepts of representing
information as sequences of bytes for transmission.
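As a minimal sketch of the explicit conversion described above, the following turns an int and a String into one byte array. The class name ToBytesSketch and the method encode are hypothetical, a ByteArrayOutputStream stands in for the OutputStream of a Socket, and the choice of US-ASCII for the text is an assumption made only for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ToBytesSketch {
    // Hypothetical helper: converts an int and a String to the bytes
    // that would be written to a stream or placed in a datagram.
    static byte[] encode(int number, String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            // writeInt emits 4 bytes, high-order byte first
            out.writeInt(number);
            // A String must be converted to bytes under a named encoding
            // (US-ASCII is an assumption here)
            out.write(text.getBytes(StandardCharsets.US_ASCII));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // prints [0, 0, 1, 2, 72, 105]
        System.out.println(Arrays.toString(encode(258, "Hi")));
    }
}
```

The resulting array could equally be passed to the write method of a socket's OutputStream or wrapped in a DatagramPacket; only the byte array itself is transmittable.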
3.1.1 Text
Old-fashioned text, that is, strings of printable (displayable) characters, is perhaps the most
common form of information representation. When the information to be transmitted is natural
language, text is the most natural representation. Text is convenient for other forms of
information because humans can easily deal with it when printed or displayed; numbers, for
example, can be represented as strings of decimal digits.
To send text, the string of characters is translated into a sequence of bytes according
to a character set. The canonical example of a character encoding system is the venerable
American Standard Code for Information Interchange (ASCII), which defines a one-to-one
mapping between a set of the most commonly used printable characters in English, and binary
values. For example, in ASCII the digit 0 is represented by the byte value 48, 1 by 49, and so
on up to 9 , which is represented by the byte value 57. ASCII is adequate for applications that
only need to exchange English text. As the economy becomes increasingly globalized, however,
applications need to deal with other languages, including many that use characters for which
ASCII has no encoding, and even some (e.g., Chinese) that use more than 256 characters and
thus require more than 1 byte per character to encode. Encodings for the world's languages are
defined by companies and by standards bodies. Unicode is the most widely recognized such
character encoding; it is maintained by the Unicode Consortium and kept synchronized with the
ISO/IEC 10646 standard of the International Organization for Standardization (ISO).
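The ASCII mapping for the digits can be checked directly. The sketch below is illustrative only; the class name AsciiDigitsSketch and the helper toAscii are hypothetical. It also shows what happens to a character for which ASCII has no encoding: String.getBytes substitutes the charset's replacement byte, which for US-ASCII is the question mark (byte value 63).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiDigitsSketch {
    // Hypothetical helper: translates a string to bytes using ASCII.
    static byte[] toAscii(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // The digits 0 through 9 map to the byte values 48 through 57.
        // prints [48, 49, 50, 51, 52, 53, 54, 55, 56, 57]
        System.out.println(Arrays.toString(toAscii("0123456789")));

        // U+00E9 (e with acute accent) has no ASCII encoding, so the
        // information is lost and '?' is sent instead.
        // prints [63]
        System.out.println(Arrays.toString(toAscii("\u00e9")));
    }
}
```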
Fortunately, Java provides good support for internationalization, in several ways. First,
Java uses Unicode to represent characters internally. A Java char is a 16-bit (2-byte) Unicode
code unit, which supports a much larger set of characters than ASCII; characters that do not fit
in 16 bits are represented as a pair of char values. When this was written, the
Unicode standard defined codes for over 49,000 characters and covered “the principal
written languages and symbol systems of the world.” [21] Second, Java supports various other
standard encodings and provides a clean separation between its internal representation and
the encoding used when characters are input or output.
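To illustrate that separation, the sketch below encodes the same internal String under two standard encodings and decodes it again. The class and method names are hypothetical, and the choice of UTF-8 and UTF-16BE is an assumption for the sake of the example; any charset that both programs agree on would serve.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingRoundTripSketch {
    // Hypothetical helpers: the encoding is chosen only at the boundary,
    // when characters leave or enter the program.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    static String decode(byte[] wire, Charset charset) {
        return new String(wire, charset);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters
        Charset[] charsets = {StandardCharsets.UTF_8, StandardCharsets.UTF_16BE};
        for (Charset charset : charsets) {
            byte[] wire = encode(text, charset);
            boolean same = decode(wire, charset).equals(text);
            // UTF-8 needs 6 bytes for this string, UTF-16BE needs 4
            System.out.println(charset + ": " + wire.length + " bytes, round trip ok = " + same);
        }
    }
}
```

The String is the same in both cases; only its representation as bytes differs. A receiver that decodes with a different encoding from the one the sender used will generally not recover the original text.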