In-Depth Information (continued)
standardization process started with standards such as EBCDIC (Extended Binary
Coded Decimal Interchange Code), a character encoding scheme used by many IBM
mainframes decades ago, and ASCII (American Standard Code for Information
Interchange), which quickly became the standard in the early days of computing.
However, as time progressed, many vendors were confronted with the limitations of this American-centric standard, as no mapping was provided to represent accented characters (such as è, ï, and so on) used in non-English languages.
As such, the International Organization for Standardization (ISO) and the
International Electrotechnical Commission (IEC) proposed the ISO/IEC 8859
standard, a collection of character encodings to support European, Cyrillic, Greek,
Turkish, and other languages. As ASCII used only seven out of eight bits provided
in a byte (the remaining bit was sometimes used for parity checking or by vendors to map custom characters), the implementation of this standard was
simple. By using the eighth bit, the range of possible characters that could be represented doubled and could thus include the accented characters. Moreover, this
also ensured that all the ASCII characters could still retain their original position,
which enabled backward-compatibility with existing text files.
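To see this backward compatibility in action, consider the following minimal sketch (it assumes a Java 7 or later runtime, which provides the java.nio.charset.StandardCharsets class; the class name Latin1Demo is just illustrative). A plain ASCII character encodes to the same byte value under both encodings, while an accented character uses the eighth bit:

import java.nio.charset.StandardCharsets;

public class Latin1Demo {
    public static void main(String[] args) {
        // An ASCII character keeps the same code value in ISO-8859-1 (Latin-1)
        byte[] ascii  = "A".getBytes(StandardCharsets.US_ASCII);
        byte[] latin1 = "A".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("'A' as ASCII:      " + ascii[0]);             // 65
        System.out.println("'A' as ISO-8859-1: " + latin1[0]);            // 65

        // An accented character uses the eighth bit, so its value lies above 127
        byte[] accented = "\u00E8".getBytes(StandardCharsets.ISO_8859_1); // the character è
        System.out.println("'è' as ISO-8859-1: " + (accented[0] & 0xFF)); // 232
    }
}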
Still, a downside of this approach was that users had to have the correct code
page installed and selected on their system in order to read text files correctly.
If I create a text file following the ISO/IEC 8859-1 convention (Western Latin alphabet), the result will look wrong when you read the file using ISO/IEC 8859-7 (Greek). In addition, many languages were still not covered by the extending standards. Consider, for example, Asian regions, where a completely different
standardization process had been followed thus far (the Chinese National Standard
11643). As such, in recent years, the ISO and IEC set out to create another standard
called the Universal Character Set. This standard aims to represent all characters
from the many languages of the world. The implementation comes in the form of
the “Unicode” standard, the latest version of which contains a repertoire of more
than 100,000 characters covering 100 scripts and various symbols. This is the standard now in use by all modern operating systems and throughout the Web.
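The code page problem described above can be illustrated with a small sketch (the class name is illustrative, and it assumes your JDK ships with the ISO-8859-7 charset, as standard Oracle and OpenJDK builds do). The same raw bytes are decoded once as Western Latin and once as Greek:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CodePageDemo {
    public static void main(String[] args) {
        // The raw bytes of "très" ("tr\u00E8s"), written using the ISO-8859-1 code page
        byte[] raw = "tr\u00E8s".getBytes(StandardCharsets.ISO_8859_1);

        // Reading the same bytes back with two different code pages
        String asLatin1 = new String(raw, StandardCharsets.ISO_8859_1);
        String asGreek  = new String(raw, Charset.forName("ISO-8859-7"));

        System.out.println(asLatin1); // très
        System.out.println(asGreek);  // the byte 0xE8 is now shown as a Greek letter (theta)
    }
}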
Various encodings have been defined in order to map this wealth of characters to
raw bits and bytes in a file. UTF-8 is the most common encoding format and it uses
one byte for any ASCII character, all of which have the same code values in both
UTF-8 and ASCII encoding (which is great news in terms of compatibility). For
other characters, up to four bytes can be used. (As such, UTF-8 is called a “variable-width” encoding.) Next, UCS-2 uses a two-byte code unit for each character,
and thus cannot encode every character in the Unicode standard. UTF-16 extends
UCS-2, using two-byte code units for the characters that were representable in
UCS-2 and four-byte code units to handle each of the additional characters.
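To make these widths concrete, the following sketch (again assuming Java 7 or later; the class name is illustrative) prints how many bytes a few characters occupy. The musical symbol U+1D11E is written as a surrogate pair because it falls outside the range UCS-2 can represent:

import java.nio.charset.StandardCharsets;

public class EncodingWidthDemo {
    public static void main(String[] args) {
        String ascii  = "A";             // plain ASCII character
        String accent = "\u00E8";        // the accented character è
        String clef   = "\uD834\uDD1E";  // U+1D11E (musical symbol), outside the UCS-2 range

        // UTF-8 is variable width: one byte for ASCII, more bytes for other characters
        System.out.println("A       in UTF-8 : " + ascii.getBytes(StandardCharsets.UTF_8).length + " byte(s)");    // 1
        System.out.println("è       in UTF-8 : " + accent.getBytes(StandardCharsets.UTF_8).length + " byte(s)");   // 2
        System.out.println("U+1D11E in UTF-8 : " + clef.getBytes(StandardCharsets.UTF_8).length + " byte(s)");     // 4

        // UTF-16 uses two-byte code units; U+1D11E needs two of them (four bytes)
        System.out.println("A       in UTF-16: " + ascii.getBytes(StandardCharsets.UTF_16BE).length + " byte(s)"); // 2
        System.out.println("U+1D11E in UTF-16: " + clef.getBytes(StandardCharsets.UTF_16BE).length + " byte(s)");  // 4
    }
}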
This might all seem a bit overwhelming, but the good news is that in recent years,
thanks to the Unicode Consortium, things have become much simpler. The key
takeaways to keep in mind are: