In-Depth Information (continued)
standardization process started with standards such as EBCDIC (Extended Binary
Coded Decimal Interchange Code), a character encoding scheme used by many IBM
mainframes decades ago, and ASCII (American Standard Code for Information
Interchange), which quickly became the standard in the early days of computing.
However, as time progressed, many vendors were confronted with the limitations of this American-centric standard, as no mapping was provided to represent accented characters (such as è, ï, and so on) used in non-English languages.
As such, the International Organization for Standardization (ISO) and the
International Electrotechnical Commission (IEC) proposed the ISO/IEC 8859
standard, a collection of character encodings to support European, Cyrillic, Greek,
Turkish, and other languages. As ASCII used only seven out of eight bits provided
in a byte (the remaining bit was sometimes used for parity checking or by vendors to map custom characters), the implementation of this standard was
simple. By using the eighth bit, the range of possible characters that could be represented doubled and could thus include the accented characters. Moreover, this
also ensured that all the ASCII characters could still retain their original position,
which enabled backward-compatibility with existing text files.
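To see this backward compatibility in action, consider the following minimal sketch (it assumes a Java 7 or later runtime, which provides the java.nio.charset.StandardCharsets class; the class name Latin1Demo is just illustrative). A plain ASCII character encodes to the same byte value under both encodings, while an accented character uses the eighth bit:

import java.nio.charset.StandardCharsets;

public class Latin1Demo {
    public static void main(String[] args) {
        // An ASCII character keeps the same code value in ISO-8859-1 (Latin-1)
        byte[] ascii  = "A".getBytes(StandardCharsets.US_ASCII);
        byte[] latin1 = "A".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("'A' as ASCII:      " + ascii[0]);             // 65
        System.out.println("'A' as ISO-8859-1: " + latin1[0]);            // 65

        // An accented character uses the eighth bit, so its value lies above 127
        byte[] accented = "\u00E8".getBytes(StandardCharsets.ISO_8859_1); // the character è
        System.out.println("'è' as ISO-8859-1: " + (accented[0] & 0xFF)); // 232
    }
}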
Still, a downside of this approach was that users had to have the correct code
page installed and selected on their system in order to read text files correctly.
If I create a text file following the ISO/IEC 8859-1 convention (Western Latin alphabet), the result will look wrong when you read the file using ISO/IEC 8859-7 (Greek). In addition, many languages were still not covered by the extending standards. Consider, for example, Asian regions, where a completely different
standardization process had been followed thus far (the Chinese National Standard
11643). As such, in recent years, the ISO and IEC set out to create another standard
called the Universal Character Set. This standard aims to represent all characters
from the many languages of the world. The implementation comes in the form of
the “Unicode” standard, the latest version of which contains a repertoire of more
than 100,000 characters covering 100 scripts and various symbols. This is the standard now in use by all modern operating systems and throughout the Web.
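The code page problem described above can be illustrated with a small sketch (the class name is illustrative, and it assumes your JDK ships with the ISO-8859-7 charset, as standard Oracle and OpenJDK builds do). The same raw bytes are decoded once as Western Latin and once as Greek:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CodePageDemo {
    public static void main(String[] args) {
        // The raw bytes of "très" ("tr\u00E8s"), written using the ISO-8859-1 code page
        byte[] raw = "tr\u00E8s".getBytes(StandardCharsets.ISO_8859_1);

        // Reading the same bytes back with two different code pages
        String asLatin1 = new String(raw, StandardCharsets.ISO_8859_1);
        String asGreek  = new String(raw, Charset.forName("ISO-8859-7"));

        System.out.println(asLatin1); // très
        System.out.println(asGreek);  // the byte 0xE8 is now shown as a Greek letter (theta)
    }
}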
Various encodings have been defined in order to map this wealth of characters to
raw bits and bytes in a file. UTF-8 is the most common encoding format and it uses
one byte for any ASCII character, all of which have the same code values in both
UTF-8 and ASCII encoding (which is great news in terms of compatibility). For
other characters, up to four bytes can be used. (As such, UTF-8 is called a “variable-width” encoding.) Next, UCS-2 uses a two-byte code unit for each character,
and thus cannot encode every character in the Unicode standard. UTF-16 extends
UCS-2, using two-byte code units for the characters that were representable in
UCS-2 and four-byte code units to handle each of the additional characters.
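To make these widths concrete, the following sketch (again assuming Java 7 or later; the class name is illustrative) prints how many bytes a few characters occupy. The musical symbol U+1D11E is written as a surrogate pair because it falls outside the range UCS-2 can represent:

import java.nio.charset.StandardCharsets;

public class EncodingWidthDemo {
    public static void main(String[] args) {
        String ascii  = "A";             // plain ASCII character
        String accent = "\u00E8";        // the accented character è
        String clef   = "\uD834\uDD1E";  // U+1D11E (musical symbol), outside the UCS-2 range

        // UTF-8 is variable width: one byte for ASCII, more bytes for other characters
        System.out.println("A       in UTF-8 : " + ascii.getBytes(StandardCharsets.UTF_8).length + " byte(s)");    // 1
        System.out.println("è       in UTF-8 : " + accent.getBytes(StandardCharsets.UTF_8).length + " byte(s)");   // 2
        System.out.println("U+1D11E in UTF-8 : " + clef.getBytes(StandardCharsets.UTF_8).length + " byte(s)");     // 4

        // UTF-16 uses two-byte code units; U+1D11E needs two of them (four bytes)
        System.out.println("A       in UTF-16: " + ascii.getBytes(StandardCharsets.UTF_16BE).length + " byte(s)"); // 2
        System.out.println("U+1D11E in UTF-16: " + clef.getBytes(StandardCharsets.UTF_16BE).length + " byte(s)");  // 4
    }
}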
This might all seem a bit overwhelming, but the good news is that in recent years,
thanks to the Unicode Consortium, things have become much simpler. The key
takeaways to keep in mind are: