Did You Know?
ASCII and Unicode
We store data on a computer as binary numbers (sequences of 0s and 1s). To store
textual data, we need an encoding scheme that will tell us what sequence of 0s and
1s to use for any given character. Think of it as a giant secret decoder ring that says
things like, “If you want to store a lowercase 'a,' use the sequence 01100001.”
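In Java you can peek at this encoding directly: casting a char to an int yields the numeric code that is actually stored. The short sketch below is only an illustration (the class name CharCode is made up for this example):

public class CharCode {
    public static void main(String[] args) {
        char letter = 'a';
        int code = (int) letter;                            // the numeric code stored for 'a'
        System.out.println(code);                           // prints 97
        System.out.println(Integer.toBinaryString(code));   // prints 1100001 (the leading 0 is dropped)
    }
}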
In the early 1960s IBM developed an encoding scheme called EBCDIC that
worked well with the company's punched cards, which had been in use for
decades before computers were even invented. But it soon became clear that
EBCDIC wasn't a convenient encoding scheme for computer programmers.
There were gaps in the sequence that placed characters like 'i' and 'j' far apart
in the encoding even though one follows directly after the other in the alphabet.
In 1967 the American Standards Association published a scheme known as ASCII
(pronounced “AS-kee”) that has been in common use ever since. The acronym is
short for “American Standard Code for Information Interchange.” In its original form,
ASCII defined 128 characters that each could be stored with 7 bits of data.
The biggest problem with ASCII is that it is an American code. There are many
characters in common use in other countries that were not included in ASCII. For
example, the British pound (£) and the Spanish variant of the letter n (ñ) are not
included in the standard 128 ASCII characters. Various attempts have been made
to extend ASCII, doubling it to 256 characters so that it can include many of these
special characters. However, it turns out that even 256 characters is simply not
enough to capture the incredible diversity of human communication.
Around the time that Java was created, a consortium of software professionals
introduced a new standard for encoding characters known as Unicode. They decided
that the 7 bits of standard ASCII and the 8 bits of extended ASCII were simply not
big enough and chose not to set a limit on how many bits they might use for encoding
characters. At the time of this writing, the consortium has identified over 100,000
characters, which require a little over 16 bits to store. Unicode includes the characters
used in most modern languages and even some ancient languages. Egyptian hieroglyphs
were added in 2007, although it still does not include Mayan hieroglyphs, and
the consortium has rejected a proposal to include Klingon characters.
The designers of Java used Unicode as the standard for the type char, which
means that Java programs are capable of manipulating a full range of characters.
Fortunately, the Unicode Consortium decided to incorporate the ASCII encod-
ings, so ASCII can be seen as a subset of Unicode. If you are curious about the
actual ordering of characters in ASCII, type “ASCII table” into your favorite
search engine and you will find millions of hits to explore.
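To see this subset relationship in action, here is a small sketch (the class name UnicodeSubset is made up for this example) that stores both an ASCII character and a non-ASCII Unicode character in variables of type char and prints their numeric codes:

public class UnicodeSubset {
    public static void main(String[] args) {
        char ascii = 'a';            // within the original 128 ASCII characters
        char spanish = '\u00F1';     // ñ, outside ASCII but included in Unicode
        System.out.println((int) ascii);     // prints 97, the same code ASCII assigns
        System.out.println((int) spanish);   // prints 241
    }
}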
 