ASCII
One widely used code is called ASCII (American Standard Code for Information
Interchange). Each ASCII character has 7 bits, allowing for 128 characters in
all. However, because computers are byte oriented, each ASCII character is
normally stored in a separate byte. Figure 2-44 shows the ASCII code. Codes 0 to
1F (hexadecimal) are control characters and do not print. Codes from 128 to 255
are not part of ASCII, but the IBM PC defined them to be special characters like
smiley faces, and most computers still support them.
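The ranges described above can be checked with a few lines of Python. This is only a minimal sketch: the function name classify_ascii is hypothetical, treating DEL (7F hexadecimal) as a control character is an addition not stated in the text, and the space character is counted as printing here for simplicity.

```python
def classify_ascii(code: int) -> str:
    """Classify a one-byte code value (classify_ascii is a hypothetical helper)."""
    if not 0 <= code <= 0xFF:
        raise ValueError("expected a value that fits in one byte")
    if code > 0x7F:
        return "not ASCII"   # 128 to 255: outside the 7-bit range
    if code <= 0x1F or code == 0x7F:
        return "control"     # 0 to 1F hex; DEL (7F hex) is an assumption added here
    return "printing"        # 20 to 7E hex, with space counted as printing

# Every ASCII character fits in 7 bits, so the top bit of its byte is 0.
encoded = "Hi!".encode("ascii")
print([format(b, "08b") for b in encoded])

print(classify_ascii(0x07), classify_ascii(ord("A")), classify_ascii(200))
```

Each string in the first printed list starts with 0, which is the unused eighth bit of the byte holding a 7-bit character.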
Many of the ASCII control characters are intended for data transmission. For
example, a message might consist of an SOH (Start of Header) character, a header,
an STX (Start of Text) character, the text itself, an ETX (End of Text) character,
and then an EOT (End of Transmission) character. In practice, however, the
messages sent over telephone lines and networks are formatted quite differently,
so the ASCII transmission control characters are not used much anymore.
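The message layout in the example above can be sketched as follows. The control code values (SOH = 01, STX = 02, ETX = 03, EOT = 04 hexadecimal) are the standard ASCII assignments; the function names frame_message and parse_message are hypothetical, and the sketch assumes the header and text contain no control characters themselves.

```python
SOH, STX, ETX, EOT = 0x01, 0x02, 0x03, 0x04  # standard ASCII control codes

def frame_message(header: str, text: str) -> bytes:
    """Build SOH + header + STX + text + ETX + EOT (hypothetical helper)."""
    return (bytes([SOH]) + header.encode("ascii")
            + bytes([STX]) + text.encode("ascii")
            + bytes([ETX, EOT]))

def parse_message(frame: bytes) -> tuple[str, str]:
    """Split a frame built by frame_message back into (header, text)."""
    if frame[:1] != bytes([SOH]) or frame[-2:] != bytes([ETX, EOT]):
        raise ValueError("malformed frame")
    # Everything between SOH and ETX is header, STX, text.
    header, sep, text = frame[1:-2].partition(bytes([STX]))
    if not sep:
        raise ValueError("missing STX")
    return header.decode("ascii"), text.decode("ascii")

frame = frame_message("to:bob", "hello")
print(frame)
print(parse_message(frame))
```

A real protocol would also need a way to escape control codes that appear inside the payload, which is omitted here.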
The ASCII printing characters are straightforward. They include the upper-
and lowercase letters, digits, punctuation marks, and a few math symbols.
Unicode
The computer industry grew up mostly in the U.S., which led to the ASCII
character set. ASCII is fine for English but less fine for other languages. French
needs accents (e.g., système); German needs diacritical marks (e.g., für); and so
on. Some European languages have a few letters not found in ASCII, such as the
German ß and the Danish ø. Some languages have entirely different alphabets (e.g.,
Russian and Arabic), and a few languages have no alphabet at all (e.g., Chinese).
As computers spread to the four corners of the globe and software vendors want to
sell products in countries where most users do not speak English, a different
character set is needed.
The first attempt at adapting ASCII was IS 646, which kept the 7-bit code but
allowed national variants to replace a few punctuation characters with local
letters. The next attempt was IS 8859, which added another 128 characters to
ASCII, making it an 8-bit code, and introduced the concept of a code page, a set
of 256 characters for a particular language or group of languages. IS 8859-1 is
Latin-1; its additional characters are mostly Latin letters with accents and
diacritical marks. IS 8859-2 handles the Latin-based Slavic languages (e.g.,
Czech and Polish) as well as Hungarian. IS 8859-3 contains the characters needed
for Turkish, Maltese, Esperanto, and Galician, and so on. The trouble with the
code-page approach is that the software has to keep track of which page it is
currently on, it is impossible to mix languages over pages, and the scheme does
not cover Japanese and Chinese at all.
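The code-page problem can be seen directly with Python's built-in codecs. This is a minimal sketch: the byte value E8 (hexadecimal) and the helper name fits_in_page are chosen for illustration, and "iso-8859-1" and "iso-8859-2" are the standard Python codec names for the first two pages.

```python
raw = bytes([0xE8])  # one byte above the 7-bit ASCII range

# The same byte means a different character depending on the active page.
as_latin1 = raw.decode("iso-8859-1")   # e with grave accent, U+00E8
as_latin2 = raw.decode("iso-8859-2")   # c with caron, U+010D
print(hex(ord(as_latin1)), hex(ord(as_latin2)))

def fits_in_page(text: str, page: str) -> bool:
    """Return True if every character of text exists in the given code page."""
    try:
        text.encode(page)
        return True
    except UnicodeEncodeError:
        return False

print(fits_in_page("système", "iso-8859-1"))    # French alone fits in Latin-1
print(fits_in_page("système č", "iso-8859-1"))  # mixing in a Czech letter does not
print(fits_in_page("中文", "iso-8859-1"))        # Chinese is not covered at all
```

Nothing in the byte itself says which page it belongs to, so the software has to carry that information separately.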
A group of computer companies decided to solve this problem by forming a
consortium to create a new system, called Unicode, and getting it proclaimed an
International Standard (IS 10646). Unicode is now supported by programming
languages (e.g., Java), operating systems (e.g., Windows), and many applications.
 