Java Reference
In-Depth Information
Writers and Readers
Java's stream classes are good for streaming sequences of bytes, but they're not good for
streaming sequences of characters because bytes and characters are two different things:
a byte represents an 8-bit data item and a character represents a 16-bit data item. Also,
Java's char and String types naturally handle characters instead of bytes.
More importantly, byte streams have no knowledge of character sets (sets of
mappings between integer values [known as code points] and symbols, such as Unicode)
and their character encodings (mappings between the members of a character set and
sequences of bytes that encode these characters for efficiency, such as UTF-8).
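The difference between a code point and its encoded bytes can be seen in a minimal sketch such as the following. The class name EncodingDemo and the helper utf8Length() are hypothetical names chosen for illustration; only String.codePointAt(), String.getBytes(Charset), and StandardCharsets come from the standard library.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Hypothetical helper: number of bytes UTF-8 needs to encode the text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String cedilla = "\u00e7"; // LATIN SMALL LETTER C WITH CEDILLA

        // The character set (Unicode) assigns this symbol one code point.
        System.out.println(cedilla.codePointAt(0)); // 231 (0xE7)

        // The character encoding (UTF-8) decides how many bytes represent it.
        System.out.println(utf8Length(cedilla));    // 2 (0xC3 0xA7)
        System.out.println(utf8Length("A"));        // 1 (0x41)
    }
}
```

One code point, two bytes: a byte stream sees only the bytes and cannot tell where one character ends and the next begins.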
If you need to stream characters, you should take advantage of Java's writer and
reader classes, which were designed to support character I/O (they work with char instead
of byte). Furthermore, the writer and reader classes take character encodings into
account.
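As a rough illustration of character I/O that takes an encoding into account, the sketch below writes a string through a writer and reads it back through a reader. It assumes the java.io bridge classes OutputStreamWriter and InputStreamReader layered over in-memory byte streams; the class name WriterReaderDemo and the helpers encode() and decode() are hypothetical and not taken from the text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WriterReaderDemo {
    // Hypothetical helper: writes chars through a Writer, which encodes them to bytes.
    static byte[] encode(String text, Charset cs) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, cs)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray(); // writer is closed (and flushed) at this point
    }

    // Hypothetical helper: reads bytes through a Reader, which decodes them to chars.
    static String decode(byte[] data, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), cs)) {
            int ch;
            while ((ch = reader.read()) != -1) { // read() returns one char, or -1 at end
                sb.append((char) ch);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "fran\u00e7ais"; // 8 characters
        byte[] data = encode(text, StandardCharsets.UTF_8);
        System.out.println(data.length); // 9, because the cedilla needs 2 bytes
        System.out.println(decode(data, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```

The program works with char values throughout; the conversion to and from bytes happens inside the writer and reader according to the chosen encoding.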
A BRIEF HISTORY OF CHARACTER SETS AND CHARACTER ENCODINGS
Early computers and programming languages were created mainly by English-speaking
programmers in countries where English was the native language. They
developed a standard mapping between code points 0 through 127 and the 128 commonly
used characters in the English language (e.g., A-Z). The resulting character set/encoding
was named American Standard Code for Information Interchange (ASCII).
The problem with ASCII is that it's inadequate for most non-English languages. For
example, ASCII doesn't support diacritical marks such as the cedilla used in the
French language. Because a byte can represent a maximum of 256 different characters,
developers around the world started creating different character sets/encodings
that encoded the 128 ASCII characters, but also encoded extra characters to meet the
needs of languages such as French, Greek, or Russian. Over the years, many legacy
(and still important) files have been created whose bytes represent characters defined
by specific character sets/encodings.
The International Organization for Standardization (ISO) and the International
Electrotechnical Commission (IEC) have worked to standardize these eight-bit character
sets/encodings under a joint umbrella standard called ISO/IEC 8859. The result is
a series of substandards named ISO/IEC 8859-1, ISO/IEC 8859-2, and so on. For
example, ISO/IEC 8859-1 (also known as Latin-1) defines a character set/encoding
that consists of ASCII plus the characters covering most Western European countries.
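The gap between ASCII and Latin-1 can be observed from Java itself. The following is a minimal sketch, assuming the standard StandardCharsets.US_ASCII and StandardCharsets.ISO_8859_1 constants; the class name Latin1Demo and the helper firstByte() are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Latin1Demo {
    // Hypothetical helper: first encoded byte of the text, as an unsigned value.
    static int firstByte(String text, Charset cs) {
        return text.getBytes(cs)[0] & 0xFF;
    }

    public static void main(String[] args) {
        String cedilla = "\u00e7"; // c with cedilla, as used in French

        // Latin-1 has a slot for this character in its upper 128 values.
        System.out.println(firstByte(cedilla, StandardCharsets.ISO_8859_1)); // 231

        // ASCII cannot represent it; getBytes() substitutes '?' (63).
        System.out.println(firstByte(cedilla, StandardCharsets.US_ASCII));   // 63
    }
}
```

String.getBytes(Charset) silently replaces unmappable characters, so encoding with a character set that is too small loses information without raising an error.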