Interacting with Filesystems - Beginning Java 7

Writers and Readers

Java's stream classes are good for streaming sequences of bytes, but they're not good for streaming sequences of characters because bytes and characters are two different things: a byte represents an 8-bit data item and a character represents a 16-bit data item. Also, Java's char and String types naturally handle characters instead of bytes.
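The size difference can be checked directly. The following is a minimal sketch, not code from the original text; the class name ByteVersusChar and its helper methods are made up for illustration, while Byte.SIZE and Character.SIZE are standard constants.

```java
public class ByteVersusChar {
    // Number of bits in Java's byte type (hypothetical helper).
    static int bitsPerByte() {
        return Byte.SIZE;
    }

    // Number of bits in Java's char type (hypothetical helper).
    static int bitsPerChar() {
        return Character.SIZE;
    }

    public static void main(String[] args) {
        System.out.println("byte: " + bitsPerByte() + " bits"); // byte: 8 bits
        System.out.println("char: " + bitsPerChar() + " bits"); // char: 16 bits

        // A String is a sequence of char values, not a sequence of bytes.
        String s = "Java";
        System.out.println(s.length() + " chars, first is '" + s.charAt(0) + "'");
    }
}
```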

More importantly, byte streams have no knowledge of character sets (sets of mappings between integer values [known as code points] and symbols, such as Unicode) and their character encodings (mappings between the members of a character set and sequences of bytes that encode these characters for efficiency, such as UTF-8).
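To make the distinction between a code point and its encoded bytes concrete, the sketch below encodes one character under three encodings. It is an illustration under stated assumptions, not the book's code: the class EncodingDemo and the helper toHex are hypothetical names, and the example relies on the StandardCharsets constants available since Java 7.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Encodes text with the given charset and returns the bytes as
    // space-separated hexadecimal pairs (hypothetical helper).
    static String toHex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00E7"; // c with cedilla, code point U+00E7

        // The same code point maps to different byte sequences.
        System.out.println(toHex(text, StandardCharsets.ISO_8859_1)); // E7
        System.out.println(toHex(text, StandardCharsets.UTF_8));      // C3 A7
        System.out.println(toHex(text, StandardCharsets.UTF_16BE));   // 00 E7
    }
}
```

A byte stream that receives these bytes has no way of telling which of the three encodings produced them; that knowledge has to come from elsewhere.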

If you need to stream characters, you should take advantage of Java's writer and reader classes, which were designed to support character I/O (they work with char instead of byte). Furthermore, the writer and reader classes take character encodings into account.
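As a rough sketch of how a writer and a reader apply an encoding, the example below wraps in-memory byte streams with OutputStreamWriter and InputStreamReader. The class name WriterReaderDemo and its two helper methods are assumptions made for illustration; a real program would more likely wrap file streams, and the choice of UTF-8 here is arbitrary.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WriterReaderDemo {
    // Writes text through a Writer, which encodes chars into bytes.
    static byte[] write(String text, Charset charset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            writer.write(text);
        } // closing the writer flushes any buffered bytes
        return bytes.toByteArray();
    }

    // Reads bytes through a Reader, which decodes them back into chars.
    static String read(byte[] data, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader =
                 new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "Fran\u00E7ais";
        byte[] encoded = write(text, StandardCharsets.UTF_8);

        // 8 chars became 9 bytes: the cedilla takes two bytes in UTF-8.
        System.out.println(text.length() + " chars became "
            + encoded.length + " bytes");
        System.out.println("Round trip ok: "
            + text.equals(read(encoded, StandardCharsets.UTF_8)));
    }
}
```

Decoding with a different charset than the one used for writing would generally not reproduce the original text, which is why the encoding is passed explicitly on both sides.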

A BRIEF HISTORY OF CHARACTER SETS AND CHARACTER ENCODINGS

Early computers and programming languages were created mainly by English-speaking programmers in countries where English was the native language. They developed a standard mapping between code points 0 through 127 and the 128 commonly used characters in the English language (e.g., A-Z). The resulting character set/encoding was named American Standard Code for Information Interchange (ASCII).

The problem with ASCII is that it's inadequate for most non-English languages. For example, ASCII doesn't support diacritical marks such as the cedilla used in the French language. Because a byte can represent a maximum of 256 different characters, developers around the world started creating different character sets/encodings that encoded the 128 ASCII characters, but also encoded extra characters to meet the needs of languages such as French, Greek, or Russian. Over the years, many legacy (and still important) files have been created whose bytes represent characters defined by specific character sets/encodings.

The International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) have worked to standardize these eight-bit character sets/encodings under a joint umbrella standard called ISO/IEC 8859. The result is a series of substandards named ISO/IEC 8859-1, ISO/IEC 8859-2, and so on. For example, ISO/IEC 8859-1 (also known as Latin-1) defines a character set/encoding that consists of ASCII plus the characters needed by most Western European languages.
